I jumped at the opportunity to write a chapter for this book, Pro SQL Server 2012 Practices, because of all the fantastic people writing and the topics that were covered. I reviewed Chapter 12 a few weeks ago and I’ve been meaning to get this chapter, Big Data for the SQL Server DBA, read & reviewed but I’ve just been sidetracked. I finally noticed it on my Trello board, weeks out of date, and decided to finally get the job done.
Carlos Bossy (b|t) is a great guy and a part of my SQL Family. We run into each other all the time at SQL Saturday’s and other events. I’ve never listened to him present before, and now, after reading his chapter in the book, I’m very sorry. I’ll fix that at my first opportunity.
Carlos starts off nicely, getting a reasonable definition of “big data.” It’s not merely about size, although that has to play a part. He breaks it down to the three “Vs” volume, variety and velocity. It makes sense. Just having a big database is not the same thing as dealing with big data. He takes the time to differentiate between the structures of a relational system, that work just fine, and the needs of big data, which just don’t fit well in relational storage. It’s as clear a delineation as I’ve seen on this topic.
The big data management system that Carlos then focuses on is Hadoop. Which makes sense again. If you’re already working within the Microsoft stack, they’re moving on Hadoop with the HDInsight servers and stuff on Azure. So learning how the Hadoop File System (HDFS) and MapReduce work together to provide you with massive data movement and storage keeps you within the set of tools with which you’re already familiar (although there is a ton of learning to do here). Carlos covers MapReduce in quite a lot of detail because it really is the driving force behind Hadoop and how you’re going to get your big data into the system. He covers the topic by explaining in general terms how things will work, then walks you through an example with additional detail and then walks you through another example with a lot more detail. The accumulative effect works well. The core concepts generally start to stick.
I really like how Carlos addresses the technology from a business solution perspective. It’s not that any one technology is a glorious thing, let’s go use it. It’s that certain business problems require certain tools in order to solve them most efficiently. He makes an excellent case for DBAs understanding technologies like Hadoop in order to implement them where they are appropriate. It’s the kind of arguments I’ve always tried to support. The concepts for management of big data are both simple and really complex and Carlos attacks them both in order to educate. I’ve learned a lot from reading this chapter.
He then walks through different aspects of the type of work you’ll need to support as a DBA using a Hadoop server. There’s going to be some pretty major work around import/export of the data, into and out of Hadoop yes, but into and out of other types of storage, such as good old fashioned relational SQL Server, data warehouses, cubes, and Excel/Sharepoint. Carlos takes us through a detailed examination of the primary tool for this function within the HDInsight/Hadoop infrastructure in Azure, squoop. It’s a great way to learn how this stuff works.
Finally, Carlos goes over some pieces of the future of big data, again focusing primarily on the Microsoft/SQL Server stack and how it relates to Hadoop, but also SSRS and all the tools with which you’re already familiar.
Overall, it’s a great chapter. It’s very solid introduction to a topic that I’m just starting to really wrestle with myself, so I’m happy to have it as a reference.