Tuesday, April 15, 2014

"Big Data" is a qualified noun.

So it's official: we have entered the trough of disillusionment for big data, now that this New York Times editorial has gone viral.
Now it's time to laugh at everyone who ever tried to set up Hadoop clusters on their virtual machines hoping to put a ding in the universe. The signs were already there: no less a person than Michael Stonebraker had warned of an impending doom (he makes similarly strong claims in a very popular MIT Professional Education course on big data).
Not everyone will agree (or disagree) with the merits of Professor Stonebraker's arguments, and weighing them is not the purpose of this post either. Being a developer, though, I can guess where things usually go wrong when a team succumbs to the temptation of big data. So this post is written from the viewpoint of a developer rather than that of a data scientist.
Since Hadoop is primarily the brainchild of the Java community (although its use is ubiquitous across the spectrum), it is surprising how basic ideas like abstraction and layering, which we learnt as Java developers, tend to get forgotten in the excitement of using Hadoop and its ecosystem.
I once worked with a startup that decided to build its product around big data. As is the norm, they coded most of their business logic directly in Hadoop's mapper and reducer classes and in Apache Pig scripts. They also decided to throw MongoDB, a document store, into the mix, even though most of the data they were storing in it was relational and certainly not too big for MySQL. Worse, they never tried to build layers to abstract away the MongoDB or Hadoop APIs. The end result was predictably disastrous. Their code could not run without Hadoop, even when it was processing a few megabytes of data or being unit tested in their Eclipse environments. When it did run, it took almost half an hour to process even a few thousand records, so nobody ran it unless they were forced to. Their data had no referential integrity and could not be safely manipulated using transactions, and references to MongoDB's API were scattered throughout their code, right up to the UI.
It made development hell, because everyone who needed to run the code had to set up Hadoop and the rest of the "big data" stack, even just to test JavaScript. It made communication harder by killing the most effective communication channel, the code itself, because the code was no longer easily runnable. It made the production environment so slow that it was completely unusable for small data entered or uploaded interactively. And to top it off, it created security issues, because data store queries were exposed everywhere, even in JavaScript.
What the hell happened?
Here's how @codinghorror, co-founder of Stack Exchange, puts it:
[embedded tweet]

Let's be honest for once: just using Hadoop and MongoDB (and whatever other cool stuff is on the market) does not automatically make your architecture highly scalable, although it can perhaps make it highly fashionable. The old rules of the game still apply. You MUST build layers to encapsulate every external component or library you use, whether it's a database, a messaging system or an execution engine. We Java folks used to build data access layers for Oracle or MySQL and hide SQL queries behind Spring and Hibernate, not because SQL was bad or uncool, but because over the years we had learnt to avoid tight coupling between our business logic and a particular vendor or product. That allowed us to replace or upgrade most of our technology stack without significant pain, making development, testing and production simpler. There is no reason the same should not be done with MongoDB and other NoSQL databases. In fact there is even more reason, because each NoSQL database has its own API and lacks even the common abstraction that SQL provided. Who knows, you may have to replace MongoDB with Cassandra in production some day. Make sure you do not have to throw away 70% of your code when that happens.
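For example, a thin repository interface is enough to keep MongoDB behind a boundary. The sketch below is illustrative only: the domain class, repository and collection names are invented, and the MongoDB calls assume the classic Java driver (DB, DBCollection, BasicDBObject). The point is simply that exactly one class knows MongoDB exists.

    import com.mongodb.BasicDBObject;
    import com.mongodb.DB;
    import com.mongodb.DBCollection;
    import com.mongodb.DBObject;

    // A trivial domain object, invented for this example.
    class User {
        private final String id;
        private final String name;
        User(String id, String name) { this.id = id; this.name = name; }
        String getId()   { return id; }
        String getName() { return name; }
    }

    // The rest of the application depends only on this interface.
    interface UserRepository {
        User findById(String id);
        void save(User user);
    }

    // The only class that touches MongoDB. Swapping in a JDBC- or
    // Cassandra-backed implementation never touches the business logic.
    class MongoUserRepository implements UserRepository {
        private final DBCollection users;

        MongoUserRepository(DB db) {
            this.users = db.getCollection("users");
        }

        public User findById(String id) {
            DBObject doc = users.findOne(new BasicDBObject("_id", id));
            return doc == null
                    ? null
                    : new User((String) doc.get("_id"), (String) doc.get("name"));
        }

        public void save(User user) {
            users.save(new BasicDBObject("_id", user.getId())
                    .append("name", user.getName()));
        }
    }

Business code is written against UserRepository; moving off MongoDB then means writing one new class, not grepping the UI layer for query strings.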
It is also necessary to separate the business logic running on Hadoop into strategy classes that can be run and unit tested without a Hadoop cluster. That makes life easier for developers and makes the code more communicative and self-explanatory for others. It also helps in production, where it saves you from the embarrassment of running excruciatingly long Hadoop jobs over a few dozen records that a single node could process in milliseconds.
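As a sketch of what that separation might look like (the word-length "logic" below is a deliberately trivial stand-in for real business rules):

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Plain Java, no Hadoop types: this is the class a unit test exercises
    // directly, with no cluster, in milliseconds.
    class WordLengthLogic {
        int lengthOf(String word) {
            return word == null ? 0 : word.trim().length();
        }
    }

    // The mapper is only glue; it adapts Hadoop's types to the strategy class.
    public class WordLengthMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {

        private final WordLengthLogic logic = new WordLengthLogic();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String word = value.toString();
            context.write(new Text(word), new IntWritable(logic.lengthOf(word)));
        }
    }

A plain JUnit test then instantiates WordLengthLogic directly; Hadoop only enters the picture in integration tests and in production runs where the data volume actually warrants it.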
And why do you even use MongoDB? As Professor Stonebraker pointed out in his now infamous lectures in the MIT course, you should not use a non-relational database unless your data is really big, really unstructured, and really does not require referential integrity or ACID transactions. That effectively means never, or at least very rarely. Unfortunately the temptation to use NoSQL is high, not only because it's cool but also because it saves you the trouble of creating a well-thought-out data model (dump anything into MongoDB and it won't complain). That laziness has dangerous consequences. E. F. Codd wasn't an idiot when he advocated the importance of referential integrity and ACID transactions. Use a relational database unless you have a compelling reason not to (and no, being cool is not a compelling reason).
So, in conclusion: it's not bad to build your business around the big data hype and boast about it, but make sure your data is actually big. For new startups, this may mean starting modestly with a conventional stack of relational databases, using layering and abstraction to keep it loosely coupled. Then, once their data grows to terabytes and petabytes, they can migrate the same code to Hadoop and other components with relative ease. Provided, of course, that those components still exist when that happens. And provided their data ever does reach terabytes.
In most cases, it never will.
The writer is a renegade software architect turned data scientist who moonlights as a developer.
