Tuesday, April 22, 2014

Internet's silent era must end.

Imagine a talking, conversational and affectionate internet!
Imagine logging into your social network and, instead of reading the status updates, listening to your friends talking in their own voices! Imagine logging into Amazon and being greeted by a salesperson describing the new arrivals you may like!! Imagine logging into Gmail and hearing your boss scolding you for missing a deadline!!!
Hang on! Are we going to talk about the next big thing that will relegate Siri and IBM Watson to the dust heap of history?
Not exactly.
On the contrary, adding basic speech support to your application is so simple that it makes one wonder why so few people do it.
Here's how we can go about it. Step by step. And there's no need to fasten seat belts either.
Getting the spoken word
So now we will have to research all that speech recognition literature available on arXiv?
Not really. Most of you must have seen the big text box with a microphone on Google's page that lets you speak your search terms into it. Here's the big secret. It's not some secret proprietary Google widget; it's just a voice input tag added in HTML5, which you can use too (although, to give the devil his due, Google is a big supporter of HTML5).
This text box will automatically convert the spoken words into text with some accuracy (although for non-native speakers it is unnecessarily quirky).
So there is no need to do research in speech recognition; the browser will convert speech to text for us automatically. That is a relief, because speech recognition is a really complicated area.
Understanding what was asked
So now we have got the user's query as text. How do we know what the user is asking?
That depends upon what kind of data your website has and what kind of sophistication you want.
If what you have is text data, you can just create Lucene indexes for all that data (or let Solr/Elasticsearch do it) and then run a simple query with whatever the user has uttered. Chances are you will get some useful results. With suitable analyzers configured, Lucene removes stop words, does stemming and synonym expansion, and can even suggest spelling corrections (which comes in handy for non-native speakers). It does an excellent job of scoring the results, and you can also boost certain terms that are relevant to your domain. If you don't want anything fancier, you can leave now.
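To make the analysis step a little more concrete, here is a minimal sketch in plain Java of what happens to an utterance before it is matched against your documents. It is only a toy stand-in and does not use the Lucene API; the class name `SpokenQuery`, the stop-word list and the overlap score are all assumptions made for illustration. A real search library adds stemming, synonyms and a proper ranking formula on top of this idea.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Toy stand-in for the analysis a search library performs on a spoken query. */
final class SpokenQuery {
    // A tiny, hypothetical stop-word list; real analyzers ship much longer ones.
    private static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "a", "an", "the", "me", "for", "of", "to", "in", "on", "is", "are", "show", "please"));

    /** Lower-cases the utterance, splits on non-alphanumerics and drops stop words. */
    static List<String> terms(String utterance) {
        List<String> result = new ArrayList<>();
        for (String token : utterance.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                result.add(token);
            }
        }
        return result;
    }

    /** Counts how many query terms occur in the document (no stemming, no real ranking). */
    static int score(List<String> queryTerms, String document) {
        Set<String> docTerms = new HashSet<>(terms(document));
        int hits = 0;
        for (String term : queryTerms) {
            if (docTerms.contains(term)) {
                hits++;
            }
        }
        return hits;
    }
}
```

Calling `SpokenQuery.terms("Show me the red shoes, please")` leaves only the words worth searching for, and `score` can then be used to sort a handful of documents by how many of those words they contain.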
On the other hand, if your data is structured (or you plan to add some structure to it), with entities that have relations with other entities, it gets tricky.
The simplest approach is to create names and synonyms for all your entities (or tables), their attributes (columns) and their relationships. The result will resemble an RDF graph more than a relational database. Here's an example: Take a user-role RDBMS schema like this one:

You just have to convert it to something like this:

It's not as scary as it looks. The above image is actually taken from the website of Neo4j, which is a cool graph database useful for doing similar things. You also need to convert the tokens of the query (after removing stop words) into some sort of graph query. Again Neo4j will be your best friend. Just see this link for an example: User roles in graphs
Clearly, if you have a few hundred tables, creating synonyms for all tables/columns/relations will quickly get out of hand. But believe it or not, it works in most cases. Most of you must be familiar with Facebook Graph Search. So when we type something like 'People who work at Facebook', Facebook does break up the query and run it on a graph similar to the one above. Although they claim to do more sophisticated stuff, when someone tried to run such queries on an RDF graph saved in Neo4j, his application was banned by Facebook because it was replicating their feature!!!
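The synonym approach can be sketched in a few lines of Java. Everything here is hypothetical: the class `GraphQueryBuilder`, the label `Person`, the relationship `WORKS_AT` and the `name` property are assumptions about what your graph might look like, not anything Neo4j or Facebook prescribes. The sketch only builds a Cypher string; in a real application you would pass the name as a query parameter instead of concatenating it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Maps the words of a simple query onto labels and relationships of a (hypothetical) graph. */
final class GraphQueryBuilder {
    private static final Map<String, String> LABELS = new HashMap<>();
    private static final Map<String, String> RELATIONS = new HashMap<>();
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("who", "that", "at", "in", "the", "are", "is"));

    static {
        // Synonyms for entities (node labels).
        LABELS.put("people", "Person");
        LABELS.put("employees", "Person");
        LABELS.put("users", "User");
        // Synonyms for relationships.
        RELATIONS.put("work", "WORKS_AT");
        RELATIONS.put("works", "WORKS_AT");
        RELATIONS.put("employed", "WORKS_AT");
        RELATIONS.put("live", "LIVES_IN");
    }

    /** Builds a Cypher query, or throws if the words cannot be mapped onto the graph. */
    static String toCypher(String query) {
        String label = null;
        String relation = null;
        List<String> nameParts = new ArrayList<>();
        for (String token : query.trim().split("\\s+")) {
            String key = token.toLowerCase(Locale.ROOT);
            if (STOP_WORDS.contains(key)) {
                continue;
            } else if (label == null && LABELS.containsKey(key)) {
                label = LABELS.get(key);
            } else if (relation == null && RELATIONS.containsKey(key)) {
                relation = RELATIONS.get(key);
            } else {
                nameParts.add(token); // whatever is left is treated as the target's name
            }
        }
        if (label == null || relation == null || nameParts.isEmpty()) {
            throw new IllegalArgumentException("Cannot map query to the graph: " + query);
        }
        // Escape backslashes and quotes; prefer query parameters in production code.
        String name = String.join(" ", nameParts).replace("\\", "\\\\").replace("'", "\\'");
        return "MATCH (a:" + label + ")-[:" + relation + "]->(b {name: '" + name + "'}) RETURN a";
    }
}
```

With these mappings, `GraphQueryBuilder.toCypher("People who work at Facebook")` produces a `MATCH` clause over `Person` nodes with a `WORKS_AT` relationship, while anything the synonym tables cannot explain is rejected rather than guessed.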
It's cool and it's easy, but quite obviously it won't work if some user asks a question like: Who is the highest paid employee in the sales department?
You can't easily convert that to a graph query. A more complicated and elaborate approach is to use Named Entity Recognition (NER). Again there are great open source tools available to help. One of them is Apache OpenNLP. This link explains how you can train all your entities in OpenNLP to get an NER model and then construct the graph query from it.
We are getting uncomfortably close to a full blown question answering system like IBM Watson. So let's end it here with a brief stopover.
Remember, our aim is only to make our website accessible via natural language queries, so the scope should be proportional to the content of our website. An actual question answering system like Siri or IBM Watson tries to answer any question a user asks, even if it's related to cooking. For those of you who really want to build something like that, Manning's Taming Text is an excellent start.
There is another category of applications called chat bots that try to mimic humans by giving answers about random topics. Classically they use a language ostentatiously called AIML (Artificial Intelligence Markup Language) to define patterns of questions and their answers. A typical AIML file looks like this. You can also use this syntax to create a chat bot that is specific to answering questions about your business or website. For your inspiration, let me add that chat bots did rule the AI world in the good old days.
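The pattern-and-answer idea behind AIML can be imitated in plain Java. This is only a rough sketch and not an AIML interpreter: the class `TinyBot` and its methods are made up for illustration, it supports a single `*` wildcard per pattern, and it borrows AIML's `<star/>` placeholder to echo whatever the wildcard matched.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** A toy pattern/template responder in the spirit of AIML categories. */
final class TinyBot {
    // Patterns are tried in insertion order; the first match wins.
    private final Map<Pattern, String> categories = new LinkedHashMap<>();

    /** Registers an upper-case pattern (with an optional "*" wildcard) and its answer template. */
    void learn(String pattern, String template) {
        StringBuilder regex = new StringBuilder();
        for (String word : pattern.trim().split("\\s+")) {
            if (regex.length() > 0) {
                regex.append(" ");
            }
            regex.append(word.equals("*") ? "(.+)" : Pattern.quote(word));
        }
        categories.put(Pattern.compile(regex.toString()), template);
    }

    /** Normalizes the input (upper case, no punctuation) and answers from the first matching pattern. */
    String reply(String input) {
        String normalized = input.toUpperCase(Locale.ROOT)
                .replaceAll("[^A-Z0-9 ]", "")
                .trim()
                .replaceAll("\\s+", " ");
        for (Map.Entry<Pattern, String> category : categories.entrySet()) {
            Matcher m = category.getKey().matcher(normalized);
            if (m.matches()) {
                String star = m.groupCount() > 0 ? m.group(1).toLowerCase(Locale.ROOT) : "";
                return category.getValue().replace("<star/>", star);
            }
        }
        return "Sorry, I do not know that yet.";
    }
}
```

After `learn("DO YOU SELL *", "Let me check whether we stock <star/>.")`, asking the bot "Do you sell red shoes?" fills the placeholder with "red shoes". Every question the bot should handle needs its own pattern, which is exactly the limitation of this style.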
Wait. Instead of teaching my kid the rules of grammar, I will have to teach her the answer to every possible question she may encounter in her life?
Samantha: I can understand how the limited perspective of an unartificial mind might perceive it that way. You'll get used to it.
Finishing it off
So we have just got a table of results by asking a query instead of typing it. Is that why I wasted half an hour?
Calm down. Yes, the table of results is what we will get, but it's indeed not very polite to show it to someone who has made the effort to make a verbal inquiry. We should pay them back in the same coin, i.e. we should announce the results articulately instead of showing an ugly table of results.
Thankfully text-to-speech conversion is far easier than speech recognition so we can even try to do it ourselves.
But wait. Ever heard of the Java Speech API? Yes, such a thing does exist and is a living relic of the long dead Java Community Process. The Java Speech API lets you turn text into speech, literally. It also has a popular open source implementation in FreeTTS.
Here's how a typical interaction with FreeTTS looks.
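Voice alan = VoiceManager.getInstance().getVoice("alan"); // Voice and VoiceManager live in com.sun.speech.freetts
alan.allocate(); // load the voice's resources before the first speak() call
alan.speak("Hi there. I am Alan.");
alan.deallocate(); // release the resources when done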
I am not joking. That's how simple it is. Furthermore, you can also choose from a set of male and female voices and can even use your own voice!!!
So all we need to do now is convert our results into some meaningful text and utter it. Just blithering them out will surely make us look stupid.
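Turning a table of results into something worth saying can be as simple as the following sketch. The class `ResultNarrator`, the wording of the sentences and the limit of three spoken names are all arbitrary choices made for illustration; the returned string is what you would hand to the speech engine.

```java
import java.util.List;

/** Turns a list of result names into one sentence suitable for reading aloud. */
final class ResultNarrator {
    // Reading more than a few names aloud gets tedious, so the rest is summarized.
    private static final int MAX_SPOKEN = 3;

    static String describe(String what, List<String> names) {
        int n = names.size();
        if (n == 0) {
            return "Sorry, I could not find any " + what + ".";
        }
        if (n == 1) {
            return "I found one match for " + what + ": " + names.get(0) + ".";
        }
        int spoken = Math.min(n, MAX_SPOKEN);
        // "A, B and C": comma-separate all but the last spoken name.
        String head = String.join(", ", names.subList(0, spoken - 1))
                + " and " + names.get(spoken - 1);
        String tail = n > spoken ? ", plus " + (n - spoken) + " more" : "";
        return "I found " + n + " matches for " + what + ": " + head + tail + ".";
    }
}
```

The point of the design is that the spoken answer is short and complete in itself; the full table can still be shown on screen for anyone who wants the details.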
Thankfully FreeTTS also allows us to convert the audio into mp3/wav files and HTML5 has another widget for playing these audio files. So all we need is to stream our results back to the patiently waiting user. Or their browser.
And that's all folks.

OMG. Our website really talks back. We have just built the next big thing that will relegate Siri and...
Here are a few resources you can study to see how broad this topic actually is:
  1. Deep QA: A description of IBM Watson.
  2. Taming Text: A great book from Manning on text processing.
  3. Apache Lucene: The most popular text search library.
  4. Solr and Elasticsearch: Doing natural language search on top of Lucene.
  5. AIML: Artificial Intelligence Markup Language.
  6. AIML Sets: Example AIML files for various topics.
  7. ProgramD: An open source chat bot written in Java.
  8. OpenNLP Manual: Manual of Apache OpenNLP.
  9. FreeTTS: An open source text to speech engine.
The writer is a renegade software architect turned data scientist who moonlights as a developer.

Tuesday, April 15, 2014

"Big Data" is a qualified noun.

So it's official. We have entered the trough of disillusionment for big data after this New York Times editorial went viral.
Now it's time to laugh at anyone and everyone who ever tried to set up Hadoop clusters on their virtual machines hoping to put a ding in the universe. Let me say the signs were already there after no less a person than Michael Stonebraker had warned of impending doom (he also makes similarly strong claims in a very popular MIT Professional Education course on big data).
But obviously not everyone will agree (or disagree) with the merits of Professor Stonebraker's arguments. And it's not the purpose of this post to judge them either. However, being a developer, I can guess where things usually go wrong from a developer's perspective when succumbing to the temptation of big data. So this post is written more from the viewpoint of a developer than that of a data scientist.
Since Hadoop is primarily the brainchild of the Java community (although its use is rather ubiquitous across the spectrum), it's surprising how the very basic ideas like abstraction and layering that we learnt as Java developers tend to get forgotten in the excitement of using Hadoop and its ecosystem.
I once worked with a startup that decided to build stuff leveraging big data. As is the norm, they coded most of their business logic in Hadoop's mapper and reducer classes and/or Apache Pig scripts. They also decided to throw a document store, MongoDB, into the mix even though most of the data they were storing in it was relational and was certainly not too big for MySQL. What was even worse, they never tried to build layers to abstract out the MongoDB or Hadoop APIs. The end result was predictably disastrous. Their code could not run without Hadoop even when they were processing a few megabytes of data or unit testing in their Eclipse environments. When it did run, it took almost half an hour to process even a few thousand records, so nobody ran it unless they were forced to do so. Their data also had no referential integrity and could not be safely manipulated using transactions. And references to MongoDB's API were scattered throughout their code, right up to the UI.
It made their development hell because everyone needing to run the code had to set up Hadoop and other "big data" stuff even for testing JavaScript. It made communication harder by killing the most effective communication channel, i.e. the code, because it was no longer easily runnable. It made their production environment so slow that it was completely unusable for small data entered/uploaded interactively. And to top it off, it also created security issues because their data store queries were exposed everywhere, even in JavaScript.
What the hell happened?
Here's how @codinghorror, the co-founder of Stack Exchange, puts it:

Let's be honest for once. Just using Hadoop and MongoDB (and other cool stuff in the market) does not automatically make your architecture highly scalable, although it can perhaps make it highly fashionable. The old rules of the game still apply. You MUST build layers to encapsulate every external component or library you use, whether it's a database or a messaging system or an execution engine. We, the Java folks, used to build data access layers for Oracle or MySQL and tried to hide SQL queries behind Spring/Hibernate not because SQL was bad or uncool but because over the years we had learnt to avoid tight coupling between our business logic and a particular vendor or product. It allowed us to replace or update most of our technology stack without significant pain, making development, testing and production simpler. There is no reason the same should not be done with MongoDB and other NoSQL databases. Actually, even more so, because each NoSQL database has its own API and does not even have the common abstraction that SQL provided. Who knows, you may have to replace MongoDB with Cassandra in production some day. Make sure you do not have to throw away 70% of your code.
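A data access layer of this kind does not need to be elaborate. The following is a minimal sketch with made-up names (`UserStore`, `InMemoryUserStore`, `SignupService`): the business logic talks only to an interface, the in-memory implementation is enough for unit tests, and a class backed by MongoDB, MySQL or Cassandra would implement the same interface without the service noticing.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Storage-agnostic contract; business code depends only on this. */
interface UserStore {
    void save(String id, String email);

    Optional<String> findEmail(String id);
}

/** In-memory implementation for tests; a database-backed class would implement the same interface. */
final class InMemoryUserStore implements UserStore {
    private final Map<String, String> emails = new HashMap<>();

    @Override
    public void save(String id, String email) {
        emails.put(id, email);
    }

    @Override
    public Optional<String> findEmail(String id) {
        return Optional.ofNullable(emails.get(id));
    }
}

/** Business logic with no reference to any database API. */
final class SignupService {
    private final UserStore store;

    SignupService(UserStore store) {
        this.store = store;
    }

    /** Registers a user unless the email looks invalid or the id is already taken. */
    boolean register(String id, String email) {
        if (email == null || !email.contains("@") || store.findEmail(id).isPresent()) {
            return false;
        }
        store.save(id, email);
        return true;
    }
}
```

Swapping the data store then means writing one new class that implements `UserStore`, and `new SignupService(new InMemoryUserStore())` keeps working in every unit test.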
It's also necessary that the business logic running on Hadoop be separated out into strategy classes that can be run and unit tested without the need for Hadoop clusters. It surely makes life easier for a developer and makes the code more communicative and self-explanatory for others. It also helps in production, where it saves one from the embarrassment of running excruciatingly long Hadoop jobs on a few dozen records that can easily be processed on a single node within milliseconds.
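As a rough illustration of that separation, here is the classic word count written as a plain Java class with no Hadoop imports. The name `WordCounter` is hypothetical; the assumption is that a mapper or reducer would be a thin adapter that merely delegates to logic like this, so the same code can run in a unit test or on a single node for small inputs.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Pure business logic: no Hadoop types, so it runs anywhere, including unit tests. */
final class WordCounter {
    /** Counts case-insensitive word occurrences across the given lines. */
    static Map<String, Integer> count(Iterable<String> lines) {
        Map<String, Integer> counts = new TreeMap<>(); // sorted keys give stable output
        for (String line : lines) {
            for (String word : line.toLowerCase(Locale.ROOT).split("\\W+")) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
        return counts;
    }
}
```

A few dozen records can be fed straight into `WordCounter.count(...)` from a test or a small interactive request, and the cluster is only involved when the input is actually large.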
And why do you even use MongoDB? As Professor Stonebraker pointed out in his now infamous lectures in the MIT course, you should not use a non-relational database unless your data is really big and really unstructured and does not really require referential integrity and ACID guarantees. This effectively means never. Or at least very rarely. Unfortunately the temptation to use NoSQL is high, not only because it's cool but also because it saves one from the trouble of creating a well thought-out data model (dump anything in MongoDB and it won't complain). But this laziness does have dangerous consequences. E.F. Codd wasn't an idiot when he advocated the importance of referential integrity and ACID transactions. You had better use a relational database unless you have a compelling reason not to do so (and no, being cool is not a compelling reason).
So in conclusion, it's not bad to build your business around the big data hype and boast about it. But make sure the data is actually big. For new startups, this may mean starting modestly with the conventional technology stack of relational databases and using layering and abstractions to keep it loosely coupled, so that once their data grows to terabytes and petabytes, they can easily migrate the same code to Hadoop and other components. Provided, of course, that these components exist when this happens. And provided their data does reach terabytes some day.
In most cases this may never happen!!