Imagine a talking, conversational and affectionate internet!
Imagine logging into your social network and instead of reading the status updates, you listen to your friends talking in their own voices! Imagine logging into Amazon and be greeted by a salesperson describing the new arrivals you may like!! Imagine logging into Gmail and hearing your boss scolding you for missing a deadline!!!
Imagine logging into your social network and instead of reading the status updates, you listen to your friends talking in their own voices! Imagine logging into Amazon and be greeted by a salesperson describing the new arrivals you may like!! Imagine logging into Gmail and hearing your boss scolding you for missing a deadline!!!
Hang on! Are we going to talk about the next big thing that will relegate Siri and IBM Watson to the dust heap of history?
Not exactly.
On the contrary adding basic speech support to your application is so simple, it makes one wonder why not many people do it.
Here's how we can go about it. Step by step. And there's no need to fasten seat belts either.
On the contrary adding basic speech support to your application is so simple, it makes one wonder why not many people do it.
Here's how we can go about it. Step by step. And there's no need to fasten seat belts either.
Getting the spoken word
So now we will have to research all that speech recognition literature available on arXiv?
Not really. Most of you must have seen the big text box with a microphone on Google's page that lets you speak your search terms into it. Here's the big secret. Its not some Google's secret proprietary widget, its just a voice input tag added in HTML5 which you can use too (although to give devil its due Google are a big supporter of HTML5).
This text box will automatically convert the spoken words into text with some accuracy (although for non-native speakers it is unnecessarily quirky).
So no need to do research in speech recognition, the browser will convert it to text for us automatically. That was relief because speech recognition is a really complicated area.
So no need to do research in speech recognition, the browser will convert it to text for us automatically. That was relief because speech recognition is a really complicated area.
Understanding what was asked
So now we have got user's query in text, how do we know what the user is asking?
That depends upon what kind of data your website has and what kind of sophistication you want.
If what you have is text data, you can just create lucene indexes for all that data (or let solr/elasticsearch do it) and then do a simple query with whatever user has uttered. Chances are you will get some useful results. Lucene automatically removes stop words, does stemming and synonymization and even spell correction (which comes in handy for non-native speakers). It does an excellent job of scoring the results and you can also do boosting on certain terms that are relevant to your domain. If you don't want anything more fancy, you can leave now.
On the other hand if your data is structured (or you plan to add some structure to it) where there are some entities which have relations with other entities, it gets tricky.
The simplest approach is to create names and synonyms for all your entities (or tables), their attributes (columns) and their relationships. The result will more likely resemble an RDF graph rather than a relational database. Here's an example: Take a user-role RDBMS schema like this one:
The simplest approach is to create names and synonyms for all your entities (or tables), their attributes (columns) and their relationships. The result will more likely resemble an RDF graph rather than a relational database. Here's an example: Take a user-role RDBMS schema like this one:
You just have to convert it to something like this:
Its not as scary as it looks. The above image is actually taken from the website of Neo4J which is a cool graph database useful for doing similar things. You also need to convert tokens of the query (after removing stop words) into some sort of graph query. Again Neo4J will be your best friend. Just see this link for an example: User roles in graphs
Clearly if you have a few hundred tables, creating synonyms for all tables/columns/relations will quickly get out of hand. But believe it or not it works in most cases. Most of you must be familiar with Facebook Graph Search. So when we type something like 'People who work at Facebook', Facebook does break the query and run on a graph similar to the one above. Although they claim to do more sophisticated stuff, when someone tried to run the query on an RDF saved in Neo4j his application was banned by Facebook because it was replicating their feature!!!
Its cool, its easy but quite obviously it won't work if some user asks a question like: Who is the highest paid employee in the sales department?
You can't easily convert it to a graph query, A more complicated and elaborate approach is to use Named Entity Recognition (NER). Again there are great open source tools available for your help. One of them is Apache OpenNLP. This link explains how you can train all your entities in OpenNLP to get an NER model and then construct the graph query from it.
We are getting unnecessarily close to a full blown question answering system like IBM Watson which is slightly uncomfortable. So lets end it here with a brief stop over.
Remember our aim is to make only our website accessible via natural language queries, so the scope should be proportional to the content of our website. An actual question answering system like Siri or IBM Watson tries to answer any question a user asks even if its related to cooking. For those of you who really want to build something like that, Manning's Taming Text is an excellent start.
Remember our aim is to make only our website accessible via natural language queries, so the scope should be proportional to the content of our website. An actual question answering system like Siri or IBM Watson tries to answer any question a user asks even if its related to cooking. For those of you who really want to build something like that, Manning's Taming Text is an excellent start.
There is another category of applications called chat bots that try to mimic humans by giving answers about random topics. Classically they use a language ostentatiously called the AIML (Artificial Intelligence Markup Language) to define patterns of questions and their answers. A typical AIML file looks like this. You can also use this syntax to create a chat bot that is specific to answering questions about your business or website. For your inspiration let me add that chat bots did rule the AI world in the good old days.
Wait. Instead of teaching my kid the rules of grammar, I will have to teach her the answer to every possible question she may encounter in her life?
Samantha: I can understand how the limited perspective of an unartificial mind might perceive it that way. You'll get used to it.
Samantha: I can understand how the limited perspective of an unartificial mind might perceive it that way. You'll get used to it.
Finishing it off
So we have just got a table of results by asking a query instead of typing it. Is that why I wasted half an hour?
Calm down. Yes the table of results is what we will get but its indeed not very polite to show it to someone who has made the effort to make a verbal inquiry. We should pay him back in the same coin, i.e. we should announce the results articulately instead of showing an ugly table of results.
Thankfully text-to-speech conversion is far easier than speech recognition so we can even try to do it ourselves.
But wait. Ever heard of Java Speech API? Yes such a thing does exist and is a living relic of the long dead Java community process. Java Speech API lets you turn speech to text literally. It also has a popular open source implementation in FreeTTS.
Here's how a typical interaction with FreeTTS looks.
Thankfully text-to-speech conversion is far easier than speech recognition so we can even try to do it ourselves.
But wait. Ever heard of Java Speech API? Yes such a thing does exist and is a living relic of the long dead Java community process. Java Speech API lets you turn speech to text literally. It also has a popular open source implementation in FreeTTS.
Here's how a typical interaction with FreeTTS looks.
Voice alan = VoiceManager.getInstance().getVoice("alan");
alan.speak("Hi there. I am Alan.");
alan.speak("Hi there. I am Alan.");
I am not joking. Thats how simple it is. Furthermore you can also choose from a set of male and female voices and can even use your own voice!!!
So all we need to do now, is to convert our results into some meaningful text and utter it. Just blithering them out will surely make us look stupid.
Thankfully FreeTTS also allows us to convert the audio into mp3/wav files and HTML5 has another widget for playing these audio files. So all we need is to stream our results back to the patiently waiting user. Or their browser.
And that's all folks.
OMG. Our website really talks back. We have just built the next big thing that will relegate Siri and...
Sigh!!!
OMG. Our website really talks back. We have just built the next big thing that will relegate Siri and...
Sigh!!!
Here are a few resources you can study to see how broad this topic actually is:
- Deep QA: A description of IBM Watson.
- Taming Text: A great book by Manning on text processing.
- Apache Lucene: The most popular text search library.
- Solr and Elasticsearch: Doing natural language search on top of lucene.
- AIML: Artificial Intelligence Markup Language.
- AIML Sets: Example AIML files for various topics.
- ProgramD: An open source chat bot written in Java.
- OpenNLP Manual: Manual of Apache OpenNLP.
- FreeTTS: An open source text to speech engine.
The writer is a renegade software architect turned data scientist who moonlights as a developer.