Gaping Infinity

Tuesday, April 22, 2014

Internet's silent era must end.

Imagine a talking, conversational and affectionate internet!
Imagine logging into your social network and, instead of reading the status updates, listening to your friends talking in their own voices! Imagine logging into Amazon and being greeted by a salesperson describing the new arrivals you may like!! Imagine logging into Gmail and hearing your boss scolding you for missing a deadline!!!

Hang on! Are we going to talk about the next big thing that will relegate Siri and IBM Watson to the dust heap of history?

Not exactly.
On the contrary, adding basic speech support to your application is so simple that it makes one wonder why so few people do it.
Here's how we can go about it. Step by step. And there's no need to fasten seat belts either.

Getting the spoken word

So now we will have to research all that speech recognition literature available on arXiv?

Not really. Most of you must have seen the big text box with a microphone on Google's page that lets you speak your search terms into it. Here's the big secret: it's not some secret proprietary Google widget, it's just a voice input tag added in HTML5, which you can use too (although, to give the devil his due, Google is a big supporter of HTML5).

This text box will automatically convert the spoken words into text with reasonable accuracy (although for non-native speakers it can be needlessly quirky).
So there is no need to do research in speech recognition; the browser converts speech to text for us automatically. That is a relief, because speech recognition is a really complicated area.

Understanding what was asked

So now we have got the user's query in text. How do we know what the user is asking?

That depends upon what kind of data your website has and what kind of sophistication you want.

If what you have is text data, you can just create Lucene indexes for all that data (or let Solr/Elasticsearch do it) and then run a simple query with whatever the user has uttered. Chances are you will get some useful results. With the right analyzers, Lucene removes stop words, does stemming and synonym expansion, and can even do spell correction (which comes in handy for non-native speakers). It does an excellent job of scoring the results, and you can also boost terms that are relevant to your domain. If you don't want anything fancier, you can leave now.

On the other hand, if your data is structured (or you plan to add some structure to it), with entities that have relations to other entities, it gets tricky.
The simplest approach is to create names and synonyms for all your entities (or tables), their attributes (columns) and their relationships. The result will more likely resemble an RDF graph rather than a relational database. Here's an example: Take a user-role RDBMS schema like this one:

You just have to convert it to something like this:

It's not as scary as it looks. The above image is actually taken from the website of Neo4j, a cool graph database that is useful for doing similar things. You also need to convert the tokens of the query (after removing stop words) into some sort of graph query. Again, Neo4j will be your best friend. Just see this link for an example: User roles in graphs

Clearly, if you have a few hundred tables, creating synonyms for all tables/columns/relations will quickly get out of hand. But believe it or not, it works in most cases. Most of you must be familiar with Facebook Graph Search. When we type something like 'People who work at Facebook', Facebook breaks up the query and runs it on a graph similar to the one above. Although they claim to do more sophisticated stuff, when someone tried to run such queries on an RDF graph saved in Neo4j, his application was banned by Facebook because it was replicating their feature!!!

It's cool, it's easy, but quite obviously it won't work if some user asks a question like: Who is the highest paid employee in the sales department?

You can't easily convert it to a graph query. A more complicated and elaborate approach is to use Named Entity Recognition (NER). Again, there are great open source tools available to help. One of them is Apache OpenNLP. This link explains how you can train OpenNLP on all your entities to get an NER model and then construct the graph query from it.

We are getting uncomfortably close to a full-blown question answering system like IBM Watson. So let's end it here with a brief stopover.
Remember, our aim is only to make our website accessible via natural language queries, so the scope should be proportional to the content of our website. An actual question answering system like Siri or IBM Watson tries to answer any question a user asks, even if it's related to cooking. For those of you who really want to build something like that, Manning's Taming Text is an excellent start.

There is another category of applications called chat bots that try to mimic humans by giving answers about random topics. Classically they use a language ostentatiously called the AIML (Artificial Intelligence Markup Language) to define patterns of questions and their answers. A typical AIML file looks like this. You can also use this syntax to create a chat bot that is specific to answering questions about your business or website. For your inspiration let me add that chat bots did rule the AI world in the good old days.

Wait. Instead of teaching my kid the rules of grammar, I will have to teach her the answer to every possible question she may encounter in her life?
Samantha: I can understand how the limited perspective of an unartificial mind might perceive it that way. You'll get used to it.

Finishing it off

So we have just got a table of results by asking a query instead of typing it. Is that why I wasted half an hour?

Calm down. Yes, the table of results is what we will get, but it's not very polite to show it to someone who has made the effort to make a verbal inquiry. We should pay him back in the same coin, i.e. we should announce the results articulately instead of showing an ugly table of results.
Thankfully text-to-speech conversion is far easier than speech recognition so we can even try to do it ourselves.
But wait. Ever heard of the Java Speech API? Yes, such a thing does exist, and it is a living relic of the long-dead Java Community Process. The Java Speech API lets you turn text into speech, literally. It also has a popular open source implementation in FreeTTS.
Here's how a typical interaction with FreeTTS looks.

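Voice alan = VoiceManager.getInstance().getVoice("alan");
alan.allocate();   // load the voice's resources before the first speak() call
alan.speak("Hi there. I am Alan.");
alan.deallocate(); // release the resources once we are done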

I am not joking. That's how simple it is. Furthermore, you can choose from a set of male and female voices and can even use your own voice!!!

So all we need to do now is convert our results into some meaningful text and utter it. Just blurting them out will surely make us look stupid.
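To make "meaningful text" a bit more concrete, here is a minimal sketch of turning a list of results into one short sentence that can then be handed to the speech engine. The class name `SpokenSummary`, the wording of the sentence and the `limit` parameter are all my own assumptions for illustration, not part of any library.

```java
import java.util.List;

// Hypothetical helper: turns a result list into one short utterance.
final class SpokenSummary {
    // `limit` caps how many results are read out, so the listener is not
    // forced to sit through the whole table.
    static String describe(String query, List<String> results, int limit) {
        int n = results.size();
        if (n == 0) {
            return "Sorry, I found nothing for " + query + ".";
        }
        String noun = (n == 1) ? "result" : "results";
        String top = String.join(", ", results.subList(0, Math.min(limit, n)));
        return "I found " + n + " " + noun + " for " + query
                + ". Starting from the top: " + top + ".";
    }
}
```

The returned string is what you would pass to the speak call, instead of reading out rows and columns one by one.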

Thankfully FreeTTS also allows us to convert the audio into mp3/wav files and HTML5 has another widget for playing these audio files. So all we need is to stream our results back to the patiently waiting user. Or their browser.

And that's all folks.

OMG. Our website really talks back. We have just built the next big thing that will relegate Siri and...
Sigh!!!

Here are a few resources you can study to see how broad this topic actually is:

Deep QA: A description of IBM Watson.
Taming Text: A great book by Manning on text processing.
Apache Lucene: The most popular text search library.
Solr and Elasticsearch: Doing natural language search on top of Lucene.
AIML: Artificial Intelligence Markup Language.
AIML Sets: Example AIML files for various topics.
ProgramD: An open source chat bot written in Java.
OpenNLP Manual: Manual of Apache OpenNLP.
FreeTTS: An open source text to speech engine.

The writer is a renegade software architect turned data scientist who moonlights as a developer.

Tuesday, April 15, 2014

"Big Data" is a qualified noun.

So it's official. We have entered the trough of disillusionment for big data after this New York Times editorial went viral.

Now it's time to laugh at anyone and everyone who ever tried to set up Hadoop clusters on their virtual machines hoping to put a ding in the universe. Let me say the signs were already there after no less a person than Michael Stonebraker had warned of impending doom (he also makes similarly strong claims in a very popular MIT Professional Education course on big data).

But obviously not many will agree (or disagree) with the merits of Professor Stonebraker's arguments, and that is not the purpose of this post either. However, being a developer, I can guess where things usually go wrong from a developer's perspective when succumbing to the temptation of big data. So this post is written from the viewpoint of a developer rather than that of a data scientist.

Since Hadoop is primarily the brainchild of the Java community (although its use is rather ubiquitous across the spectrum), it's surprising how the very basic ideas like abstraction and layering that we learnt as Java developers tend to get forgotten in the excitement of using Hadoop and its ecosystem.

I once worked with a startup that decided to build stuff leveraging big data. As is the norm, they coded most of their business logic in Hadoop's mapper and reducer classes and/or Apache Pig scripts. They also decided to throw a document store, MongoDB, into the mix, even though most of the data they were storing in it was relational and was certainly not too big for MySQL. What was even worse, they never tried to build layers to abstract out the MongoDB or Hadoop APIs. The end result was predictably disastrous. Their code could not run without Hadoop, even when they were processing a few megabytes of data or unit testing in their Eclipse environments. When it did run, it took almost half an hour to process even a few thousand records, so nobody ran it unless they were forced to. Their data also had no referential integrity and could not be safely manipulated using transactions, and references to MongoDB's API were scattered throughout their code, right up to the UI.

It made their development hell, because everyone needing to run the code had to set up Hadoop and other "big data" stuff even for testing JavaScript. It made communication harder by killing the most effective communication channel, i.e. the code, because it was no longer easily runnable. It made their production environment so slow that it was completely unusable for small data entered/uploaded interactively. And to top it all, it also created security issues, because their data store queries were exposed everywhere, even in JavaScript.

What the hell happened?
Here's how @codinghorror, the co-founder of Stack Exchange, puts it:

Let's be honest for once. Just using Hadoop and MongoDB (and other cool stuff in the market) does not automatically make your architecture highly scalable, although it can perhaps make it highly fashionable. The old rules of the game still apply. You MUST build layers to encapsulate every external component or library you use, whether it's a database, a messaging system or an execution engine. We, the Java folks, used to build data access layers for Oracle or MySQL and tried to hide SQL queries behind Spring/Hibernate, not because SQL was bad or uncool, but because over the years we had learnt to avoid tight coupling between our business logic and a particular vendor or product. It allowed us to replace or update most of our technology stack without significant pain, making development, testing and production simpler. There is no reason the same should not be done with MongoDB and other NoSQL databases. Actually even more so, because each NoSQL database has its own API and does not even have the abstraction that SQL provided. Who knows, you may have to replace MongoDB with Cassandra in production some day. Make sure you do not have to throw away 70% of your code.
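A minimal sketch of what such a layer can look like, assuming a made-up `UserStore` interface and a toy `GreetingService` standing in for the business logic. All names here are hypothetical. The point is only that the business code sees an interface, so a MongoDB, Cassandra or MySQL implementation can be swapped in behind it, and an in-memory one is enough for unit tests.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// The only thing the business logic knows about storage (hypothetical interface).
interface UserStore {
    void save(String id, String name);
    Optional<String> findName(String id);
}

// In-memory implementation for tests; a MongoDB- or MySQL-backed class
// would implement the same interface and live in its own package.
final class InMemoryUserStore implements UserStore {
    private final Map<String, String> rows = new HashMap<>();

    @Override
    public void save(String id, String name) {
        rows.put(id, name);
    }

    @Override
    public Optional<String> findName(String id) {
        return Optional.ofNullable(rows.get(id));
    }
}

// Business logic: no vendor API, no query strings, just the interface.
final class GreetingService {
    private final UserStore store;

    GreetingService(UserStore store) {
        this.store = store;
    }

    String greet(String id) {
        return store.findName(id).map(n -> "Hello, " + n).orElse("Hello, stranger");
    }
}
```

With this shape, replacing the database means writing one new `UserStore` implementation rather than touching every caller.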

It's also necessary that the business logic running on Hadoop be separated out into strategy classes that can be run and unit tested without the need for Hadoop clusters. It surely makes life easier for a developer and makes the code more communicative and self-explanatory for others. It also helps in production, where it saves one from the embarrassment of running excruciatingly long Hadoop jobs on a few dozen records that can easily be processed on a single node within milliseconds.
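As a rough illustration, here is the classic word count with the logic pulled out into a plain class. `WordCountLogic` and its method names are hypothetical. The idea is that a Hadoop mapper and reducer would be thin wrappers calling `tokenize` and `sum`, while `runLocally` lets a unit test or a small interactive upload use exactly the same logic on a single node.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain Java, no Hadoop types anywhere (hypothetical class for illustration).
final class WordCountLogic {
    // What a mapper would call for each input line.
    static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<>();
        for (String t : line.toLowerCase().split("\\W+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // What a reducer would call for each key's list of counts.
    static int sum(Iterable<Integer> counts) {
        int total = 0;
        for (int c : counts) {
            total += c;
        }
        return total;
    }

    // Single-node driver: group by word, then reduce, all in memory.
    static Map<String, Integer> runLocally(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (String word : tokenize(line)) {
                grouped.computeIfAbsent(word, k -> new ArrayList<>()).add(1);
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), sum(e.getValue()));
        }
        return result;
    }
}
```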

And why do you even use MongoDB? As Professor Stonebraker pointed out in his now infamous lectures in the MIT course, you should not use a non-relational database unless your data is really big and really unstructured and does not really require referential integrity and ACID guarantees. This effectively means never. Or at least very rarely. Unfortunately, the temptation to use NoSQL is high, not only because it's cool but also because it saves one the trouble of creating a well thought-out data model (dump anything in MongoDB and it won't complain). But this laziness does have dangerous consequences. E.F. Codd wasn't an idiot when he advocated the importance of referential integrity and ACID transactions. You had better use a relational database unless you have a compelling reason not to (and no, being cool is not a compelling reason).

So in conclusion, it's not bad to build your business around the big data hype and boast about it. But make sure it is actually big. For new startups, this may mean starting a little modestly with the conventional technology stack of relational databases and using layering and abstractions to keep it loosely coupled. So that once their data grows to the size of terabytes and petabytes, they can easily migrate the same code to Hadoop and other components. Provided, of course, that these components exist when this happens. And provided their data does reach terabytes some day.
In most cases this may never happen!!


Friday, July 17, 2009

Performance Monitoring of .Net Framework

References:
http://msdn.microsoft.com/en-us/library/fxk122b4.aspx
http://msdn.microsoft.com/en-us/library/w8f5kw2e.aspx
http://msdn.microsoft.com/en-us/library/ms254503%28VS.80%29.aspx
http://msdn.microsoft.com/en-us/library/aa394084%28VS.85%29.aspx
http://www.dnzone.com/go?180
http://www.oracle.com/technology/products/oem/pdf/ds_net.pdf
http://technet.microsoft.com/en-us/library/bb727100.aspx
http://technet.microsoft.com/en-us/library/cc736883(WS.10).aspx

Tuesday, June 23, 2009

Availability Analysis and Prediction

The issue with BezVision's performance analysis is that it assumes a fault-free system. If anything goes wrong, e.g. the database goes down, the collectors will stop and no profiles will be created at all. This makes it impossible to find out the source of the blackout and to identify the workload that caused it.

The solution is to incorporate availability and reliability analysis and prediction into BezVision's toolset.

We can gather several metrics by analysing collection failures like MTTF, Probability, Up Time, Down time, Reliability etc.
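As a rough sketch of how two of these metrics fall out of collection failures, assume we record one up/down sample per fixed collection interval. The class name `AvailabilityStats` and the definitions used here (availability as the fraction of up samples, MTTF as total uptime divided by the number of up-to-down transitions) are my simplifying assumptions, not BezVision code.

```java
// Hypothetical summary of a series of heartbeat samples.
final class AvailabilityStats {
    final double availability;  // fraction of samples in which the system was up
    final int failures;         // number of up -> down transitions
    final double mttfSeconds;   // total uptime divided by number of failures

    private AvailabilityStats(double availability, int failures, double mttfSeconds) {
        this.availability = availability;
        this.failures = failures;
        this.mttfSeconds = mttfSeconds;
    }

    // up[i] is true if the i-th heartbeat succeeded; samples are intervalSeconds apart.
    static AvailabilityStats from(boolean[] up, double intervalSeconds) {
        int upCount = 0;
        int failures = 0;
        for (int i = 0; i < up.length; i++) {
            if (up[i]) {
                upCount++;
            }
            if (i > 0 && up[i - 1] && !up[i]) {
                failures++;
            }
        }
        double availability = up.length == 0 ? 0.0 : (double) upCount / up.length;
        // With no observed failure the MTTF is unbounded for this sample window.
        double mttf = failures == 0
                ? Double.POSITIVE_INFINITY
                : upCount * intervalSeconds / failures;
        return new AvailabilityStats(availability, failures, mttf);
    }
}
```

For ten one-minute samples with seven successes and two separate outages, this gives an availability of 0.7 and an MTTF of 210 seconds.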

We will need a heartbeat collector with a timeout for the DB, and heartbeat queries for the node (it may be running out of memory), the disk (it may be full, so check whether its free space is below a threshold), the network, etc.
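A heartbeat with a timeout can be sketched with the standard `java.util.concurrent` classes. The `Heartbeat` class and the idea of passing the probe as a `Callable` are assumptions for illustration; the actual probe (a trivial DB query, a memory check, a disk space check) is whatever the collector needs.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper: runs a probe and treats timeouts and errors as "down".
final class Heartbeat {
    static boolean isAlive(Callable<Boolean> probe, long timeoutMillis) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Future<Boolean> result = pool.submit(probe);
            // A probe that answers in time with true means the resource is up.
            return Boolean.TRUE.equals(result.get(timeoutMillis, TimeUnit.MILLISECONDS));
        } catch (TimeoutException | ExecutionException e) {
            // No answer in time, or the probe itself failed: count it as a missed beat.
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            pool.shutdownNow(); // interrupt a probe that is still hanging
        }
    }
}
```

A disk probe, for example, could be `() -> new java.io.File("/").getUsableSpace() > threshold`, with `threshold` chosen by the operator.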

We can then analyse the data and predict the availability of a system at various points in time.

We can also analyse the workload that caused the system (dbms/node/disk) failure.

Availability Analysis and Audits

This section draws heavily from the paper:

Stochastic Reward Nets for Reliability Prediction

http://www.cs.ucr.edu/~ciardo/pubs/1994CRMS-Reliability.pdf

Availability Prediction

This section draws heavily from the paper:

Exploiting Availability Prediction in Distributed Systems

http://www.eecs.umich.edu/~jmickens/predictors.pdf

We first have to define a fixed set of availability classes.

Class-1: Completely Offline

Class-16: Completely Online

Instead of queuing networks, we have to define predictor networks. But first let's briefly describe the types of predictors mentioned in the paper.

1. Right Now Predictor

Put simply, if a system is online now, it is predicted to be online for all periods in the future. And if it's offline, it is predicted to be offline for all periods.

2. Saturating Counter Predictors

A saturating counter predictor uses n bits of state and can assume 2^n states.

e.g. SatCount-2 predictor uses a 2-bit saturating counter.

Such a counter can assume four values (-2,-1,+1,and +2) which correspond to four uptime states (strongly offline, weakly offline, weakly online, and strongly online).

During each sampling period, the counter is incremented if the node is online, otherwise it is decremented; increments and decrements cannot move the counter beyond the saturated counts of -2 and +2. Predictions for all lookahead periods use the current value of the saturating counter, i.e., negative counter values produce "offline" predictions, whereas positive values result in "online" predictions.
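The counter described above is small enough to sketch directly. The class name `SatCounter` and the choice to start in the weakly online state are my assumptions; the value set that skips zero (-2, -1, +1, +2 for two bits) follows the description above.

```java
// Sketch of an n-bit saturating counter predictor with values in
// {-max, ..., -1, +1, ..., +max}, where max = 2^(n-1). Zero is skipped.
final class SatCounter {
    private final int max;
    private int value = 1; // assumption: start "weakly online"

    SatCounter(int bits) {
        this.max = 1 << (bits - 1);
    }

    // Feed one uptime sample: move towards +max if online, towards -max if offline.
    void update(boolean online) {
        if (online && value < max) {
            value = (value == -1) ? 1 : value + 1;
        } else if (!online && value > -max) {
            value = (value == 1) ? -1 : value - 1;
        }
    }

    // The same prediction is used for every lookahead period.
    boolean predictOnline() {
        return value > 0;
    }

    int value() {
        return value;
    }
}
```

One stray offline sample after a long online stretch only moves a 2-bit counter from +2 to +1, so the prediction stays "online", which is the tolerance to occasional deviations that makes this predictor attractive.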

Saturating Counter Predictors are pretty tolerant of occasional deviations from long stretches of uniform uptime behavior. However, like the RightNow predictors, they are inaccurate for nodes with periodic uptimes.

3. State-Based Predictors: History Predictor

Advantage: it can predict the behavior of nodes with periodic availabilities.

These predictors explicitly represent a node’s recent uptime states using a de Bruijn graph. A de Bruijn graph over k bits has a vertex for each binary number in [0, 2^k - 1]. Each vertex with binary label b1b2...bk has two outgoing edges, one to the node labeled b2b3...bk0 and the other to the node labeled b2b3...bk1.

Suppose that we represent a node’s recent availability as a k-bit binary string, with bi equal to 0 if the node was offline during the ith most recent sampling period and 1 if it was online. A k-bit de Bruijn graph will represent each possible transition between availability states.

To assist uptime predictions, we attach a 2-bit saturating counter to each vertex. These counters represent the likelihood of traversing a particular outbound edge; negative counter values bias towards the 0 path, whereas positive values bias towards the 1 path. After each uptime sampling, the counter for the vertex representing the previous uptime state is incremented or decremented according to whether the new uptime sample represented an "online" edge or an "offline" edge.

To make an uptime prediction for n time steps into the future, we trace the most likely path of n edges starting from the vertex representing the current uptime state. If the last bit we shift in is a 1, we predict the node will be online in n time units, otherwise we predict that it will be offline.
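A compact sketch of this History predictor, under a few assumptions of mine: the class name `HistoryPredictor`, storing the last k samples in an int with the newest sample in the lowest bit, and starting every counter at -1 (weakly offline). The graph itself is implicit, because following an edge is just a shift.

```java
import java.util.Arrays;

// Sketch of a k-bit de Bruijn History predictor with a 2-bit saturating
// counter (values -2, -1, +1, +2) attached to each vertex.
final class HistoryPredictor {
    private final int mask;
    private final int[] counters;
    private int state; // last k samples, newest sample in the lowest bit

    HistoryPredictor(int k) {
        this.mask = (1 << k) - 1;
        this.counters = new int[1 << k];
        Arrays.fill(counters, -1); // assumption: every vertex starts weakly offline
    }

    // Record a new sample: update the counter of the previous state,
    // then follow the matching edge.
    void observe(boolean online) {
        counters[state] = step(counters[state], online);
        state = shift(state, online);
    }

    // Trace the most likely path of n edges (n >= 1) and report the last bit shifted in.
    boolean predict(int n) {
        int s = state;
        boolean bit = false;
        for (int i = 0; i < n; i++) {
            bit = counters[s] > 0;
            s = shift(s, bit);
        }
        return bit;
    }

    private int shift(int s, boolean bit) {
        return ((s << 1) | (bit ? 1 : 0)) & mask;
    }

    private static int step(int c, boolean up) {
        if (up) {
            return c == 2 ? 2 : (c == -1 ? 1 : c + 1);
        }
        return c == -2 ? -2 : (c == 1 ? -1 : c - 1);
    }
}
```

Fed a node that repeats "two samples up, two samples down", a 2-bit predictor learns the cycle after a couple of rounds, which is exactly the case where the saturating counter alone fails.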

4. Twiddle History Predictors

Suppose that a node has a fundamentally cyclical uptime pattern, but the pattern is “noisy.” For example, a machine might be online 80% of the time between midnight and noon and always offline at other times. If the punctuated downtime between midnight and noon is randomly scattered, the de Bruijn graph will accumulate infrequently visited vertices whose labels contain mostly 1's but differ in a small number of bit positions.

As the length of time that we observe the node grows, noisy downtime will generate increasingly more vertices whose labels are within a few bit-flips of each other. Probabilistically speaking, we should always predict that the node will be online from midnight to noon.

However, the many vertices representing this time interval are infrequently visited and thus infrequently updated. Their counters may have weak saturations (-1 or +1) that poorly capture the underlying cyclic availability.

For nodes like this, we can nudge predictions towards the probabilistically favored ones by considering superpositions of multiple uptime states. Given a vertex v representing the current uptime history, we make a prediction by considering v’s counter and the counters of all observed vertices whose labels differ from v's by at most d bits.

For example, suppose that k=3 and d=1, and that each of the 2^k = 8 possible vertices corresponds to an actually observed uptime history. To make a prediction for the next time step when the current vertex has the label 111, we average the counter values belonging to vertices 111, 110, 101, and 011. If the average is greater than 0, we predict "online"; otherwise we predict "offline".
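The averaging step in this example can be sketched as follows. The class name `TwiddledHistory` and the representation of the observed vertices as a map from label to counter value are assumptions; the Hamming distance test uses `Integer.bitCount` on the XOR of the two labels.

```java
import java.util.Map;

// Sketch of the TwiddledHistory prediction for the next time step.
final class TwiddledHistory {
    // counters maps each observed vertex label to its saturating counter value.
    // Vertices whose label differs from `current` by at most d bits take part.
    static boolean predictNext(int current, int d, Map<Integer, Integer> counters) {
        int sum = 0;
        int used = 0;
        for (Map.Entry<Integer, Integer> e : counters.entrySet()) {
            if (Integer.bitCount(e.getKey() ^ current) <= d) {
                sum += e.getValue();
                used++;
            }
        }
        // The average is positive exactly when the sum is positive.
        return used > 0 && sum > 0;
    }
}
```

With counters 111 = -1, 110 = +2, 101 = +2 and 011 = +1, the plain History predictor (d = 0) would say "offline" from vertex 111, while d = 1 averages the four values to +1 and says "online".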

The TwiddledHistory strategy will perform worse than the regular History strategy when vertices within d bits of each other correspond to truly distinct uptime patterns. In these situations, superposition amalgamates state belonging to unrelated availability behavior, reducing prediction accuracy.

5. Linear Predictor

It uses a linear combination of the last k signal points to predict future points. The k coefficients are chosen to reduce the magnitude of an error signal, which is assumed to be uncorrelated with the underlying “pure” signal. To make availability predictions for t time steps into the future, we iteratively evaluate the linear combination using the k most recent availability samples, shifting out the oldest data point and shifting in the predicted data point. Linear prediction produces good estimates for signals that are stable in the short term but oscillatory in the medium to long term.

We would expect this technique to work well with nodes having diurnal uptimes, e.g., machines that are online during the work day and offline otherwise.
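The iterative evaluation can be sketched like this, assuming the k coefficients have already been fitted by some other means (the fitting step is not shown). The class name `LinearForecast`, the oldest-first ordering of the samples and the 0.5 threshold for turning the forecast into an online/offline answer are all my assumptions.

```java
// Sketch of the iterative part of a linear predictor.
final class LinearForecast {
    // coeffs[i] multiplies recent[i]; recent holds the k latest samples, oldest first.
    // Returns the forecast value t steps ahead (t >= 1).
    static double forecast(double[] coeffs, double[] recent, int t) {
        double[] window = recent.clone();
        double next = 0.0;
        for (int step = 0; step < t; step++) {
            next = 0.0;
            for (int i = 0; i < coeffs.length; i++) {
                next += coeffs[i] * window[i];
            }
            // Shift out the oldest point and shift in the predicted one.
            System.arraycopy(window, 1, window, 0, window.length - 1);
            window[window.length - 1] = next;
        }
        return next;
    }

    // Assumption: treat a forecast of at least 0.5 as "online".
    static boolean predictOnline(double[] coeffs, double[] recent, int t) {
        return forecast(coeffs, recent, t) >= 0.5;
    }
}
```

With coefficients {1, 0, 0, 0} the forecast simply repeats the sample from four steps ago, which is the degenerate case of a perfectly periodic signal.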

6. Hybrid Predictor

A machine can transition between multiple availability patterns during its lifetime.

To dynamically select the best predictor for a given uptime pattern and lookahead interval, we employ a Hybrid predictor.

For each lookahead period of interest, the Hybrid predictor maintains tournament counters. These saturating counters determine the best predictor to use for that lookahead period.

Negative counter values select the left input, whereas positive values select the right.

Consider a Hybrid predictor making forecasts for an n-sample lookahead period. At the beginning of each time unit, the Hybrid predictor samples the current uptime state of its node. Its five sub-predictors are updated with this state, and each sub-predictor makes a prediction for n time units into the future.

The final output of the Hybrid predictor is selected via tournaments, as shown in Figure 2 of the paper, and the individual sub-predictions are placed in a queue and timestamped with curr time + n. If the head of the queue contains an entry whose timestamp matches the current time, the entry is dequeued and the tournament counters are updated using the dequeued predictions.

A tournament counter remains unchanged if both of the relevant dequeued predictions match the current uptime state or both do not match.

Otherwise, one prediction was right and the other was wrong, and the tournament counter is incremented or decremented appropriately. In the last stage of the update, the curr time value is incremented.
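A single tournament between two sub-predictors can be sketched as below. The class name `Tournament`, the 2-bit counter width and the initial bias towards the left input are assumptions; the queueing of predictions until their timestamp comes due is left out.

```java
// Sketch of one tournament counter choosing between a "left" and a "right" predictor.
final class Tournament {
    private int counter = -1; // assumption: start weakly in favour of the left input

    // Negative counter values select the left input, positive values the right.
    boolean select(boolean leftPrediction, boolean rightPrediction) {
        return counter < 0 ? leftPrediction : rightPrediction;
    }

    // Called when the actual uptime state for an earlier pair of predictions is known.
    void update(boolean leftPrediction, boolean rightPrediction, boolean actual) {
        boolean leftRight = leftPrediction == actual;
        boolean rightRight = rightPrediction == actual;
        if (leftRight == rightRight) {
            return; // both right or both wrong: no information, leave the counter alone
        }
        if (rightRight) {
            counter = counter == 2 ? 2 : (counter == -1 ? 1 : counter + 1);
        } else {
            counter = counter == -2 ? -2 : (counter == 1 ? -1 : counter - 1);
        }
    }

    int counter() {
        return counter;
    }
}
```

A full Hybrid predictor would arrange several of these in a tree, feeding the winners of the first round into the next one.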

Monday, June 1, 2009

Age of Mediocres

Why is Twitter such a huge success? And why will Wolfram Alpha never make it big?

The answer is simple. We are now living in the age of mediocres. Even when it comes to the tech-savvy elite.

Now I am not dissing the genius that has gone into the creation of this so-called social network. But seriously, what's the science behind building a tool that simply shows what a person is doing? You can't derive any intelligence from it. You can't do any social network analysis using its relationships, since these unapproved relationships mean nothing. Sure, you are able to search real-time data, which is cool, but tell me, is there any such thing as real-time wisdom? Isn't it like searching for a Shakespeare sonnet in the text typed by infinite monkeys?

I know I am missing the social aspect of it, but tell me, where is the technological innovation?

And that, it seems, is the technological trend these days: making ordinary things that reach out to a million consumers. All the technology then goes into the various aspects of handling those million consumers. Gone are the days when science used to stun humankind with its novelty. Now nobody wants to go where no one has gone before.

So what's the downside? Nothing really. Except that in that myriad of applications, little innovative pieces can get lost very easily. I personally like the idea behind Wolfram Alpha. But will they reach out to a million people the way Google did? I guess not. It's hard to find a million people in the universe who are interested in, let's say, mathematical equations. So despite the great innovation, the site will not get the attention that Twitter seems to be getting. And that's unfortunate.

It may be that the last century of brilliant scientific discoveries has set the bar so high that it's impossible for the relatively young computer sciences to clear it. Or it may be that we have again reached a point where there is nothing left to surprise the world. Or, most likely, we are truly living in an age of mediocres.

Friday, May 22, 2009

Feature Requests for BEZVision

I am sorry if this post appears cryptic to some readers. It's my wishlist for a product I am currently working on. I have already posted details of the product here.

Some of these features need more elaboration and I will dedicate separate posts to them, updating this post with the links.

A. Workload Characterization

Workload rules for app tier

This is a feature conspicuous by its absence. We cannot create workload rules for the app tier. Although we do not have UPMM in the app tier, we should still be able to create rules based on page URLs, component names etc. We do save this information in the component stats table, so it should be easy to define rules based on it.

Workload Clustering

The problem with aggregating data on the basis of UPMM is that very different requests (in terms of CPU and IO) can be grouped together. If the requests have widely different response times, their average response time will not be a true measure. So we basically need to group together requests that are close to each other in response time, CPU time and/or IO. This can be done by k-means clustering of the raw data. We will have to store a percentile field with each request and characterize on the basis of that. It's somewhat already done in the model service, where requests are distributed in percentiles.
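To give a feel for the clustering step, here is a minimal one-dimensional k-means sketch over response times. The class name `KMeans1D`, the fixed iteration count and the caller-supplied initial centroids are assumptions made to keep the example deterministic; real data would use more dimensions (CPU time, IO) and a proper initialization.

```java
// Minimal 1-D k-means (Lloyd's algorithm) for grouping requests by one metric.
final class KMeans1D {
    // Returns the centroids after the given number of assign-and-update rounds.
    static double[] cluster(double[] values, double[] initialCentroids, int iterations) {
        double[] centroids = initialCentroids.clone();
        for (int it = 0; it < iterations; it++) {
            double[] sum = new double[centroids.length];
            int[] count = new int[centroids.length];
            for (double v : values) {
                int j = nearest(centroids, v);
                sum[j] += v;
                count[j]++;
            }
            for (int j = 0; j < centroids.length; j++) {
                if (count[j] > 0) {
                    centroids[j] = sum[j] / count[j]; // empty clusters keep their centroid
                }
            }
        }
        return centroids;
    }

    // Index of the centroid closest to v; this is the workload a request would fall into.
    static int nearest(double[] centroids, double v) {
        int best = 0;
        for (int j = 1; j < centroids.length; j++) {
            if (Math.abs(v - centroids[j]) < Math.abs(v - centroids[best])) {
                best = j;
            }
        }
        return best;
    }
}
```

Requests with response times around 10 ms and around 500 ms end up in separate groups, so their averages stay meaningful instead of being blended into one misleading figure.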

Workload Recommendation

This is in part inspired by Teradata Workload Analyzer. We should be able to analyze the collected data, find patterns in it using k-means clustering and then suggest a configurable number of workload rules to the user.

More refined definition of workload rules. (Classifier function DSL)

SQL Server has a nice way of defining workload rules by creating a classifier function. We should have a similar way of defining a classifier function in a custom DSL that looks something like

if (stats.logicalReads > 1000) then place in 'High IO Workload'

else if (stats.cpuTime > 3600) then place in 'High CPU Workload'
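One possible way to back such a DSL is an ordered list of predicate and workload-name pairs, evaluated top to bottom. The sketch below is only an illustration: `RequestStats`, `WorkloadClassifier` and the fallback workload are hypothetical names, and parsing the DSL text into these rules is not shown.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical per-request statistics, mirroring the fields used in the rules above.
final class RequestStats {
    final long logicalReads;
    final long cpuTime;

    RequestStats(long logicalReads, long cpuTime) {
        this.logicalReads = logicalReads;
        this.cpuTime = cpuTime;
    }
}

// Ordered rules: the first predicate that matches decides the workload.
final class WorkloadClassifier {
    private final Map<String, Predicate<RequestStats>> rules = new LinkedHashMap<>();
    private final String fallback;

    WorkloadClassifier(String fallback) {
        this.fallback = fallback;
    }

    WorkloadClassifier rule(String workload, Predicate<RequestStats> condition) {
        rules.put(workload, condition);
        return this;
    }

    String classify(RequestStats stats) {
        for (Map.Entry<String, Predicate<RequestStats>> e : rules.entrySet()) {
            if (e.getValue().test(stats)) {
                return e.getKey();
            }
        }
        return fallback;
    }
}
```

The two rules above would then read `new WorkloadClassifier("Default").rule("High IO Workload", s -> s.logicalReads > 1000).rule("High CPU Workload", s -> s.cpuTime > 3600)`.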

Workloads based on types of requests. (INSERT, UPDATE)

This is self-explanatory. We should be able to define workloads on the basis of request types, as the demands/characteristics of select, insert, update and delete statements are all different.

Calculate Cost of each workload (a function of CPU and IO)

All RDBMSs (Oracle, SQL Server, Teradata) have an internal cost algorithm that is used to compute the cost of a query. We should have some customizable algorithm to compute the cost of requests so that we are able to characterize requests on the basis of that.

~~Workload Priority while data collection~~

Again self-explanatory. We do have a priority field in workload stats, but it's not populated during data collection (it's only set while creating plans for a prediction). We need a way to get the workload priority while collecting data from the database.

~~Importing workload rules from monitored database~~

Teradata has a workload manager, SQL Server has a resource governor and Oracle has a resource manager that allow the user to define workload rules. We need a way to import those workload rules into BEZVision.

Asynchronous workload data collection

Asynchronous workload is workload not associated with any request. We need a standard way to create asynchronous workloads during data collection and in the model service.

Workload Type (Batch, DSS, OLTP)

I wonder why we do not define the workload type, as the queuing networks for batch and OLTP workloads are different from each other.

B. Prediction Change Plans

In app tier: Implement SSL plan

Implementing SSL increases the response time of the page. This should allow the user to select an encryption algorithm and other SSL parameters and foresee the impact of implementing SSL.

Change DBMS instance type change plan

This plan is conspicuous by its absence, as we have a node type change plan and a JVM configuration change plan. We should also have a DBMS instance type change plan. This will allow the user to test various configurations beforehand. It will also make it possible to predict the impact of a DB upgrade.

Cost of Planning Scenario. Total Cost of Ownership. Relative Cost

C. Analysis & Prediction

Network data collection to use in the alignment of interconnect messages

Besides databases and JVMs, the application should also be able to analyse and predict network configurations. We should at least collect some network stats to use in the alignment of interconnect stats.

Collect and analyse memory data

Again, memory data is never collected or analysed. We do not need it in the prediction engine, but it may be useful to have some sort of analysis available for memory as well, such as that available for storage.

Data collection and Performance Analysis of .Net Framework

The Microsoft .NET Framework is the logical extension after the support for SQL Server. Currently BEZVision only supports Java application servers in the app tier. Supporting a multitier system with the .NET Framework and SQL Server would be useful. For performance data collection, the framework provides a rich set of performance counters that can be read by WMI or the typeperf command.

Availability Analysis and Prediction

We can gather several metrics by analysing collection failures, like MTTF, Probability, Up Time, Down Time. We will need a heartbeat query with a timeout for the DB, and heartbeat queries for the node (it may be running out of memory), the disk (it may be full, so check whether its free space is below a threshold), the network, etc. We can then analyse the data and predict the availability of a system at various points in time. We can also analyse the workload that caused the system (dbms/node/disk) failure. Again, this will not involve queuing theory but will be done in the way described here.

Predict usage patterns (workload/activity prediction): queries, page accesses (collective intelligence)

This is not a directly related feature, but since we are gathering so much user data, we should be able to derive some intelligence from it, e.g. predict db/app usage patterns etc.

Save and analyse query plans

Every database (Oracle, SQL Server and Teradata at least) saves the query plan associated with a query, which contains a lot of useful information. We need a way to extract and analyse that information.

D. Advice

App tier advice

We can have an app tier advice similar to (http://code.google.com/speed/page-speed/). We can also have an app tier catalog snapshot which will include the list of components (jsp pages, ejbs etc) in the app tier.

Database Configuration Advisor

We need a database configuration advisor that will suggest changes in DB configuration based on the workload. It may be extended to a general purpose advisor that runs various predictions with different sets of configurations and then advises on the basis of that (it's already done in part, as the PE generates recommendations in case of a contention point, but it needs to be generalized and automated).

E. Audits

Custom Alert Rules.
Audit Exception Advice.
Custom audits. Define a baseline for every metric. Alert for deviations. Automatic corrective actions.

F. User Interface

Improve data presentation in the UI. Gantt and Kiviat charts.