Are you ready: November 2017

Sunday, 26 November 2017

Big data query system

HDFS was born on 2006 to fix the distributed storage/computing problem and then came spark riding on cheap hardware in 2012 to fix distributed processing problem in disruptive style.

One problem space that has seen lot of innovative tools is "How to build information system" for huge data, cheap hardware has enabled to build fast analytics system by using in-Memory techniques.
Data is growing every moment and it is becoming very challenging to all analytics tools to keep-up with it, some system use Column stores, special index like Bitmap index etc to build responsive system.

One more technique called Summarization is also used in many system to solve this problem, in this blog i will discuss some ideas around Summarization and tradeoff of using this technique.

Definition of summarization as per vocabulary.com is

To summarize means to sum up the main points of something — a summarization is this kind of summing up. Elementary school book reports are big on summarization.

When you're a trial lawyer, the last part of the argument you make before the court is called a summation. It included a summarization of the points you have made in the trial so far, but it goes one step further, to push this summarization toward a conclusion you hope the judge or jury will accept.

This picture explains it, you take something big and make it small by retaining important attributes.

One more technique is Sampling that also helps in reducing size of something that is unmanageable to manageable, but it trade-offs accuracy for this and may system can tolerate some percentage of error.

Sampling is also used to to validate some tests or assumption, so sampling allows to consumer information using limited resource.

Summarization and Sampling can used together to build some fast information system.

One of the simple summary approach is build group the data into different set and each record/row should be just in one set.

Lets take a toy data set to build groups.

year	sex	Name	type_of_course
1993	Male	M1	Humanities & Social Sciences
1993	Male	M2	Mass Communication
1993	Male	M3	Accountancy
1993	Male	M4	Business & Administration
1993	Male	M5	Law
1994	Male	M6	Education
1994	Male	M7	Applied Arts
1994	Male	M8	Humanities & Social Sciences
1994	Female	F1	Education
1994	Female	F2	Applied Arts
1994	Female	F3	Humanities & Social Sciences
1994	Male	M9	Law

This data is for what type of course people study, it can split into couple of groups using year,sex,type of course.

1993,Male,Humanities & Social Science
1994,Female,Education
1994,Male,Law
1993,Male,Law

Grouping convert huge data set to small set and then sampling can be done based on these groups to pick up items from this group.

One of the way is to take random sample but it has limitation that some group will be missed and sample data will be not representative to original dataset and it will be impossible to answer many query.

We need smart sampler and stats guru have already figured it out, it is called Stratified Sampling

This sampling involves
- Create Strata, this is nothing but create groups.

year	sex	type_of_course	no_of_graduates
1994	Males	Humanities & Social Sciences	512
1994	Males	Business & Administration	413
1994	Females	Information Technology	196
1994	Females	Architecture & Building	182
1994	Females	Engineering Sciences	227

		Total	1530

- Decided sample size

Lets take 5% of sample that comes to 77 (5% of 1530)

- Compute contribution of each group

year	sex	type_of_course	no_of_graduates	Contribution
1994	Males	Humanities & Social Sciences	512	33.46405229
1994	Males	Business & Administration	413	26.99346405
1994	Females	Information Technology	196	12.81045752
1994	Females	Architecture & Building	182	11.89542484
1994	Females	Engineering Sciences	227	14.83660131

		Total	1530

- Select sample records

77 records are required and break of 77 will be based weight of individual group

year	sex	type_of_course	no_of_graduates	Contribution	Sample Size ( 77)
1994	Males	Humanities & Social Sciences	512	33.46405229	25.6
1994	Males	Business & Administration	413	26.99346405	20.65
1994	Females	Information Technology	196	12.81045752	9.8
1994	Females	Architecture & Building	182	11.89542484	9.1
1994	Females	Engineering Sciences	227	14.83660131	11.35

Number of records selected are based on weight of group and this gives data from each group.

By this approach 1530 records got reduced to 77 , lets try to answer some of query using this sample data, we also try to estimate what it will look at 100% and calculate error.

100 % estimate = 1% * 100

1% = Sample Answer/Sample %

Question	Actual Answer	5% Sample	100% ( Estimated)	Error
No of people graduated in 1994	1530	77	1540	0.006535947712
Males graduated in 1994	925	46	920	0.005405405405
All Females	605	30	600	0.00826446281
Engineering Sciences student	227	11	220	0.03083700441

Look at the error column, worst answer has 3% error and with this small error query can be answer in milli second vs seconds.

I used singapore open data graduates-from-university-first-degree-courses-by-type-of-course to build app that answer query based on sampling.

Code is available @ github

Machine Learning 101

Data science is trending these days, everybody wants to become data scientist because it is cutting edge and has high paying jobs.

I am also starting to learn this coooool stuff and best way to learn is write some code and get the feel of it.
I will start blog post series how to approach machine learning problem without getting lost in jargons and complex mathematical formulas.

Some of problem suited for ML are
- Email classification ( Spam Vs Ham)
- Books/Hotel/Food etc review classification
- Sentiment analysis of brand or product.

In each of the above problem output is some kind of label(spam,ham, positive, negative etc)
These type of problem are classification problem, lets look at diagram to understand this.

Email is problem instance.
Classifier is the process that know rules to classify
Spam/Ham is the output. If there is only 2 output then it is called binary classifier.

Lets zoom in "Classifier", it is rule engine that know all the business rules to do the work.
Rules can be static or dynamic, in some case static rules are fine but for many problem rules keeps on changing and it becomes overhead to keep rules up to date.

This is where ML comes in picture, system learns new rules by it self.

How does system learns ?
You have to train the system and after that training system identifies the patterns, this type of ML is called Supervised learning

Each Supervised learning has Training phase before it starts making predication.

Think training phase as practice, it contains both questions and answers.

Once training phase is done then classifier is ready for use on new questions without answer.

There is one more phase called "Test", it is used to check how good classifier is, if it is not good then it means training data was not good.

Now lets try to do some sentiment analysis using imdb reviews

Review	Type
Wasted two hours	Negative
A bit predictable	Negative
The story itself is just predictable and lazy	Negative
Buy it, play it, enjoy it, love it	Positive
It deserves strong love	Positive

In learning phase model takes above data and works out pattern for positive and negative comments.

You must be thinking how model knows about pattern.

One way to look at the problem is using probability of word being in positive comment vs negative comment and then take the one that is greater.

Lets look at example to understand this.

"Wasted" and "predictable" only appears only in -ve, so if any comment that contains these word they have high chance of being -ve comment.

"Love" comes only in the +ve .

So algorithm is very simple.

- Split sentence in words

- POS Score = (Compute +ve score of words & multiply them)

- NEG Score = (Compute -ve score of words & multiply them)

- Result = Max(Pos Score,Neg Score)

Before scores are compared we have to multiply with overall probability of positive or negative comments.

In our toy example

Total Reviews	5
Pos Comment	2
Neg Comment	3
Pos Probability( 2/5)	0.4
Neg Probability( 3/5)	0.6

POS Score	Pos(words in sentence) * Overall Pos(i.e 0.4)
Neg Score	Neg(words in sentence) * Overall Neg(i.e 0.6)

You don't have to code this algorithm, it is already implemented in Naive Bayes classifier.

Although this classifier has 'Naive' in name but it is not really naive, it gives very good result with around close to 80% accuracy and it is used in many production system.

All language/framework like python,spark,R has implementation for this classifier.

I have implemented this classifier in python/spark using Sentiment Labelled Sentences data from UCI ML repo.

This sample data contains reviews from imdb,yelp & amazon.

Code is available @ github

Python Classifier

Spark Classifier