Are you ready: 2017

Sunday, 3 December 2017

Pragmatic Programmer - Boiled frog.

Pragmatic Programmer book has lots of solid advice and it was published 18(1999) years ago and all that advice is still relevant.

It has one section on Stone Soup & Boiled frogs.

Boiled frogs

This theory is not tested but it is said that if you take frog and drop it into boiling water, it will jump out of it but if you place frog in pan of clod water, then gradually heat it, the frog won't notice and slow increase in temperature will be not noticed by frog and it will be cooked.

Stone Soup

3 soldiers returning home after war were hungry. When they saw the village ahead their spirits lifted they were sure that people will give them food. But doors were locked and windows were closed. After many years of war, the villagers were short on food and hoarded what they had.

The soldiers boiled a pot of water and carefully placed three stones into it. The amazed villager came out to know more about it.

Soldiers explained that this is "Stone Soup" and went to to add "although some say that it tastes even better with few carrots"

Villagers ran off and got basket of carrots and asked "is that ok ?"

"Well" said the soldiers "a couple of potatoes give it body"

Another villager ran and got some potatoes and each of villager slowly taking stuff out and eventually they had produced large pot of steaming soup.

Soldiers removed stone and sat with villager and everybody enjoyed meal.

What do we learn ?

Stone soup tells that you need act as catalyst to bring the people together so that they can jointly produced something that could not have done themselves and eventually every one wins.

Does this sound like issues that we face every day to get work done in team ?

This mindset is called "start-up fatigue"

If you are stuck in getting you idea then take out stones, build something small but well enough to say "it would be better if we add..........." and pretend it is not important.

Sit back and wait for them to start asking for the stuff you originally wanted.

People find it easy to join to the ongoing success :-)

Lets look at frog story.

Don't be frog, keep eye on the changes happening around you , in your team, company, industry.

Constantly review what is happening and take some action.

Don't become frog, tech world is changing fast, new skills are required, new approach is required , new tools, lot of un-learning is required to learn something new.

Frog will not noticed change, if you don't want to frog then learn to embrace the change.

Moral of story is

Be a catalyst for change
Remember the Big Picture

Read Pragmatic-Programmer-Journeyman-Master book if you have not read it.

Sunday, 26 November 2017

Big data query system

HDFS was born on 2006 to fix the distributed storage/computing problem and then came spark riding on cheap hardware in 2012 to fix distributed processing problem in disruptive style.

One problem space that has seen lot of innovative tools is "How to build information system" for huge data, cheap hardware has enabled to build fast analytics system by using in-Memory techniques.
Data is growing every moment and it is becoming very challenging to all analytics tools to keep-up with it, some system use Column stores, special index like Bitmap index etc to build responsive system.

One more technique called Summarization is also used in many system to solve this problem, in this blog i will discuss some ideas around Summarization and tradeoff of using this technique.

Definition of summarization as per vocabulary.com is

To summarize means to sum up the main points of something — a summarization is this kind of summing up. Elementary school book reports are big on summarization.

When you're a trial lawyer, the last part of the argument you make before the court is called a summation. It included a summarization of the points you have made in the trial so far, but it goes one step further, to push this summarization toward a conclusion you hope the judge or jury will accept.

This picture explains it, you take something big and make it small by retaining important attributes.

One more technique is Sampling that also helps in reducing size of something that is unmanageable to manageable, but it trade-offs accuracy for this and may system can tolerate some percentage of error.

Sampling is also used to to validate some tests or assumption, so sampling allows to consumer information using limited resource.

Summarization and Sampling can used together to build some fast information system.

One of the simple summary approach is build group the data into different set and each record/row should be just in one set.

Lets take a toy data set to build groups.

year	sex	Name	type_of_course
1993	Male	M1	Humanities & Social Sciences
1993	Male	M2	Mass Communication
1993	Male	M3	Accountancy
1993	Male	M4	Business & Administration
1993	Male	M5	Law
1994	Male	M6	Education
1994	Male	M7	Applied Arts
1994	Male	M8	Humanities & Social Sciences
1994	Female	F1	Education
1994	Female	F2	Applied Arts
1994	Female	F3	Humanities & Social Sciences
1994	Male	M9	Law

This data is for what type of course people study, it can split into couple of groups using year,sex,type of course.

1993,Male,Humanities & Social Science
1994,Female,Education
1994,Male,Law
1993,Male,Law

Grouping convert huge data set to small set and then sampling can be done based on these groups to pick up items from this group.

One of the way is to take random sample but it has limitation that some group will be missed and sample data will be not representative to original dataset and it will be impossible to answer many query.

We need smart sampler and stats guru have already figured it out, it is called Stratified Sampling

This sampling involves
- Create Strata, this is nothing but create groups.

year	sex	type_of_course	no_of_graduates
1994	Males	Humanities & Social Sciences	512
1994	Males	Business & Administration	413
1994	Females	Information Technology	196
1994	Females	Architecture & Building	182
1994	Females	Engineering Sciences	227

		Total	1530

- Decided sample size

Lets take 5% of sample that comes to 77 (5% of 1530)

- Compute contribution of each group

year	sex	type_of_course	no_of_graduates	Contribution
1994	Males	Humanities & Social Sciences	512	33.46405229
1994	Males	Business & Administration	413	26.99346405
1994	Females	Information Technology	196	12.81045752
1994	Females	Architecture & Building	182	11.89542484
1994	Females	Engineering Sciences	227	14.83660131

		Total	1530

- Select sample records

77 records are required and break of 77 will be based weight of individual group

year	sex	type_of_course	no_of_graduates	Contribution	Sample Size ( 77)
1994	Males	Humanities & Social Sciences	512	33.46405229	25.6
1994	Males	Business & Administration	413	26.99346405	20.65
1994	Females	Information Technology	196	12.81045752	9.8
1994	Females	Architecture & Building	182	11.89542484	9.1
1994	Females	Engineering Sciences	227	14.83660131	11.35

Number of records selected are based on weight of group and this gives data from each group.

By this approach 1530 records got reduced to 77 , lets try to answer some of query using this sample data, we also try to estimate what it will look at 100% and calculate error.

100 % estimate = 1% * 100

1% = Sample Answer/Sample %

Question	Actual Answer	5% Sample	100% ( Estimated)	Error
No of people graduated in 1994	1530	77	1540	0.006535947712
Males graduated in 1994	925	46	920	0.005405405405
All Females	605	30	600	0.00826446281
Engineering Sciences student	227	11	220	0.03083700441

Look at the error column, worst answer has 3% error and with this small error query can be answer in milli second vs seconds.

I used singapore open data graduates-from-university-first-degree-courses-by-type-of-course to build app that answer query based on sampling.

Code is available @ github

Machine Learning 101

Data science is trending these days, everybody wants to become data scientist because it is cutting edge and has high paying jobs.

I am also starting to learn this coooool stuff and best way to learn is write some code and get the feel of it.
I will start blog post series how to approach machine learning problem without getting lost in jargons and complex mathematical formulas.

Some of problem suited for ML are
- Email classification ( Spam Vs Ham)
- Books/Hotel/Food etc review classification
- Sentiment analysis of brand or product.

In each of the above problem output is some kind of label(spam,ham, positive, negative etc)
These type of problem are classification problem, lets look at diagram to understand this.

Email is problem instance.
Classifier is the process that know rules to classify
Spam/Ham is the output. If there is only 2 output then it is called binary classifier.

Lets zoom in "Classifier", it is rule engine that know all the business rules to do the work.
Rules can be static or dynamic, in some case static rules are fine but for many problem rules keeps on changing and it becomes overhead to keep rules up to date.

This is where ML comes in picture, system learns new rules by it self.

How does system learns ?
You have to train the system and after that training system identifies the patterns, this type of ML is called Supervised learning

Each Supervised learning has Training phase before it starts making predication.

Think training phase as practice, it contains both questions and answers.

Once training phase is done then classifier is ready for use on new questions without answer.

There is one more phase called "Test", it is used to check how good classifier is, if it is not good then it means training data was not good.

Now lets try to do some sentiment analysis using imdb reviews

Review	Type
Wasted two hours	Negative
A bit predictable	Negative
The story itself is just predictable and lazy	Negative
Buy it, play it, enjoy it, love it	Positive
It deserves strong love	Positive

In learning phase model takes above data and works out pattern for positive and negative comments.

You must be thinking how model knows about pattern.

One way to look at the problem is using probability of word being in positive comment vs negative comment and then take the one that is greater.

Lets look at example to understand this.

"Wasted" and "predictable" only appears only in -ve, so if any comment that contains these word they have high chance of being -ve comment.

"Love" comes only in the +ve .

So algorithm is very simple.

- Split sentence in words

- POS Score = (Compute +ve score of words & multiply them)

- NEG Score = (Compute -ve score of words & multiply them)

- Result = Max(Pos Score,Neg Score)

Before scores are compared we have to multiply with overall probability of positive or negative comments.

In our toy example

Total Reviews	5
Pos Comment	2
Neg Comment	3
Pos Probability( 2/5)	0.4
Neg Probability( 3/5)	0.6

POS Score	Pos(words in sentence) * Overall Pos(i.e 0.4)
Neg Score	Neg(words in sentence) * Overall Neg(i.e 0.6)

You don't have to code this algorithm, it is already implemented in Naive Bayes classifier.

Although this classifier has 'Naive' in name but it is not really naive, it gives very good result with around close to 80% accuracy and it is used in many production system.

All language/framework like python,spark,R has implementation for this classifier.

I have implemented this classifier in python/spark using Sentiment Labelled Sentences data from UCI ML repo.

This sample data contains reviews from imdb,yelp & amazon.

Code is available @ github

Python Classifier

Spark Classifier

Wednesday, 25 October 2017

Data structure for big data

Today crunching Terabyte of data is very common and it brings up interesting challenges.

Memory is getting cheaper and it is enabling application to keep all the data in memory and process it but some data set are very huge and does not fit in single machine, incase it fits in memory but latency of processing will be not a acceptable.

Lets take a example to understand this

You are building web analytics product that collects all the clicks/page load etc events, each event object will have some of these attributes.

Volume generated by such type of tracking is huge and to ask any interesting question on such data set is resource intensive.

Lets look at some of the resource intensive questions

- How many distinct ip address are present
- How many distinct visitor Id
- Top 10 Pages
- Top X Visitor
- How many times Top X pages are requested
- Any page requested by blacklisted IP or user

Simple way to solve distinct query is to build set and count the items in set

Building set is going to take memory proportional to number of distinct element and set has some load factor to make sure read/write operation are constant time. So memory wise it is super expensive.

Key insight to solve the problem is how accurate distinct count should be , it is OK to have some error(i.e 5 to 10%) and if answer is yes then some magic can done.

Since this problem is about just the count , so we don't have to maintain real value and that will give huge saving, algorithm needs some way to track whether current value is seen or not.

at high level algorithm is
- Compute the hash
- Find the index in bit vector
- Mark the vector if is not set and increment the distinct counter.

It is very simple algorithm and does the job, it is called Linear Counting.
This works well but it has some limitation that you need to workout size which controls accuracy to get best result it should be close to number of distinct values you expect to see.

There is another class of algorithm called LogLog/Hyper LogLog which more powerful than Linear Counting.

Hash value is represented as binary value and then it is split in sub-string
One part is used to extract which bucket value should go
Next part is used to calculating value and it is done by counting no of leading zeros.

By using this technique each entry is put in different buckets for eg values starting with 01- Goes in X, values starting 001 goes in Y bucket.

To compute the final value all the buckets values are taken and it is averaged by applying some factor.

By using this technique billion elements can be counted by using very less memory.

HyperLogLog paper contains all the details of algorithm.

Linear Counting or Hyper Log can be used to answer below questions
- How many distinct ip address are present
- How many distinct visitor Id
- Distinct visitor by region, time etc
- Trending words like google trends
- Near similar document matching to remove duplicate document, it is used by search engine.

This data structure can be used for quick anomaly/fraud detection.
This also can be used for Nested_loop_join , spark is using it in ML lib for MatrixFactorizationModel and few more places to count keys in RDD.

One of the nice properties of these data structure is merge operation.
for e.g. one hyper log can be maintained for unique visitor for each day and they can merged to get value for any time range.

Lets look at some memory usage numbers, it is computed using stream-lib
In this benchmark i used all the words from HarryPotterseries book.

Actual Count	11983

Type	Memory ( KB)	Count	Error %
Set	431	11983	0
LinearCounting	31	12167	1.5
HyperLog	14	12253	2.2

In less than 1% of memory with 2 % of error this data structure is able to do cool thing !

Another interesting data structure for membership query is bloom filter, i wrote about in mr-cool-bloom-filter blog post.

Code used for bench-marking is available @ DistinctCountingApp.java

Saturday, 16 September 2017

Unix design patterns and philosophy

Design patterns got very famous after release of Design-Patterns-Elements-Reusable-Object-Oriented book and it became one of the advance topic of software engineering.

It become one of the must book to have it on desk, it also gave common vocabulary to talk about design. Design pattern idea was extended to capture solution to recurring problem.

Linda Rising took same idea and wrote book on Fearless Change Patterns Introducing Ideas , she wrote part2 of book More Fearless Change and i am sure lot more books will come in future which will present patterns idea in different shapes & forms.

In this blog i will share design patterns used in UNIX.

All the patterns are 30+ year old but are so much relevant in microservices time.

Lets start with famous quote from "C. A. R. Hoare"

"There are two ways of constructing a software design. One is to make it so simple that there are obviously no deficiencies; the other is to make it so complicated that there are no obvious deficiencies. The first method is far more difficult."

The Emperor’s Old Clothes, CACM February 1981
—C. A. R. Hoare

Bottom up design

First design philosophy is unix is build using bottom up approach, it is build from lego blocks, built small pieces and then compose them to create something more useful. Functional programming is based on this principal !

Make each program do one thing well

This is such a simple rule but not used enough, developer loves complexity and this rule is violated everyday and results in software that does too many things but many features are not implemented properly.

Some of the example from UNIX of one thing well are grep, diff , patch , Yacc.
These programs are bugfree and so much that functionality is taken for graunted.

Design programs to be connected with other programs(Composition)

This is the rule that invented PIPE (i.e | ) , if our program are build using this rule then lot of time/money spent on integration can be saved.

In Unix world binary input/output is avoided and this makes integration so easy.
Unix program separate interface from logic, interface is "Text" stream and produces "Text" stream.

Each program is independent and they can be composed together because they follow simple rule of "text" interface.

As a developer we like to do premature optimization and start with binary input/output, use binary format only when bottleneck is proved.

Build software that can be tried early may be in in few days or 1 week.

This philosophy is also applicable for complex software like Operating System/ Device Driver etc, today we have to take help of Agile to build software that can be tried in 2 or 3 weeks.

Unix guys were doing Agile in 1970.

Avoid Fancy algorithm

Fany algorithm are complex to implement and have bugs and gives dividend only when N is above some threshold, by using simple algo and data structure lot of complex problem can be solved.

Data dominates

Spend more time on building data structure that will organize data properly.

Write simple parts connected by clean interface

This is the most difficult on to get right, we have seen different programming paradigm starting form assembler , structural , object oriented , functional etc but all of them have failed, as soon as program turns from POC to real it can't fit in head of human brain.

Clarity over Cleverness

I am sure every developer will have some war story to tell about debugging Clever program.

Rule of Transparency

This is one the things that is thought on last day of development in-spite of knowing that this rule is to make inspection and debugging of program easier.

Rule of Repair

When you fail, fail noisily and as soon as possible.

Single Point of Truth(SPOT) rule

This rules says that every piece of knowledge must have single, unambiguous representation in system. Repetition leads to inconsistency so use Constants, tables, metadata, code generator . Complicity is cost , don't pay it twice.

All the philosophy boils down to

Monday, 24 July 2017

Java Intrinsic magic

I got question on atomicinteger-java-7-vs-java-8 blog about implementation of unsafe.getAndAddInt function

It looks like JDK8 is still using CAS version.
If you just look at the code in IDE then you are correct , but JVM is intelligent hotspot compiler.

Code goes through various optimization before it is executed, optimization list is huge and it will required blog series to just touch on it.

In this blog i will discuss about Intrinsic function.

As per WikiPedia

In compiler theory, an intrinsic function is a function available for use in a given programming language whose implementation is handled specially by the compiler. Typically, it substitutes a sequence of automatically generated instructions for the original function call, similar to an inline function. Unlike an inline function though, the compiler has an intimate knowledge of the intrinsic function and can therefore better integrate it and optimize it for the situation. This is also called builtin function in many languages.
Compilers that implement intrinsic functions generally enable them only when the user has requested optimization, falling back to a default implementation provided by the language runtime environment otherwise.

So in other words, some function are just like native function, it is different from native function written by us.

Compiler makes decision to use native/special function if is available other wise if will fallback to default implementation.

Unsafe.getAndAddInt is special function that gets advantage of optimization, but normal code is
still required for case when optimization is not requested or not available

To verify this we have to look at Assembly code generated.

how-to-print-dissasembly-from-jit-code article has all the details you need for this.

You can also download dll/lib for windows/linux from github
For below code snippet, optimization kicks in

Assembly code for above code looks like below ( look at line number 7, it is xadd)

Java maintains list of all intrinsic functions in code @ vmSymbols.hpp

Hotspot compiler is full of magic:-)