Wednesday, 25 October 2017

Data structure for big data

Today crunching Terabyte of data is very common and it brings up interesting challenges.

Memory is getting cheaper and it is enabling application to keep all the data in memory and process it but some data set are very huge and does not fit in single machine, incase it fits in memory but latency of processing will be not a acceptable.

Lets take a example to understand this

You are building web analytics product that collects all the clicks/page load etc events, each event object will have some of these attributes.

 Volume generated by such type of tracking is huge and to ask any interesting question on such data set is resource intensive.   

Lets look at some of the resource intensive questions 

- How many distinct ip address are present
- How many distinct visitor Id
- Top 10 Pages
- Top X Visitor
- How many times Top X pages are requested
- Any page requested by blacklisted IP or user 

Simple way to solve distinct query is to build set and count the items in set
Building set is going to take memory proportional to number of distinct element and set has some load factor to make sure read/write operation are constant time. So memory wise it is super expensive.

Key insight to solve the problem is how accurate distinct count should be , it is OK to have some error(i.e 5 to 10%) and if answer is yes then some magic can done.

Since this problem is about just the count , so we don't have to maintain real value and that will give huge saving, algorithm needs some way to track  whether current value is seen or not.

at high level algorithm is 
 - Compute the hash
 - Find the index in bit vector
- Mark the vector if is not set and increment the distinct counter.

It is very simple algorithm and does the job, it is called Linear Counting.
This works well but it has some limitation that you need to workout size which controls accuracy to get best result it should be close to number of distinct values you expect to see.

There is another class of algorithm called LogLog/Hyper LogLog which more powerful than Linear Counting.

Hash value is represented as binary value and then it is split in sub-string
One part is used to extract which bucket value should go
Next part is used to calculating value and it is done by counting no of leading zeros.

By using this technique each entry is put in different buckets for eg values starting with 01- Goes in X, values starting 001 goes in Y bucket.

To compute the final value all the buckets values are taken and it is averaged by applying some factor.

By using this technique billion elements can be counted by using very less memory.

HyperLogLog paper contains all the details of algorithm.
Linear Counting or Hyper Log can be used to answer below questions
- How many distinct ip address are present
- How many distinct visitor Id
- Distinct visitor by region, time etc
- Trending words like google trends
- Near similar document matching to remove duplicate document, it is used by search engine.

This data structure can be used for quick anomaly/fraud detection.
This also can be used for Nested_loop_join , spark is using it in ML lib for MatrixFactorizationModel and few more places to count keys in RDD.    

One of the nice properties of these data structure is merge operation.
for e.g. one hyper log can be maintained for unique visitor for each day and they can merged to get value for any time range.

Lets look at some memory usage numbers, it is computed using stream-lib
In this benchmark i used all the words from HarryPotterseries book.

Actual Count 11983

Type Memory ( KB) Count Error %
Set 431 11983 0
LinearCounting 31 12167 1.5
HyperLog 14 12253 2.2

In less than 1% of memory with 2 % of error this data structure is able to do cool thing !

Another interesting data structure for membership query is bloom filter, i wrote about in mr-cool-bloom-filter blog post.

Code used for bench-marking is available @

Saturday, 16 September 2017

Unix design patterns and philosophy

Design patterns got very famous after release of Design-Patterns-Elements-Reusable-Object-Oriented book and it became one of the advance topic of software engineering.

It become one of the must book to have it on desk, it also gave common vocabulary to talk about design. Design pattern idea was extended to capture solution to recurring problem. 

Linda Rising took same idea and wrote book on Fearless Change Patterns Introducing Ideas , she wrote part2 of book More Fearless Change and i am sure lot more books will come in future which will present patterns idea in different shapes & forms.

In this blog i will share design patterns used in UNIX.
All the patterns are 30+ year old but are so much relevant in microservices time.

Lets start with famous quote from  "C. A. R. Hoare"

"There are two ways of constructing a software design. One is to make it so simple that there are obviously no deficiencies; the other is to make it so complicated that there are no obvious deficiencies. The first method is far more difficult."

The Emperor’s Old Clothes, CACM February 1981
—C. A. R. Hoare

Bottom up design

First design philosophy is unix is build using bottom up approach, it is build from lego blocks, built small pieces and then compose them to create something more useful. Functional programming is based on this principal !

Make each program do one thing well 

    This is such a simple rule but not used enough, developer loves complexity and this rule is violated everyday and results in software that does too many things but many features are not implemented properly.

Some of the example from UNIX of one thing well are grep, diff , patch , Yacc.
These programs are bugfree and so much that functionality is taken for graunted.

Design programs to be connected with other programs(Composition)

This is the rule that invented PIPE (i.e | ) , if our program are build using this rule then lot of time/money spent on integration can be saved.
In Unix world binary input/output is avoided and this makes integration so easy.
Unix program separate interface from logic, interface is "Text" stream and produces "Text" stream.

Each program is independent and they can be composed together because they follow simple rule of "text" interface.
As a developer we like to do premature optimization and start with binary input/output, use binary format only when bottleneck is proved.


Build software that can be tried early may be in in few days or 1 week. 

This philosophy is also applicable for complex software like Operating System/ Device Driver etc, today we have to take help of Agile to build software that can be tried in 2 or 3 weeks.
Unix guys were doing Agile in 1970.

Avoid Fancy algorithm

Fany algorithm are complex to implement and have bugs and gives dividend only when N is above some threshold, by using simple algo and data structure lot of complex problem can be solved.

Data dominates 
Spend more time on building data structure that will organize data properly.

Write simple parts connected by clean interface 
This is the most difficult on to get right, we have seen different programming paradigm starting form assembler , structural , object oriented , functional etc but all of them have failed, as soon as program turns from POC to real it can't fit in head of human brain.

Clarity over Cleverness 

I am sure every developer will have some war story to tell about debugging Clever program.

Rule of Transparency
This is one the things that is thought  on last day of development in-spite of knowing that this rule is to make inspection and debugging of program easier.

Rule of Repair

When you fail, fail noisily and as soon as possible.

Single Point of Truth(SPOT) rule

This rules says that every piece of knowledge must have single, unambiguous representation in system. Repetition leads to inconsistency so use Constants, tables, metadata, code generator . Complicity is cost , don't pay it twice.  

All the philosophy boils down to

Monday, 24 July 2017

Java Intrinsic magic

I got question on atomicinteger-java-7-vs-java-8 blog about implementation of unsafe.getAndAddInt function

It looks like JDK8 is still using CAS version.
If you just look at the code in IDE then you are correct , but JVM is intelligent hotspot compiler.

Code goes through various optimization before it is executed, optimization list is huge and it will required blog series to just touch on it.

In this blog i will discuss about Intrinsic function

As per WikiPedia 

In compiler theory, an intrinsic function is a function available for use in a given programming language whose implementation is handled specially by the compiler. Typically, it substitutes a sequence of automatically generated instructions for the original function call, similar to an inline function. Unlike an inline function though, the compiler has an intimate knowledge of the intrinsic function and can therefore better integrate it and optimize it for the situation. This is also called builtin function in many languages.
Compilers that implement intrinsic functions generally enable them only when the user has requested optimization, falling back to a default implementation provided by the language runtime environment otherwise.

So in other words, some function are just like native function, it is different from native function written by us.

Compiler makes decision to use native/special function if is available other wise if will fallback to default implementation.

Unsafe.getAndAddInt is special function that gets advantage of optimization, but normal code is
still required for case when optimization is not requested or not available

To verify this we have to look at Assembly code generated.

how-to-print-dissasembly-from-jit-code article has all the details you need for this.

You can also download dll/lib for windows/linux from github

For below code snippet, optimization kicks in

Assembly code for above code looks like below ( look at line number 7, it is xadd)

Java maintains list of all intrinsic functions in code @ vmSymbols.hpp

Hotspot compiler is full of magic:-)

Tuesday, 27 June 2017

Programmer Oath

Uncle Bob is very famous for his ideas on Clean Code. He is very hardliner on Test driven development.

I am big follower of him and have read few books Clean Code , Clean Coder  written by him, i have become better programmer after reading his books.

I read Clean Coder few years back and felt that this book is must read for every professional software developer.

Uncle bob recently gave a talk The Scribe's Oath , that summaries the whole book in 1 hour.

In this talk he sets nice context before getting to real stuff. He talks about couple of Oaths for programmer.

Not writing harmful code 
He talks about harm software developer does by releasing defect , makes code harder for other programmer to understand. Software should be easy to change and anything that makes hard to change is harmful code.

Code that i will produce will be my best work
This is something that you are proud of.

Each release with quick , sure & repeatable proof that every element of code works as it is supposed to.
This is all the TDD, BDD, CI etc thing comes in. It is inappropriate to accept some level of defect.

Frequent small release and not impede progress
Holding code to yourself and not let anybody seeing it, holding on to big commit are few things that will impede progress.  Small releases are better.

Fearlessly and relentlessly improve the code at every opportunity and never make it worse.
Almost every software developer is slowed down by bad code , but we still continue to write bad code for some reason. We allow bad code to rot in absence of automated Unit test. This is where The Boy Scout Rule is useful . Automated test are really useful to keep this promise.

Keep my and my team productivity high.
Don't hold on the big commit, don't create test that takes hours to run.

Continuously ensure that others cover for me and i cover for them.
This one is interesting, he talks about working like sports team and handling situation when one of the team member is injured or down.
Pair Programming can be good tool to achieve this.

Produce estimate that are honest both in magnitude and precision, no promise without certainty
This is tricky one, most honest estimate is "I DON'T KNOW" but this does not work in real world and we fall in estimate trap and make false promise. He recommend to draw curve around uncertainty.
You manager will come to you and say that we need this in next 2 days , can you try ?
Your answer to this should be i am already trying and not holding back any extra capacity. You should say no, when it is not possible .

Never stop learning and improving my craft.
You have to keep on leaning new language, framework, library otherwise you will fall behind and have to become manager :-)
Links to some of Uncle bob talks that i found good

Effective Estimation (or: How not to Lie)
The Principles of Clean Architecture
What is OO really
SOLID Principles of Object Oriented and Agile Design