Friday, 11 December 2015

Mr Cool - Bloom Filter


Hash-based data structures are the basis of many algorithms; they are used for fast lookups, membership queries, group-by operations, etc.

Hash codes are also used to build some interesting classes of data structures. In this blog I will share one such data structure that has many useful real-world applications.

Take an example: you have millions or billions of items and you want to keep track of the distinct ones.
This is a very simple problem and can be solved using a HashMap, but that requires a huge amount of memory, proportional to the number of distinct items, because the actual elements also have to be stored.

To make this problem a little more interesting we can put a memory constraint on the solution, so effectively we need a space-efficient way to test membership of an item.

Nothing comes for free, so with less memory we have to trade off some accuracy, and in some cases that is fine, e.g. unique visitors on a website, distinct pages visited, etc.

The first intuition is to allocate N buckets and mark a bucket based on the hash code; there is no need to store the actual value, just mark the slot index.

A quick question that comes to mind is what happens in case of a collision. In that case our result will be wrong, but only by some percentage, and there are a couple of ways to improve the result:

 - Allocate more buckets but use a single hash function.
 - Use multiple hash functions, each with its own dedicated bucket.

The multiple-hash-function option works very well; with it the result can be around 96% correct!

So if we use multiple hash functions then we end up with the data structure described next.

This data structure is called a Bloom filter; it maintains X bit vectors and applies X hash functions to the input value to mark bits.

To check whether a value already exists, just check whether the corresponding bits are set, using the same hash functions.

In terms of memory, a bit vector of length X is required, and to get a more accurate result multiple hash-based bit vectors can be used. Some quick math gives a fair idea of the memory requirement.
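
As a rough illustration, here is a minimal Bloom filter sketch. It is not the implementation linked at the end of this post; the class name, the single shared bit vector and the hash mixing are all illustrative assumptions.

import java.util.BitSet;

// Minimal Bloom filter sketch: one shared bit vector, k hash functions derived
// from the object's hashCode plus a per-function seed.
public class SimpleBloomFilter<T> {
    private final BitSet bits;
    private final int size;
    private final int hashCount;

    public SimpleBloomFilter(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    // Mark the bits for all k hash functions.
    public void add(T value) {
        for (int i = 0; i < hashCount; i++) {
            bits.set(indexFor(value, i));
        }
    }

    // May return a false positive, never a false negative.
    public boolean mightContain(T value) {
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(indexFor(value, i))) {
                return false;
            }
        }
        return true;
    }

    private int indexFor(T value, int seed) {
        int h = value.hashCode() * 31 + seed * 0x9E3779B9; // simple seeded mix
        return (h & 0x7FFFFFFF) % size;
    }
}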


A Bloom filter is very compact in terms of memory, and its size can be controlled depending on the requirement.
Memory can be further reduced by using a compressed bit vector.

Trade Off
It is important to know the trade-offs made to achieve this efficiency.

 - It is a probabilistic data structure, which means results are not 100% correct, but the interesting thing about it is that it can give FALSE POSITIVES but never FALSE NEGATIVES. This makes it a good fit for many practical applications.

 - Multiple hash functions are used, so read/write performance depends on the cost of the hash functions.

Application
- DB Query filter
One of the most common use cases; useful for avoiding queries to the DB that are bound to return nothing.

- Joins
Yes, a Bloom filter provides an alternative way to do joins, especially in a distributed system. Build a Bloom filter on one of the datasets (i.e. the smaller one) and send it to the other nodes for joining; since it is very compact in memory, transporting it to a remote system does not add much overhead.
Spark, one of the disruptive big data processing systems, uses it for joins.

- Alternate to B-Tree
A B-Tree can also be used for membership queries, but if the B-Tree becomes large enough that it can't fit in memory, then all the benefit is lost.
Another big data system, Cassandra, uses it for read requests by keeping a Bloom-filter-type data structure on top of data blocks to avoid I/O operations.

Another open source project in the big data space, Parquet, uses it for fast filtering.

- Partition Search
This is just another variation of the above use case. Apache Druid (druid.io), which is again in the big data space for fast analytics, partitions data by time on ingestion, and within each partition a column has a Bloom-filter-type index for fast filtering.

- Safe Website filter
This is an interesting one: Google Chrome uses a Bloom filter to decide whether a site is safe or not.


A simple implementation of a Bloom filter is available @ github.
In that implementation each hash function has an independent bit vector, but a single common vector can be used for all the hash functions.

Monday, 5 October 2015

Soft Skills - How to teach yourself

I have started reading the book Soft Skills: The Software Developer's Life Manual and liked it; it has a lot of valuable information that can help with better management/planning of a software development career.

All the chapters are well written, but I liked the chapter "How to teach yourself" so much that I wanted to write a blog about it and keep it for my future reference.

This chapter starts with the typical learning strategy, and I have to confess that I also use this simple and not so effective technique:

- Pick a book
- Read it cover to cover and practice examples/samples given in it.

The problem with this approach is that it takes time, and we might focus on a lot of things that are not actually required up front when we start working.

Sounds so much like the waterfall method :-(

John Sonmez shares his approach and it is worth reading. His 10-step process is a pragmatic approach to learning.

- Step 01 - Get the big picture
This is about getting a 50,000-foot overview of the topic that you want to learn; it helps you know how big the topic is and what sub-topics sit under it.
This can be done with internet searches.

- Step 02 - Determine Scope
Once you know how big/complex the topic is, decide the scope of your learning.

- Step 03 - Define Success
Since you know your scope, you must define success criteria; this helps you have a clear picture in mind of what it will look like when you are done.

- Step 04 - Find Resources
This is pretty simple, given that we can get tons of information from the internet: find all the resources that will help, like blogs/books/videos/podcasts/training material, etc.
You don't have to get hung up on the quality of the resources; they will be filtered later.

- Step 05 - Create Learning Path
This is a very important part, because there is a natural progression to learning and the learning path should follow it; some examples can be taken from how books split topics based on complexity.

- Step 06 - Filter Resources
By this time we know what we are going to learn and in what order, so filter the resources you have collected to fit your need.

- Step 07 - Learn enough to get started
This is one of the steps in which most of us get too involved and lose track. You have to avoid the temptation of going too deep at this point; learn only enough that you can get started and explore.

- Step 08 - Play around
This step is about learning by trying and experimenting, and then referring to the reference material to get answers to your questions.

- Step 09 - Learn enough to do something useful
This step is for doing a deep dive and learning as much as you want. Keep an eye on the success criteria while you are learning, so that you stay on the right track.

- Step 10 - Teach

The book mentions the quote below for this step:

Tell me and I forget. Teach me and I remember. Involve
me and I learn.
—Benjamin Franklin

I highly recommend this book; it is worth spending time on it to learn things that will help in building a successful career.

Thursday, 1 October 2015

Lazy Array Initialization

Lazy object initialization is an important strategy to avoid upfront allocation, and it can become a little tricky when multiple threads are involved.

Some languages like Scala have language-level support for lazy variables and the compiler generates thread-safe code for you, but in Java some coding is required to achieve this.

ConcurrentHashMap in JDK7 allocates its segment objects upfront, which adds to the memory footprint of an empty collection, but in JDK8 a lazy initialization strategy is used to save some memory for empty collections.

So what code is used to do the allocation in a thread-safe way? It is simple, but it has some fine details that are worth looking at; below is the idea behind the code used in ConcurrentHashMap.
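
A minimal sketch of that pattern (illustrative class and field names; this is not the actual JDK source):

import java.util.concurrent.atomic.AtomicIntegerFieldUpdater;

// Sketch of lazy, thread-safe array initialization in the spirit of
// ConcurrentHashMap.initTable(); names and the fixed capacity are illustrative.
public class LazyTable {
    private volatile Object[] table;
    private volatile int initializing; // 0 = not started, 1 = some thread is allocating

    private static final AtomicIntegerFieldUpdater<LazyTable> INIT =
            AtomicIntegerFieldUpdater.newUpdater(LazyTable.class, "initializing");

    Object[] getTable(int capacity) {
        Object[] tab;
        // A null or zero-length table is treated as "not yet initialized",
        // mirroring the length check mentioned below.
        while ((tab = table) == null || tab.length == 0) {
            if (INIT.compareAndSet(this, 0, 1)) {
                // Winner: allocate and publish through the volatile field.
                if ((tab = table) == null || tab.length == 0) {
                    table = tab = new Object[capacity];
                }
                return tab;
            }
            Thread.yield(); // Loser: give the winner a chance to finish.
        }
        return tab;
    }
}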


It has a check on the array length (not just a null check) because the length is written only after the memory allocation is done.

Code used in blog is available @ github

Wednesday, 30 September 2015

Wrap around design pattern in java8

The wrap-around pattern is not listed in the GOF book but is very useful for problems like the ones below:

 - Loop constructs, e.g. do-while/while/for loops
 - A stopwatch around some code
 - Wrapping a checked exception in a runtime exception
 - Initialization and cleanup, e.g. thread pool creation/destruction or file open/close
 - Adding context info to threads, e.g. request context for logging or passing a security context

A lot of plumbing code is required in Java to do such simple things. Java 8 added lambda support, and that has the answer to such problems.

Using lambdas, behavior can be passed as an argument to any function, which is a very powerful thing when you want to solve the above problems.

Wrap Around
The template of a wrap-around function is something like this:

Pre code
Actual behavior
Post code

WrapAround for loop
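
The original snippet is on github; a minimal sketch of the idea, with assumed interface and method names, could look like this:

// Sketch of a wrap-around loop construct: the condition and the body are both
// passed in as behavior. Interface and method names are illustrative.
public class Loops {

    @FunctionalInterface
    public interface Condition {
        boolean check();
    }

    @FunctionalInterface
    public interface Block {
        void execute();
    }

    public static void loop(Condition condition, Block block) {
        while (condition.check()) {   // pre code: evaluate the passed-in condition
            block.execute();          // actual behavior supplied by the caller
        }
    }

    public static void main(String[] args) {
        int[] counter = {0};
        loop(() -> counter[0] < 5, () -> System.out.println("value " + counter[0]++));
    }
}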



The above code is very straightforward: it has 2 functional interfaces, one for the condition & another for the block of code to execute, and those 2 behaviors are passed to the loop function using lambdas.
This allows us to introduce a new construct.

Let's look at some more examples.

WrapAround for time/stopwatch
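
A minimal sketch of the same idea for timing (assumed class and method names, not the snippet from the original post):

import java.util.function.Supplier;

// Sketch of a stopwatch wrap-around: any block of code passed as a lambda gets timed.
public class StopWatch {

    public static <T> T time(String label, Supplier<T> block) {
        long start = System.nanoTime();                     // pre code
        try {
            return block.get();                             // actual behavior
        } finally {
            long elapsed = System.nanoTime() - start;       // post code
            System.out.println(label + " took " + elapsed / 1_000_000 + " ms");
        }
    }

    public static void main(String[] args) {
        int sum = time("sum", () -> {
            int s = 0;
            for (int i = 0; i < 1_000_000; i++) s += i;
            return s;
        });
        System.out.println(sum);
    }
}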



WrapAround Closable/Exception
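
A minimal sketch for resource handling plus checked-exception wrapping (interface and method names are illustrative):

import java.io.BufferedReader;
import java.io.FileReader;

// Sketch: open/close and exception translation live in one place,
// the caller only supplies the behavior to run against the resource.
public class Resources {

    @FunctionalInterface
    public interface ReaderBlock<T> {
        T apply(BufferedReader reader) throws Exception;
    }

    public static <T> T withFile(String path, ReaderBlock<T> block) {
        try (BufferedReader reader = new BufferedReader(new FileReader(path))) { // pre: open
            return block.apply(reader);                                          // behavior
        } catch (Exception e) {
            throw new RuntimeException(e);    // wrap checked exception
        }                                     // post: close via try-with-resources
    }

    public static void main(String[] args) {
        System.out.println(withFile("input.txt", BufferedReader::readLine));
    }
}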


Java 8 has tons of features that can make code concise, and I used just one of them to implement something really useful.

Code used in blog is available @ github

Thursday, 24 September 2015

Performance tuning story

Recently I was tuning the startup time of an application; it was taking close to 30 minutes and a cold restart was a real pain.

In this blog I will share the story of this performance tuning experience.

Current state of application
The application loads data from the database at startup and keeps it in memory for fast response times, and to make things interesting all loading happens using multiple threads :-)

The current loading logic is described below.
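
The original snippet is not reproduced in this post; based on the description, a rough sketch of its shape might be the following (all names are illustrative):

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Rough sketch of the original logic: each worker thread takes the cache lock
// and then runs a per-record query while still holding it.
class OriginalLoader {
    interface Database { Object queryDetails(String recordId); }

    private final Map<String, Object> cache = new HashMap<>();
    private final Database database;

    OriginalLoader(Database database) { this.database = database; }

    // Called concurrently by many loader threads.
    void load(List<String> recordIds) {
        for (String id : recordIds) {
            synchronized (cache) {
                Object details = database.queryDetails(id); // I/O while holding the lock!
                cache.put(id, details);
            }
        }
    }
}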

Looking at the above logic it is clear that the lock & the per-record database query must be causing the problem, and profiling confirmed that.
The code went through a couple of rounds of improvement before it reached an acceptable timing.

Round 1 - Remove nested query

In this round the per-record database query was replaced with one query that brings all the data required for the records, and the per-record requests were then served from that master data set.

After that change the code looks something like this.
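
A rough sketch of that change (illustrative names): one bulk query per batch instead of one query per record, still performed under the lock:

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Round 1 sketch: the per-record query is replaced by a single bulk query,
// and per-record values are then served from that master data set.
class Round1Loader {
    interface Database { Map<String, Object> queryDetailsFor(List<String> recordIds); }

    private final Map<String, Object> cache = new HashMap<>();
    private final Database database;

    Round1Loader(Database database) { this.database = database; }

    void load(List<String> recordIds) {
        synchronized (cache) {
            Map<String, Object> masterData = database.queryDetailsFor(recordIds); // one query
            for (String id : recordIds) {
                cache.put(id, masterData.get(id)); // served from the master data set
            }
        }
    }
}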

This gave a 30% improvement, which was a good starting point, with a small trade-off of extra transient memory.

Round 2 - Reduce scope of lock

Since this code was multi-threaded, this time profiling showed a hotspot on the lock, and the way to avoid that is to either remove the lock or reduce its scope.

The scope of the lock was reduced, and this allowed the logic to be broken into 2 steps:

 - Read from the database
 - Update the cache

Earlier the database query was done after the lock was acquired; with the new approach that changed, which allowed all parallel requests to query the database with no contention on the cache.

The code looked something like this.
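
A rough sketch (illustrative names): the database read now happens before the lock is taken, so the lock only covers the cache update:

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Round 2 sketch: parallel loaders query the database with no lock held and
// only contend on the short cache-update step.
class Round2Loader {
    interface Database { Map<String, Object> queryDetailsFor(List<String> recordIds); }

    private final Map<String, Object> cache = new HashMap<>();
    private final Database database;

    Round2Loader(Database database) { this.database = database; }

    void load(List<String> recordIds) {
        // Step 1: read from the database, no lock held.
        Map<String, Object> masterData = database.queryDetailsFor(recordIds);

        // Step 2: update the cache under the lock, which is now held only briefly.
        synchronized (cache) {
            cache.putAll(masterData);
        }
    }
}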

This gave another 40% gain with a little more trade-off of transient memory, but memory was not an issue, because the Oracle ResultSet only releases memory after it is closed, so memory-wise there was no significant difference.

A 70% improvement was great, but there was more scope, so one more change was made to make it faster.

Round 3 - Single Writer Batch update
Now the bottleneck was the "write to cache" step, because of multiple writers, and it was reduced by using a single writer doing batch updates to the cache.

The DB query readers & the cache writer were connected using a queue; after this change the code looked something like this.
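
A rough sketch (illustrative names): reader threads enqueue query results and a single writer drains the queue and updates the cache in batches:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Round 3 sketch: many readers query the database and push results onto a queue;
// one writer thread drains the queue so the lock is taken only a few times.
class Round3Loader {
    interface Database { Map<String, Object> queryDetailsFor(List<String> recordIds); }

    private final Map<String, Object> cache = new HashMap<>();
    private final BlockingQueue<Map<String, Object>> queue = new LinkedBlockingQueue<>();
    private final Database database;

    Round3Loader(Database database) { this.database = database; }

    // Called by many reader threads: no lock, just query and enqueue.
    void read(List<String> recordIds) {
        queue.offer(database.queryDetailsFor(recordIds));
    }

    // Called by the single writer thread: drain everything and write once.
    void writeBatch() {
        List<Map<String, Object>> batch = new ArrayList<>();
        queue.drainTo(batch);
        synchronized (cache) {                       // one lock acquisition per batch
            for (Map<String, Object> chunk : batch) {
                cache.putAll(chunk);
            }
        }
    }
}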

Now the lock was acquired only a few times and the maximum amount of data was written while holding it; this gave around a 25% gain.

With the above improvements the startup time was improved by 95%, and that was enough to stop further experiments.

Conclusion
 - Avoid making lots of small queries to the database in a loop.
 - Never do I/O or network calls while a lock is held.
 - Reduce the scope of locks.
 - Batch expensive operations.

Wednesday, 23 September 2015

Collection resize and merge policy

Collections like ArrayList/Vector are an extension of plain arrays with no size restriction; these collections have dynamic grow/shrink behavior.

Growing/shrinking an array is an expensive operation, and to amortize that cost different types of policies are used.

In this blog I will share some of the policies that can be used if you are building such a collection.

Growing
To decide the best policy for extending, we need to answer two questions:
 - When to extend the collection
 - By how much

The "when" part is easy to answer: definitely when elements are added. The main trick is in the "how much".
Java collections (i.e. ArrayList) grow by 50% every time, which is a good option because it reduces frequent allocation & array copying; the only problem with this approach is that most of the time the collection is not fully filled.

There are some options to avoid wasting space:
 - If you know how many elements will be required, then create the collection with that initial size, so that no resizing is required.
 - Have control over how much it grows; this is easy to implement by making the growth factor a parameter of the collection.

Below is a code snippet for a growing list.
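
A rough sketch of the idea (the original snippet is on github; the class name and defaults here are illustrative):

import java.util.Arrays;

// Sketch of a list with a configurable growth factor.
class GrowingList<E> {
    private Object[] elements;
    private int size;
    private final double growthFactor;

    GrowingList(int initialCapacity, double growthFactor) {
        this.elements = new Object[initialCapacity];
        this.growthFactor = growthFactor;
    }

    void add(E value) {
        if (size == elements.length) {
            // Grow by the configured factor instead of a hard-coded 50%.
            int newCapacity = Math.max(size + 1, (int) (size * (1 + growthFactor)));
            elements = Arrays.copyOf(elements, newCapacity);
        }
        elements[size++] = value;
    }

    @SuppressWarnings("unchecked")
    E get(int index) {
        return (E) elements[index];
    }

    int size() { return size; }
}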


Shrink
The remove operation on a collection is a good hook for deciding when to reduce the size, and some factor can be used for how much, e.g. 25%: if after a remove the collection is 25% free, then shrink it to the current size.

Java's ArrayList does not shrink automatically when elements are removed, so this should be watched when a collection is reused, because it will never shrink, and in many production systems collections are only half filled. It would be nice to have control over shrinking when elements are deleted.

Code snippet for auto shrinking:
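
A rough sketch of the idea, using the 25% threshold described above (the original snippet is on github; names are illustrative):

import java.util.Arrays;

// Sketch of a remove() that shrinks the backing array once 25% of it is free.
class ShrinkingList<E> {
    private Object[] elements = new Object[16];
    private int size;

    void add(E value) {
        if (size == elements.length) {
            elements = Arrays.copyOf(elements, size * 2);
        }
        elements[size++] = value;
    }

    @SuppressWarnings("unchecked")
    E remove(int index) {
        E removed = (E) elements[index];
        System.arraycopy(elements, index + 1, elements, index, size - index - 1);
        elements[--size] = null;

        // Shrink when at least 25% of the capacity is unused.
        int free = elements.length - size;
        if (elements.length > 16 && free >= elements.length / 4) {
            elements = Arrays.copyOf(elements, Math.max(16, size));
        }
        return removed;
    }

    int size() { return size; }
}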



Hybrid Growth/Shrink
The above allocation/de-allocation has a couple of issues that can have a big impact on performance, e.g.:
 - Big array allocations
 - Copying elements

A custom allocation policy can be used to overcome this problem by having one big container with multiple slots in which elements are stored.

Elements are always appended to the last pre-allocated slot, and when it has no capacity left a new slot is created and used. This custom allocation has a few nice benefits, e.g.:

 - No big allocation is required
 - Zero copy overhead
 - Really big lists (i.e. longer than Integer.MAX_VALUE) can be created
 - Such a collection is a good fit for JDK8 collectors, because the merge cost is very low.

Nothing is free; this comes with a trade-off:
 - Accessing an element by index can be a little slower because the collection has to work out which slot holds the item.

Code snippet for a slot-based collection:
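
A rough sketch of the slot approach (the original snippet is on github; the class name and chunk size are illustrative):

import java.util.ArrayList;
import java.util.List;

// Sketch of a slot-based list: a list of fixed-size chunks, so growth never
// copies existing elements and the total size can exceed Integer.MAX_VALUE.
class SlottedList<E> {
    private static final int SLOT_SIZE = 1024;
    private final List<Object[]> slots = new ArrayList<>();
    private long size;

    void add(E value) {
        int offset = (int) (size % SLOT_SIZE);
        if (offset == 0) {
            slots.add(new Object[SLOT_SIZE]); // new chunk, no copy of old elements
        }
        slots.get(slots.size() - 1)[offset] = value;
        size++;
    }

    @SuppressWarnings("unchecked")
    E get(long index) {
        // Index lookup has to locate the slot first, the trade-off mentioned above.
        return (E) slots.get((int) (index / SLOT_SIZE))[(int) (index % SLOT_SIZE)];
    }

    long size() { return size; }
}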


Code used in blog is available @ github

Thursday, 27 August 2015

Indexes in Embedded database

Embedded databases are a very important part of applications with micro/millisecond latency requirements.
Most embedded databases have very good read performance and specialized types of indexes to support fast reads.

Reads are all about having an index on a field, and since we are in an era where the "ONE SIZE FITS ALL" theory does not work, different types of indexes are required for different types of reads.

Let's get cracking on a few types of index that are required for fast reads.

Hash Index
This is a very basic type of index that can be implemented using a HashMap: to build such an index, just choose the field and use it as the key of the HashMap and you are done. Very good when the lookup is going to be based on a key.
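
A minimal sketch of such an index (this is not the HashDataIndex class from the linked repo; names are illustrative):

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of a hash index: records are grouped by the value of one field.
class HashIndex<K, V> {
    private final Map<K, List<V>> index = new HashMap<>();

    void add(K fieldValue, V record) {
        index.computeIfAbsent(fieldValue, k -> new ArrayList<>()).add(record);
    }

    List<V> lookup(K fieldValue) {
        return index.getOrDefault(fieldValue, Collections.emptyList());
    }
}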

Unique Index
This is just another variant of the hash index, but a Set is used to maintain the distinct values; it can be used to answer queries that need distinct values or to check whether a value exists or not.

Range Index
Now it starts to become a little interesting, because the criteria are based on a range of values, e.g.:
 - greater than XX and less than YY
 - greater than XX
 - less than XX

Tree-based indexes come to the rescue; all such queries can be answered by a B-Tree data structure.
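
A minimal sketch using the JDK's sorted TreeMap rather than a B-Tree (this is not the TreeDataIndex class from the linked repo; names are illustrative):

import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// Sketch of a range index: a sorted map answers greater-than / less-than /
// between queries via subMap and tailMap.
class RangeIndex<K extends Comparable<K>, V> {
    private final NavigableMap<K, List<V>> index = new TreeMap<>();

    void add(K fieldValue, V record) {
        index.computeIfAbsent(fieldValue, k -> new ArrayList<>()).add(record);
    }

    Collection<List<V>> between(K from, K to) {
        return index.subMap(from, true, to, true).values();
    }

    Collection<List<V>> greaterThan(K from) {
        return index.tailMap(from, false).values();
    }
}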

Starts With or Contains
This is similar to a LIKE type of query in a database, e.g. name like 'A%' or name like '%A%'.
A trie data structure can be used to answer such queries.

I have tried to build a simple library to support the above types of query, but I have left out the concurrency aspect to keep the solution simple.

Code is available @ github

The sample application has 3 types of index:

Hash - for exact-value lookups on some attribute, code available @ HashDataIndex

Tree - useful for range queries, e.g. queries on a date field, code available @ TreeDataIndex

Contains - useful for LIKE-based search; a suffix tree data structure is used for this, code available @ ContainsDataIndex

Conclusion
- Memory usage is a very important factor for this type of API, and Java collections are not the best at it, so for production usage memory-efficient collections should be used.

- Storing raw data as Java objects will not give good performance for a big JVM, so a compact/memory-efficient representation is required as the Java heap grows.

- Lastly, RAM has a limit, so it is impossible to keep all the raw data or index data in memory, and that makes it an interesting problem to solve; some kind of partitioning strategy is required for the data & indexes, with only the required partitions kept online in memory. There are many open source libraries that allow data to be stored using memory-mapped files.

Thursday, 23 July 2015

Efficiency with Algorithms

Recently I had a look at the excellent talk Efficiency with Algorithms, Performance with Data Structures; it has some really good content on performance.

In this blog I will share some of the ideas from the talk and a few things that I have learned.

Pre processing
This is a very common trick: a trade-off between the processing required when the actual request comes in vs the time taken to pre-compute something. Some good examples are:

IndexOf
This is a very common string operation; almost all applications need this algorithm. Java uses a brute-force algorithm for it, but this can become a bottleneck very quickly, so it is better to use the Knuth-Morris-Pratt algorithm, which pre-computes a table to get better performance.

Search
Sequential search vs binary search is a trade-off between search time and pre-processing (i.e. sorting) time; it starts to pay off once the number of comparisons required goes over O(log n).

Binary search is at the heart of so many algorithms; it reduces expensive search operations.

Many string permutation search problems are solved using it.

HashMap in Java has a nice optimization for when many keys with the same hash code are added: to ameliorate the impact, when keys are Comparable the class may use the comparison order among keys to help break ties.

Indexes
Lookup is an expensive operation and binary search can't answer all types of query, so you should build specialized indexes for fast search; some of the options are key/value, prefix/suffix, bitmap indexes, inverted indexes, etc.

Pre Allocate
This is a classic one: if you know how big your data structure will be, then it is better to pre-allocate it rather than keep expanding it multiple times and incur the cost of allocation & copying.

Array-based data structures double their capacity every time re-allocation is required, and this is done to amortize the allocation cost, so they do big and less frequent allocations; in the Java world ArrayList, HashMap, etc. are good examples of that.

Many design patterns like Object Pooling/Flyweight are based on this.

Reduce Expensive/Duplicate operations
Every algorithm has some expensive operations, and to make it efficient those expensive operations should be reduced; some common examples are:

HashCode
Hash code computation is an expensive operation, so it should not be computed multiple times for a given key. One way is to compute it once and pass it to other functions, so recomputation is not required, or to pre-compute it; e.g. the String class in Java caches its hash code to save some time.

Modulus / Divide
These are expensive arithmetic operations; bitwise operations can be used to do the same thing at lower cost.
If a data structure is circular, e.g. a circular array, and you want to put a value in a free slot, then a modulus operation is required; if the capacity of the array is a power of 2, then the bitwise operation index & (capacity - 1) can be used to get the modulus value, and this gives a significant performance gain at the cost of a little extra memory. This technique is used by HashMap in Java.
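
A quick sketch of the trick (illustrative values):

// For a power-of-two capacity, (index & (capacity - 1)) equals (index % capacity)
// but avoids the more expensive modulus instruction.
public class PowerOfTwoMod {
    public static void main(String[] args) {
        int capacity = 16;                      // must be a power of 2
        for (int index : new int[]{5, 16, 21, 100}) {
            int slow = index % capacity;
            int fast = index & (capacity - 1);
            System.out.println(index + " -> " + slow + " / " + fast); // same result
        }
    }
}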

Similarly the right-shift (>>) operation can be used for division to get better performance, but nowadays compilers are smart, so you get this one for free; there is no need to write it by hand.

Reduce copy overhead
Array-based data structures amortize resizing cost by increasing capacity 2x, which is good, but the overhead of copying comes along with it. Another approach is a chain of arrays: you still allocate one big chunk, but you don't have to copy the old values, you just add the new block of memory to the chain of allocated blocks.
gs-collections has CompositeFastList, which is built using this approach.

Batching
Expensive operations like writes to a file/network/database should be batched to get the best performance.

Co-locate data
Keeping all the required data together gives very good performance because of how processors work, but this is one thing that is broken in many applications.
Mike Acton talks about this in detail in his Data-Oriented Design and C++ talk.

Column-based storage is very good for analytic/aggregation use cases, because most of the time the data of a single column across all rows is required to answer an analytic/aggregation request.

Linear-probing hash tables are also a good example of data co-location.

Bit packing is another option to keep the required data very close, but it should be used with extra caution because it can make code complex.

Unnecessary Work
Some algorithms suffer from this problem; the most common are recursive algorithms like factorial or Fibonacci, both of which do duplicate work that can be eliminated with the memoization technique, as in the sketch below.
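
A quick sketch of memoization applied to Fibonacci (illustrative code, not from the talk):

import java.util.HashMap;
import java.util.Map;

// Memoization: cache results so each Fibonacci value is computed only once
// instead of an exponential number of times.
public class MemoFib {
    private static final Map<Integer, Long> cache = new HashMap<>();

    static long fib(int n) {
        if (n <= 1) return n;
        Long cached = cache.get(n);
        if (cached != null) return cached;
        long result = fib(n - 1) + fib(n - 2);
        cache.put(n, result);
        return result;
    }

    public static void main(String[] args) {
        System.out.println(fib(50)); // fast, thanks to the cache
    }
}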


Simplicity & readability of the code should not be lost during the efficiency ride, because the next guy has to maintain it :-)

Thursday, 16 July 2015

BitMap sorting

Data structures are the most important building block of any algorithm; they are what make an algorithm efficient.

A bitmap is a very powerful data structure and a good fit for many problems; it is used for indexing in many databases.

A bitmap can also be used for sorting, and in some specific cases bitmap sorting can outperform many famous sorting algorithms with a very low memory footprint.

Take a case where you have to sort an input file that has lots of numbers between 1 and 10 million using a low memory footprint; for simplicity's sake assume that there are no duplicates.

What are the ways to solve this problem ?

- Classic MergeSort
This is the first solution that comes to mind whenever sorting is discussed; it will produce many intermediate files, which will then be merged.

- Multi Read & Sort
In this approach the input file is read multiple times, once per range of data; e.g. the first pass reads values from 0 to 250K, the next from 250K to 500K, and so on.

This solution takes less memory, but the large amount of I/O makes it very slow.

- Amazing BitSort
Bit sorting is a really good fit for this problem: allocate a bit array and turn on the bit at each value's index, and that's it; the values come out sorted as they are added to this data structure.

For example, the values {1,2,5,6,7,11} can be represented as a bit array with the bits at indexes 1, 2, 5, 6, 7 and 11 set to 1 and every other bit set to 0.

Implementing such a data structure is simple in C/C++, but in Java bitwise manipulation is required, so to keep it simple I have implemented it using a boolean array.

Code snippet 
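
A rough sketch of the boolean-array version described above (the full version is on github; the class name is illustrative):

// Bit sorting with a boolean array: adding a value just turns its flag on,
// no comparisons are needed anywhere.
public class BitSort {
    private final boolean[] bits;

    public BitSort(int maxValue) {
        this.bits = new boolean[maxValue + 1];
    }

    public void add(int value) {
        bits[value] = true;   // the index itself encodes the value
    }

    public boolean[] bits() {
        return bits;
    }

    public static void main(String[] args) {
        BitSort sorter = new BitSort(11);
        for (int v : new int[]{11, 5, 1, 7, 2, 6}) {
            sorter.add(v);
        }
    }
}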


  
It is very simple to implement this powerful data structure, and only a little trick is required for iterating over the values.

Code snippet for iteration
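
A rough sketch of the iteration trick (illustrative; the real code is on github):

// Iterating just walks the boolean array and emits the indexes that are set;
// by construction they come out in sorted order.
public class BitSortIteration {
    public static void printSorted(boolean[] bits) {
        for (int value = 0; value < bits.length; value++) {
            if (bits[value]) {
                System.out.println(value);
            }
        }
    }
}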


Conclusion
A very useful data structure with a low memory footprint & very fast sorting, but it has some trade-offs that you should be aware of:

- If the number of items is small then this might not give any benefit.
- If the input values are sparse then a lot of space is left unused because of fragmentation.

Code is available @ github

Tuesday, 7 July 2015

Joy Inc is really Joyful

I heard the closing keynote talk by Richard Sheridan at the Agile Singapore conference, where he shared how he built a joyful company.

Richard has published the book Joy, Inc.: How We Built a Workplace People Love to share his experience of building an amazing organization.

I got a copy of the book a few days back and started reading it; it is very difficult to put down once you pick it up.
I have read a few chapters in the last 2 days and want to share a few things that I liked.

The book starts with "Let's try something new, let's run an experiment"; this is a very good approach to plant the seed for change.

Tangible Joy
The book defines tangible joy as "delivering a product or service to the world that's so enjoyed, in fact, that people stop you on the street and say, 'Really, you did that? I love it.'"

Reading this made me realize what crappy products I have built :-(

Stupid Users
It has a section on stupid users; the context is that the software industry thinks users are stupid and need a Dummies book to help them successfully use our beautifully designed technology, because they are dumb and don't understand our masterpiece.

Joy will lift you and your team
He gives the reason why the Wright brothers were able to fly and Samuel Pierpont Langley was not:

Langley was trying to build an airplane. The Wright brothers wanted to fly.

This is so applicable to our industry; we are so much after bleeding-edge technology that we forget the real problem that is supposed to be solved by the technology.

Lots of experiments
This book is full of the experiments he ran to build Joy, Inc.

Pair Programming
Everybody knows the benefits of pairing, but still many organizations are reluctant to try it.

Wide Open space
A wide open space with no offices or cubes is very good for collaborative work, but organizations have a cube culture, and with every promotion an individual moves away from the team and gets locked in a private office.

Rewards System
With a culture change the rewards system should also change, but most organizations never change their rewards system.
I think this is one of the main topics of the book; it mentions a few ideas that give a fair picture of what a rewards system should look like.

Risk of staying same
This was eye-opening: "The risk of staying the same had been far greater than the risk of change."

Big-Dog Discussion
Big dogs in an organization are special people, and when you look at them a voice echoes in your ear:
"You don't matter as much as I do"
At Menlo, big decisions are not made in closed-door conference rooms, so that people don't feel they are unimportant.

Tower of knowledge
The book has a very nice section about "Mr Dave" - the person on the team who has vast technical knowledge that no one else has.
This is such a common problem; most organizations fall into this trap, and it is very common in banks.
Dave becomes the bottleneck of the entire organization or department, and he can't go on vacation!
Pair programming is the recommended solution for Dave and the organization.


The book has lots of practical advice; I highly recommend it.

Some talks/podcasts by Richard:
A CEO’S JOURNEY TO CREATING A CULTURE WHERE PEOPLE CAN THRIVE
Joy, Inc. | Rich Sheridan

Thursday, 18 June 2015

Integer Encoding Magic

I was asked a question about Serializable/Externalizable and why Externalizable is fast; my answer was not convincing, so I spent some time on Google to get more details, and the web has tons of information on it.

I thought of spending some time on open source serialization frameworks to check what types of optimization are done to reduce size.

One of the techniques for integer types is to use variable-length encoding; it is based on a very simple rule.

An integer is 4 bytes, and if the value is small then we don't have to use all 4 bytes to store it, so for small input values this technique can reduce the size to a great extent.

Some bitwise magic is required to achieve this.
I did a quick comparison of raw vs encoded and the results were very good. In this test I write 1 million integer values starting from 0.

Without encoding it took 4019 KB for 1 million values, and with byte encoding it took 2991 KB; encoding gives around a 25% gain for this sample test.

Integer Encoding
Code snippet for integer encoding; it is taken from Apache Avro.
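
The Avro snippet itself is not reproduced here; the sketch below shows the same zig-zag plus variable-length idea in an illustrative form (it is not the Avro source):

import java.io.ByteArrayOutputStream;

// Zig-zag maps small negative and positive numbers to small unsigned numbers;
// each output byte then carries 7 bits of payload and 1 continuation bit.
public class VarIntEncoding {

    public static void writeInt(int value, ByteArrayOutputStream out) {
        int n = (value << 1) ^ (value >> 31);  // zig-zag encode
        while ((n & ~0x7F) != 0) {
            out.write((n & 0x7F) | 0x80);      // lower 7 bits + continuation flag
            n >>>= 7;
        }
        out.write(n);                          // final byte, no continuation flag
    }

    public static void main(String[] args) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeInt(3, out);      // 1 byte instead of 4
        writeInt(-200, out);   // 2 bytes instead of 4
        System.out.println("bytes written: " + out.size());
    }
}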


Conclusion
The integer encoding technique gives a good gain in size, but it adds overhead on the write/read side, so it should be used with the side effects in mind. Such an approach is very good for serialization frameworks, because you want to reduce the size of the data that is sent over the wire or written to a persistent store.

Code used for this blog is available @ github



Thursday, 7 May 2015

Experiment with String Split

Splitting a string on some token is a very common operation in applications; in this blog I will share some commonly used options and the types of overhead involved in them.

String.split
This is the most common approach and it looks harmless until you look at the code!
First it creates an ArrayList for storing the values, and then this ArrayList is converted to a String array.

This function produces too much garbage and it provides only one way to extract the values.

StringTokenizer
This is much better than String.split because it does not need an intermediate buffer like an ArrayList/String array to hold the values, but it still creates a String object for each token, which adds to the garbage.

One of the good things about StringTokenizer is that it does not force the caller to use a fixed data structure, so the caller is free to decide what to do with the values; it is just like a streaming operation.

String.split vs StringTokenizer
Let's look at some performance numbers. In this test I take the sample strings below:

String[] values = {
        "this,is,simple,test",
        "lets,see,how,it,works",
        "this,is,very,simple,test"};

Each line is split 10 million times
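
The numbers quoted below came from the author's benchmark on github; the following is only an illustrative sketch of the shape of the comparison:

import java.util.StringTokenizer;

// Rough sketch: split the same lines many times with String.split and
// with StringTokenizer and compare the elapsed time.
public class SplitBenchmark {
    public static void main(String[] args) {
        String[] values = {
                "this,is,simple,test",
                "lets,see,how,it,works",
                "this,is,very,simple,test"};
        int iterations = 10_000_000;

        long start = System.currentTimeMillis();
        for (int i = 0; i < iterations; i++) {
            for (String line : values) {
                String[] parts = line.split(",");   // allocates a list, an array and strings
            }
        }
        System.out.println("String.split: " + (System.currentTimeMillis() - start) + " ms");

        start = System.currentTimeMillis();
        for (int i = 0; i < iterations; i++) {
            for (String line : values) {
                StringTokenizer tokenizer = new StringTokenizer(line, ",");
                while (tokenizer.hasMoreTokens()) {
                    tokenizer.nextToken();          // still one String per token
                }
            }
        }
        System.out.println("StringTokenizer: " + (System.currentTimeMillis() - start) + " ms");
    }
}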

So StringTokenizer is definitely the winner in this case, because it produces less garbage compared to String.split, but it still produces String objects.

It is possible to avoid creating those String objects as well by using a recyclable CharSequence, which is just like a String but gives a lot of flexibility.

Let's look at another implementation using the recyclable CharSequence approach.
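
The implementation used for the numbers below is on github; the following is only an illustrative sketch of the recycling idea (class and method names are assumptions):

// One mutable CharSequence view is reused for every token, so no per-token
// String is allocated while splitting.
public class RecyclingSplitter {

    // Mutable window over the original string; reset() repositions it.
    static final class RecycledSequence implements CharSequence {
        private String source;
        private int start, end;

        void reset(String source, int start, int end) {
            this.source = source;
            this.start = start;
            this.end = end;
        }

        public int length() { return end - start; }
        public char charAt(int index) { return source.charAt(start + index); }
        public CharSequence subSequence(int s, int e) {
            return source.subSequence(start + s, start + e);
        }
        public String toString() { return source.substring(start, end); }
    }

    interface TokenConsumer { void accept(CharSequence token); }

    // Walk the line once and hand each token to the consumer via the reused view.
    public static void split(String line, char separator, TokenConsumer consumer) {
        RecycledSequence token = new RecycledSequence();
        int start = 0;
        for (int i = 0; i <= line.length(); i++) {
            if (i == line.length() || line.charAt(i) == separator) {
                token.reset(line, start, i);
                consumer.accept(token);
                start = i + 1;
            }
        }
    }

    public static void main(String[] args) {
        split("this,is,simple,test", ',', token -> System.out.println(token.length()));
    }
}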


It is a very simple technique to avoid intermediate String objects; let's look at the performance numbers for this approach.

The recyclable CharSequence shines in this test; it is around 3X faster than String.split and 2X faster than StringTokenizer.

Code available @ github

Wednesday, 14 January 2015

Java8 Parallel Streams


I recently found a great article on when to use parallel streams by Doug Lea and team. In this blog I will share some performance results for different data structures.

Splittable data structures
This is one of the most important factors that needs to be considered. All parallel stream operations are executed using the default fork-join pool, and the fork-join pool uses a divide-and-conquer algorithm to split the stream into small chunks and apply the function to those chunks, so splitting is an important factor, and if splitting takes too long then the whole computation will choke!

Types of data structure

- Array
Array-based data structures are the most efficient from a splitting perspective because each element can be accessed randomly, so splitting is a very quick operation.
Some examples of array-based collections are ArrayList & open-addressing-based Sets/Maps.

- Tree
Balanced tree-based collections like TreeMap or ConcurrentHashMap. It is easy to chop the collection into 2 parts, but if the tree is not balanced then splitting will not be that efficient, because the task load will not be equal for each thread.
Another thing to consider is that the memory access pattern of a tree-based data structure is random, so there will be a memory latency cost when elements are accessed.

- LinkedList
This type of data structure gives the worst splitting performance, because each element must be visited to split it. Some examples from the JDK are LinkedList and the queues.

- Others
This covers all other types of data structure, e.g. I/O-based ones such as BufferedReader.lines, which returns a stream whose splitting operation is very expensive.

Some performance numbers
Every performance tuning guess must be backed by an experiment, so let's have a look at some numbers.

Some code snippets
The original snippets for the Array stream, ArrayList stream, Set stream and LinkedList stream, together with the measurement code, are available @ github; a condensed sketch of the whole test is given below.
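
This condensed version is illustrative only: the collection sizes and the operation being timed are assumptions, and the numbers quoted below came from the author's original benchmark, not from this sketch.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedList;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Run the same parallel operation over differently shaped sources and time it.
public class ParallelStreamTest {
    public static void main(String[] args) {
        int size = 1_000_000;
        int[] array = IntStream.range(0, size).toArray();
        List<Integer> arrayList = IntStream.range(0, size).boxed()
                .collect(Collectors.toCollection(ArrayList::new));
        Set<Integer> set = new HashSet<>(arrayList);
        List<Integer> linkedList = new LinkedList<>(arrayList);

        measure("Array", () -> Arrays.stream(array).parallel().asLongStream().map(i -> i * 2L).sum());
        measure("ArrayList", () -> arrayList.parallelStream().mapToLong(i -> i * 2L).sum());
        measure("HashSet", () -> set.parallelStream().mapToLong(i -> i * 2L).sum());
        measure("LinkedList", () -> linkedList.parallelStream().mapToLong(i -> i * 2L).sum());
    }

    static void measure(String label, Runnable work) {
        long start = System.currentTimeMillis();
        work.run();
        System.out.println(label + " took " + (System.currentTimeMillis() - start) + " ms");
    }
}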


Result 

Array      - 31 sec
ArrayList  - 31 sec
Set        - 52 sec
LinkedList - 106 sec

The results confirm the guess that array-based collections are the fastest, so choose the right data structure before you start using parallel streams.

Code used in blog is available @ github