Thursday, 18 June 2015

Integer Encoding Magic

Question was asked to me on serializable/externalizable and why externalizable is fast , my answer was not convincing , so i spend some time on google to get more details and web has tons of information on it.

I thought of spending some time on open source serialization framework to check what type of optimization are done to reduce size.

One of the technique for integer type is use encoding, this technique is based on very simple rule.

Integer is of 4 Byte and if value is small then we don't have to use 4 byte to store it, so that way for small input values this technique can reduce size by great extent.

Some bitwise magic is required to achieve this.
I did quick compare of Raw vs Encoding and results were very good. In this test i write 1 Million integer values starting from 0 .

Without encoding it took 4019 KB for 1 Million and with byte encoding it took 2991 KB, encoding gives around 25% gain for this sample test.

Integer Encoding
Code snippet for integer encoding, it is taken from Apache Avro

 Integer encoding technique gives good gain for size , but it adds overhead on write/read side, so it should be used by knowing the  side effect , such approach is very good for serialization framework because you want to reduce size of data that is sent over wire or written to some persistent store.

Coded used for this blog is available @ github

Thursday, 7 May 2015

Experiment with String Split

Splitting string based on some token is very common operation in application, in this blog i will share some options that are mostly used and types of overhead involved in it.

This is the most common approach and it looks harmless unless you look at the code!
First it creates ArrayList for storing values and then this arraylist is converted to String array

This function produces too much of garbage and it provides only one way to extract values.

String Tokenizer 
This is much better than String.split because it does not need intermediate buffer like ArrayList/String Array to hold the values, but it creates String objects for each token which adds to garbage.

One of the good thing about StringTokenizer is that it does not force caller to use fixed data structure , so caller is free to decide on what to do with values, it is just like Streaming operation.

String.split vs StringTokenizer
Lets look at some performance numbers. In this test i take below sample string

String[] values = {

Each line is split 10 million times

So definitely StringTokenizer is winner in this case and it is because it produces less garbage as compared to String.split but it still produces string objects.

It is possible to avoid creating those String object also by using Recycle Charsequence which is just like String but gives lots of flexibility.

Lets look at another implementation using recycle charsequence approach.

It is very simple technique to avoid intermediate string object, lets look at performance number using this approach.

Recycle charSequence shines in this test, it is around 3X times faster than String.split an 2X times faster than StringTokenizer.

Code available @ github

Wednesday, 14 January 2015

Java8 Parallel Streams

Recently found great article on when to use Parallel Stream by Doug lea and team. In this blog i will share some performance result of different data structure.

Splittable data structure
This is one of the most important factor that needs to be consider. All the parallel stream operation are executed using default fork join pool and fork join pool uses Divide and conquer algorithms to split stream in small chunks and apply function on small chunks, so splitting is important factor and if splitting is going to take more time then all computation is going to choke!

Types of datastructure

Array based data structure are most efficient data structure from splitting perspective because each element can be randomly accessed , so splitting is very quick operation.
Some of the example of array based collections are ArrayList & open addressing based Set/Map.

Balanced tree based collection like TreeMap or ConcurrentHashMap. It is easy to chop the collection into 2 parts, but if tree is not balanced then splitting will be not that efficient because task load will be not equal for each thread.
Another thing to consider is that memory access pattern for tree based data structure is random , so there will be memory latency cost when elements are accessed.

This type of data structure gives worst splitting performance because each element must be visited to split it. Some of the samples from JDK are LinkedList,Queues

- Others
This for all others type of datastructure for eg I/O based , BufferedReader.lines which returns stream but splitting operation is very expensive.

Some performance numbers
All performance tuning guess must be backed by experiment, so lets have look at some numbers.

Some code snippet
Array Stream
ArrayList Stream
Set Stream
LinkedList Stream

Measurement Code


Array 31 sec
ArrayList 31 sec
set 52 sec
linkedlist 106 sec

Result confirms the guess that Array based collections are fastest, so choose right datastructure before you start using parallel streams.

Code used in blog is available @ github

Thursday, 25 December 2014

what is new in java8 ConcurrentHashMap

Java8 has lots of new features that will help in writing code that express its intent more clearly and in concise way.
ConcurrentHashMap has lots of nice features that reduces boiler plate code, most of the new features are parallel, so you don't have to worry about locks/threads etc when using CHM.

Smart Cache
One of the most common usage of CHM is to build cache and it has little concurrency issue that value for key should be computed once and it should provide all the "happens after" guarantee.

Pre JDK8 you have to do some thing like 

JDK8 has special function computeIfAbsent for such type of thing

Difference is obvious for prejdk8 future task is required and some other special handling to make sure it works fine, but JDK8 just call to computeIfAbsent with lamdba expression and your are done.
There is computeIfPresent function also to remap/refresh value in cache.

Merge values 
One of the most common problem with maps is when values of same keys needs to merged, one of the example is you have map of word & count.
Every time word is added to map and if it exists then count needs to be incremented

Example code pre JDK8

Example code in JDK8

All the noise that was there in JDK7 code is removed and code express its intent so nicely

 Aggregate/Reduce values

This is also very common use case where single value is derived from values in map for eg total number of words

sample code for pre JDK8

sample code for JDK8

Code is so so much clean, extra thing to note in jdk8 reduce function is first parameter which is the (estimated) number of elements needed for this operation to be executed in parallel.
So this function supports parallel reducing

All the collection of JDK8 has done improvement on iteration method, so application code will never have any loop, it is all controlled by underlying collection and due to which iteration can be done in parallel .

CHM has 2 types of foreach

others interesting functions are search

One thing to note is that all the parallel function are executed on common forkjoin pool, there is way by which you can provide your own pool to execute task.

Saturday, 15 November 2014

Recommend books @ agile singapore

I recently attended agilesingapore conference. This was my first agile conference, lot to take away.
Amazing speakers were invited, keynote talks were very amazing and inspirational.

One of the things that you get from such type of conference is books recommendation. Lots of books were recommended. I only remember few of them.

I want to put it somewhere so that i can start making my read list for next couple of months.

Mindset: The New Psychology of Success

Fearless Change: Patterns for Introducing New Ideas  

101 Things I Learned in Architecture School

The Timeless Way of Building

Software Craftsmanship: The New Imperative

The Laws of Simplicity (Simplicity: Design, Technology, Business, Life)

97 Things Every Programmer Should Know: Collective Wisdom from the Experts

Growing Object-Oriented Software, Guided by Tests

Extreme Programming Explained: Embrace Change

The Click Moment: Seizing Opportunity in an Unpredictable World

The Fifth Discipline: The Art & Practice of The Learning Organization

Joy, Inc.: How We Built a Workplace People Love

Lot of books to read i have to start taking timeout out from my so called busy schedule!

Friday, 10 October 2014

Factory Without IF-ELSE

Object Oriented Language has very powerful feature of Polymorphism, it is used to remove if/else or switch case in code.

Code without condition is easy to read. There are some places where you have to put them and one of such example is Factory/ServiceProvider class.
I am sure you have seen factory class with IF-ELSEIF which keeps on getting big.

In this blog i will share some techniques that you can be used to remove condition in factory class.

I will use below code snippet as example

This is first thing that comes to mind when you want to remove conditions. You get the feeling of framework developer!

This looks very simple but only problem is caller has to remember fully qualified class name and some time it could be issue.

Map can be used to to map actual class instance to some user friendly name

This also looks neat without overhead of reflection.

This is interesting one

This method is using enum method to remove condition, one of the issue is that you need Enum for each type. You don't want to create tons of them!
I personally like this method.

If-else or switch case makes code difficult to understand, we should try to avoid them as much as possible. Language construct should be used to avoid some of switch case.

We should try to code without IF-ELSE and that will force us to come up with better solution.

Wednesday, 24 September 2014

Working Effectively with Legacy Code

Recently i attended Working Effectively with Legacy Code course by Michael Feathers

It was very good learning experience , he talks about how to work with code that does not have unit test or less unit test. He shared techniques can be used to improve legacy code and get better understanding of application.

This post is to share some of them before i forget:-)

Sprout Method/Class
This is pretty common technique but did't know that it has name. Adding new method/class sounds much easier than changing some existing code for new feature, so we should use this approach for any new feature that we want to introduce.

This approach can be used on existing method also to make it testable.
Work of caution that you don't want over do it !

Poor Man dependency injection
Everybody knows what dependency injection is, apart from some of the benefit it can also be used to make code testable, so for unit test you can have sample/dummy implementation that can be injected to your application code to make is testable. 

One of the problem with this technique is that it will result in method signature changes and that might mean that all the code in that call stack might need to change.
Just remember that you don't have to use spring to do this!

Extract and override
This is interesting one looks like silver bullet or Brahmastra for many problem.This pattern is used to have control on dependency that are hard to fake.
Assume there is function that makes some database/socket call and then does some calculation and you want to write unit test for calculation logic.

To make function testable you can do below changes
  • Extract database part of logic and put that in new function
  • Make that function protected. Thanks to OOPs , finally some good usecase for protected.
  • Write new class that extends original class and only override database interaction function(i.e was protected)
  • Use new class for your testing and you are done!

This is very powerful technique because you don't have to go through pain of changing  constructor/method parameter, so call stack remains intact.
Since it is based on method overriding, so you need to have discipline in team to make sure that fake class is only used for testing. 

Instance delegate
Used to test static function class. Create normal function that will delegate call to static function and during test create another class that will override new function that was created to add fake call .

Singleton make life very difficult from testing perspective, way to make it testable is allow injection of new singleton implementation and use that implementation for testing. 

Create interface to break big class
Although adding new method/class is much simpler but still lot of code is added to existing class/method and over period of time it becomes big i mean really big.

Approach to simplify class
  • Creation of interface without any method 
  • Class that you want to simplify should implements new interface
  • All the function that was using class should now use new interface and compiler will gives you errors about missing method and you can start moving methods to interface to fix the issue.

Main advantage for this approach is that you don't have prerequisite of unit test , you can take benefit of compiler/IDE feature. 

Method & variable dependency graph
 When class grows overtime and it does not adhere to Single responsibility principle, working out what functions goes where could be tedious.
You can draw dependency graph of variable & method to workout how things are related. This can help to come up with new classes will will adhere to single responsibility principle.

Identifying seam plays important role in working out hidden dependency. 
Definition of seam from dictionary is 
A line of junction formed by sewing together two pieces of material along their margins.

Definition in context of re-factoring is - part of code will enable testing.

for e.g functions performs some I/O operation and then some calculation, so if you want to test calculation without doing any major changes to core logic then you can "Extract & Override" approach to solve this.

Scratch Refactoring
I am sure you might have seen one big class/methods with 100s or 1000s of lines and you have to scratch your head to understand what this does.
To make things more interesting that part of code is very critical to company and you have to do some changes.
Scratch Refactoring is very useful for such type of situation

  • Take monster code that you want to simplify/understand
  • This type refactoring is only focused on understanding code, so it should be done in palin text editor so that you are not worried about compilation error that IDE generates.
  • Break big method in small and single concern method, simplify condition , delete some unused code

Main benefit of this approach is that you don't have to commit this in Svn/Git only purpose of this work is understand as much as possible by creating small code blocks.

While you are doing this you will get better understanding and code is simplified to extent that doing real re-factoring will be not be that difficult.

I must say it was very useful session, lot of learning and fresh perspective.
I have got copy of Working Effectively Legacy book and will read & practice it to get more better understanding.