Sunday, 19 May 2013

Lock Less Java Object Pool

It has been a while since I wrote anything; I have been busy with my new job, which involves some interesting performance tuning work. One of the challenges is to reduce object creation in the critical parts of the application.

Garbage collection hiccups have been a major pain point in Java for some time, although Java's GC algorithms have improved over the years. Azul is the market leader in developing pauseless GC, but the Azul JVM is not free as in speech!

Creating too many temporary/garbage objects doesn't work well because it creates work for the GC, and that has an effect on latency; too much garbage also doesn't play well with multi-core systems because it causes cache pollution.

So how should we fix this?

Garbage-less coding
This is only possible if you know upfront how many objects you need and pre-allocate them, but in reality that is very difficult to determine. Even if you manage to do it, you have two more issues to worry about:

  • You might not have enough memory to hold all the objects you need
  • You have to handle concurrency
So what is the solution to the above problems?
The Object Pool design pattern addresses both issues: it lets you specify the number of objects you need in the pool and handles concurrent requests for those objects.

The object pool has been the basis of many applications with low-latency requirements; a close relative of the object pool is the Flyweight design pattern.

Both of the above patterns help us avoid object creation. That is great: GC work is reduced, and in theory our application's performance should improve. In practice it doesn't happen that way, because the object pool/flyweight has to handle concurrency, and whatever advantage you gain by avoiding object creation is lost to the concurrency overhead.

What are the most common ways to handle concurrency?
An object pool is a typical producer/consumer problem and can be solved using the following techniques:

Synchronized - This was the only way to handle concurrency before JDK 1.5; Apache has written a wonderful object pool API based on synchronized.
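A minimal sketch of the synchronized approach (illustrative names only, not the actual Apache Commons Pool API) shows why it suffers under contention: every taker and releaser serializes on the same monitor.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative sketch of a synchronized, fixed-content object pool
// (NOT the Apache Commons Pool API; all names here are made up).
class SynchronizedPool<T> {
    private final Deque<T> free = new ArrayDeque<>();

    SynchronizedPool(Iterable<T> preAllocated) {
        for (T t : preAllocated) free.add(t);   // pool is pre-filled upfront
    }

    // Every caller serializes on the same monitor; as thread count grows,
    // this single lock becomes the bottleneck.
    public synchronized T take() throws InterruptedException {
        while (free.isEmpty()) wait();          // block until an object is released
        return free.poll();
    }

    public synchronized void release(T object) {
        free.add(object);
        notify();                               // wake one waiting taker
    }
}
```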

Locks - Java has had excellent support for concurrent programming since JDK 1.5, and there has been some work on using locks to build object pools, e.g. furious-objectpool.

Lock free - I could not find any implementation built using a fully lock-free technique, but furious-objectpool uses a mix of ArrayBlockingQueue and ConcurrentLinkedQueue.
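For comparison, a pool backed by the JDK's ArrayBlockingQueue needs very little code. This is a sketch with illustrative names, not the benchmarked implementation:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.function.Supplier;

// Illustrative sketch of a pool backed by the JDK's lock-based
// ArrayBlockingQueue (names are made up, not the benchmarked code).
class BlockingQueuePool<T> {
    private final ArrayBlockingQueue<T> free;

    BlockingQueuePool(int size, Supplier<T> factory) {
        free = new ArrayBlockingQueue<>(size);
        for (int i = 0; i < size; i++) free.add(factory.get()); // pre-allocate
    }

    T take() throws InterruptedException { return free.take(); } // blocks when empty
    void release(T object) { free.offer(object); }               // no-op if already full
}
```

Note that ArrayBlockingQueue guards both ends with a single ReentrantLock, which is why consumers and producers contend with each other.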

Let's measure performance
In this test I created a pool of 1 million objects and accessed it through the different pool implementations; objects are taken from the pool and returned to it.

The test starts with 1 thread, and then the number of threads is increased to measure how the different pool implementations perform under contention.
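The shape of such a test can be sketched like this (an illustrative harness, not the exact benchmark that produced the numbers below):

```java
import java.util.concurrent.ArrayBlockingQueue;

// Rough sketch of the benchmark described above: N threads repeatedly take an
// object from the pool and return it, and the whole run is timed.
class PoolBench {
    static long run(ArrayBlockingQueue<Object> pool, int threads, int opsPerThread)
            throws InterruptedException {
        Thread[] workers = new Thread[threads];
        long start = System.nanoTime();
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < opsPerThread; i++) {
                    try {
                        Object o = pool.take();   // borrow from the pool
                        pool.put(o);              // return it immediately
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        return;
                    }
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) w.join();
        return (System.nanoTime() - start) / 1_000_000; // elapsed ms
    }
}
```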


[Chart: X axis - number of threads; Y axis - time in ms (lower is better)]

The test includes the pool from Apache, Furious Pool, and an ArrayBlockingQueue-based pool.

The Apache pool is the worst, and as the number of threads increases its performance degrades further; the reason is that the Apache pool makes heavy use of "synchronized".

The other two, Furious and the ArrayBlockingQueue-based pool, perform better, but both also slow down as contention increases.

The ArrayBlockingQueue-based pool takes around 1000 ms for 1 million items when 12 threads are accessing the pool; Furious Pool, which internally uses ArrayBlockingQueue, takes around 1975 ms for the same workload.

I have to do some more detailed investigation to find out why Furious takes double the time, since it is also based on ArrayBlockingQueue.

The performance of ArrayBlockingQueue is decent, but it is a lock-based approach. What kind of performance could we get from a lock-free pool?

Lock-free pool
Implementing a fully lock-free pool is not impossible, but it is a bit difficult because you have to handle multiple producers and consumers.

I will implement a hybrid pool that uses a lock on the producer side and a non-blocking technique on the consumer side.

Let's have a look at some numbers.

I ran the same test with the new implementation (FastPool), and it is almost 30% faster than the ArrayBlockingQueue-based pool.

A 30% improvement is not bad; it can definitely help in meeting latency goals.

What makes FastPool fast?
I used a couple of techniques to make it fast:
  • Producers are lock based - Multiple producers are managed using locks. This is the same as ArrayBlockingQueue, so nothing special here.
  • Immediate publication of released items - An element is published before the lock is released, using a cheap memory barrier. This gives some gain.
  • Consumers are non-blocking - CAS is used to achieve this; consumers are never blocked by producers. ArrayBlockingQueue blocks consumers because it uses the same lock for producers and consumers.
  • Thread locals for value locality - A ThreadLocal is used to remember the last slot a thread used, which reduces contention to a great extent.
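The techniques above can be sketched in a much-simplified form. This is an illustrative toy, not the actual FastObjectPool: release() takes a lock (lock-based producers), take() only spins with CAS (non-blocking consumers), and a ThreadLocal remembers the slot a thread last used.

```java
import java.util.concurrent.atomic.AtomicReferenceArray;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Supplier;

// Much-simplified sketch of the hybrid idea; all names are illustrative.
class HybridPool<T> {
    private final AtomicReferenceArray<T> slots;
    private final ReentrantLock producerLock = new ReentrantLock();
    private final ThreadLocal<Integer> lastSlot = ThreadLocal.withInitial(() -> 0);

    HybridPool(int size, Supplier<T> factory) {
        slots = new AtomicReferenceArray<>(size);
        for (int i = 0; i < size; i++) slots.set(i, factory.get()); // pre-allocate
    }

    // Non-blocking consumer: CAS a slot from non-null to null, starting at the
    // thread's last successful slot to reduce contention.
    T take() {
        int n = slots.length(), start = lastSlot.get();
        for (;;) {
            for (int i = 0; i < n; i++) {
                int idx = (start + i) % n;
                T item = slots.get(idx);
                if (item != null && slots.compareAndSet(idx, item, null)) {
                    lastSlot.set(idx);
                    return item;
                }
            }
            Thread.onSpinWait(); // pool momentarily empty; spin
        }
    }

    // Lock-based producer: the volatile write in slots.set publishes the item
    // to consumers before the lock is released.
    void release(T item) {
        producerLock.lock();
        try {
            for (int i = 0; i < slots.length(); i++) {
                if (slots.get(i) == null) { slots.set(i, item); return; }
            }
        } finally {
            producerLock.unlock();
        }
    }
}
```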
If you are interested in having a look at the code, it is available at FastObjectPool.java.

14 comments:

  1. Would it be possible to license the FastObjectPool under Apache License 2? :-)

    1. Thanks. That is a great idea; I can do that, but first I have to tidy up the code, because it was written to experiment with object pools.

    2. Cool, thanks. I am thinking about using it as part of Apache DirectMemory, so it would be great to have it under an AL2 license (or something else compatible with AL2 :-))

      Thanks, Chris

    3. Didn't find time to do the changes, but I have put it on GitHub with AL2.
      Link - https://github.com/ashkrit/blog/tree/master/FastObjectPool

      Apache DirectMemory seems an interesting project; is it used in any large-scale projects?

  2. Is there any chance of modifying your FastObjectPool so it behaves more like a ConcurrentLinkedQueue, so that take and release return and consume type T instead of Holder objects, and users of the library don't have to mess around with managing Holder objects everywhere?

    It would also be great if you didn't have to set the size at creation, so that the pool grows on demand as objects are released to it.

    The problem with ConcurrentLinkedQueue is that its offer method creates garbage, at line 327:
    final Node<E> newNode = new Node<E>(e);

    Your FastObjectPool does not seem to have this problem, which makes it better suited for an object pool, in addition to its speed.

    1. It is harder than you think to remove the requirement to have the Holder class, but the problem is that you have to maintain state of the object somewhere. Alternatively you can store the state in a separate array inside the ObjectPool for all objects in the pool and do a lookup whenever you add/remove an object from the pool. Unfortunately to make such a system thread safe again would probably lose a significant amount of throughput.

  3. Your idea would make it much cleaner, but the reason I did it that way was to maintain state related to the object, i.e. whether it is used or free.
    If I start returning T objects then I also have to find a better way to do the ThreadLocal optimization I used for reducing contention.

    Another option that came to mind during implementation was a dynamic proxy, but that would be over-engineering for a simple problem, so I chose this trade-off.

    Regarding flexible size: most of the object pools I have seen/used are bounded by size, and it is good to keep it that way to make your system scalable.
    If the load is more than the pool size then it is better to have a proper waiting strategy or to adjust the pool size declaratively.

    You are right that the Java ConcurrentLinkedQueue-based implementation creates garbage and is unbounded, due to which your system can behave very differently under burst traffic.
    Another reason for not using ConcurrentLinkedQueue is that it does random memory walks because it is based on a linked list, and random memory walks are not good for low-latency systems.

  4. 1. Performance

    Object pooling provides better application performance: object creation is not done when the client actually needs the object; instead, objects are created in advance in the pool and are readily available at any time. Since object creation happens well before use, it helps achieve better run-time performance.

    2. Object sharing:

    Object pooling encourages the concept of sharing. Objects in the pool can be shared among multiple worker threads: one thread uses an object, returns it to the pool when done, and another worker thread can then use it. Once created, objects are not destroyed, so repeated destruction and creation is avoided, which again helps performance.

    3. Control over object instances:
    By declaring the size of the object pool we can control the number of instances created. A finite number of objects is created, sized according to the required application capacity, scalability, or peak load.

    4. Memory conservation:
    A finite number of instances is created, so it helps with memory management; too many instances are not created.

    Read through extensive details here:

    http://efectivejava.blogspot.in/2013/09/8-reasons-why-object-pooling-is.html

  5. Nice article Ashkrit. I would like to see a comparison with the Disruptor pattern solving the same issue, using a ring buffer that references the objects. I believe this might perform better, as it is a CAS between producers, a CAS between consumers, and a volatile write for publishing, with no lock anywhere (whereas you say FastPool has one among the producers).

    1. Nice to hear that you found it useful.
      It is a good idea to benchmark it against the Disruptor, but the Disruptor gives a message-ordering guarantee, and it is a little difficult to use such a library for object pooling. I will give it some thought.

  6. //if(holder!=null && THE_UNSAFE.compareAndSwapObject(objects, (index*INDEXSCALE)+BASE, holder, null))
    if (holder != null && THE_UNSAFE.compareAndSwapObject(objects, (index << ASHIFT) + BASE, holder, null)) {

    Could you please explain how you replaced the multiplication with a shift? Especially the ASHIFT part.

    1. I am getting the same offset result for indexScale=2 and =3 - is it an alignment thing or what?

    2. It is not the same as plain multiplication; that expression is used to calculate the memory offset for accessing a specific element index.

      You can see similar examples in the AtomicIntegerArray class, and I have taken it from there.

      http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/8u40-b25/java/util/concurrent/atomic/AtomicIntegerArray.java#AtomicIntegerArray.checkedByteOffset%28int%29

      The index scale (arrayIndexScale) is the reference size, and it depends on the JVM: for heaps under 32 GB it is 4, and for anything over that it is 8. I have not tested this on a bigger JVM, but the JDK code expects the index scale to be a power of 2.

      ASHIFT is derived using numberOfLeadingZeros of the index scale, so for 4/8 the possible values are 2/3.
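      To make that concrete, here is a small illustrative sketch (not the pool's actual code) showing that, for a power-of-two index scale, shifting is equivalent to multiplying, with the shift derived via numberOfLeadingZeros as in AtomicIntegerArray:

```java
// Illustrates the shift trick discussed above: for a power-of-two scale,
// (index << shift) + base equals index * scale + base, where
// shift = 31 - Integer.numberOfLeadingZeros(scale).
class OffsetDemo {
    static long offset(int index, int scale, long base) {
        int shift = 31 - Integer.numberOfLeadingZeros(scale); // 4 -> 2, 8 -> 3
        return ((long) index << shift) + base;
    }
}
```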
