Showing posts with label Inter Thread communication. Show all posts
Showing posts with label Inter Thread communication. Show all posts

Friday, 14 March 2014

Off Heap concurrent counter

Concurrent counter are part of almost every system, it is used to collect data, thread synchronization etc. 
Java has good support of heap based counter.

There could be use case when you need counter that can be shared between processor.

How to build inter process counters 
 - Database 
This is first option that comes to mind, database sequence is counter that can be used by multiple process. All concurrency is handled by database. It is good option for starter but we know types of overhead(Network,Locks etc) you get with database. Only Larry Elision will be happy about it, not you!
 - Some Server
You could develop some server/middleware that provides such type of service. This option will still have network latency,marshal/unmarshal overhead.

 - Memory Mapped file
You could use memory mapped file to do this. I got idea from looking at thread-safe-interprocess-shared-memory-in-java presentation from PeterLawrey.

Challenges involved in multi process counter.
- Data visibility 
    Changes done by one process should be visible to all the process. This problem can be solved by using memory mapped file. Operating System gives guarantee about it and java memory model supports it to make is possible. 

- Thread safety 
Counters is all about multiple writers , so thread safety becomes big problem. Compare-and-swap is one option to handles multiple writers.
Is it possible to use CAS for off heap operation ? yes it is possible to do that , welcome to Unsafe.
By using using Memorymapped & Unsafe it is possible to use CAS for Off heap operation.

In this blog i will share my experiment of Memory mapped using CAS.

How ?
- How to get memory address
MappedByteBuffer is using DirectByteBuffer, which is off heap memory. So it is possible to get virtual address of memory and use unsafe to perform CAS operation. Lets look at the code.


Above code create memory mapped file of 8 bytes and get the virtual address. This address can be be used to read/write content of memory mapped file.

- How to Write/read in thread safe manner

Important function to look are readVolatile and increment
readVolatile reads directly from memory and increment is using unsafe to perform CAS on the address obtained from MemoryByteBuffer.

-Performance
Some performance numbers from my system. Each thread increments counter 1 Million times.










Performance of counter is decent , as number of threads are increased CAS failures starts to happen and performance starts to degrade.
Performance of these counter can be improved by having multiple segment to reduce write contention.
I will write about it in next blog.

Conclusion
 - Memory mapped file is very powerful, it can be used to developed lot of things like off heap collections, IPC, Off heap thread coordination etc.
  - Memory mapped file opens gates for GC less programming.

All the code used in this blog is available on github.

Sunday, 16 June 2013

Fast ReadWrite Lock



Locks are bad for performance, but still they are good fit for some scenario.

In this post i will review java Read/Write lock and alternate implementation of fast Read/Write lock 

What is Readers-Writer lock
Lets do quick re-cap of read/write lock.

As name suggest it is shared lock , it is shared between readers/writers. So you can have either reader(s) or writer. You can have multiple readers or one writer active at any given point of time. So it is very good fit for the case where number of read operation is more than write.

What is the issue with Java RW lock
By design only reader(s) or writer can be active at any given point of time, so it will cause starvation when contention is high

Advance RW lock will have additional features like
  • Priority -   for eg writer has more priority than readers or otherwise
  • Fairness  - This can be based on waiting time
It is complex to build lock with these set of rules and many things are in control of OS, for eg scheduling of thread.
With all these issues rw locks are still used.

How does RW lock perform.
Lets try to measure rw lock performance under different scenario

Details of problem that will be used for testing

Read/Write to Hashmap, writer threads adds 1 Million entry to map and reader tries to read those value back, for write operation Write lock is used and for read operation Read lock is used.
I will try to test java rw lock under no-contention and gradually try to increase contention to see how it performs.

Legend to read graph

1W - One writer
1W-1R - One writer & One reader
1W-2R - One write & Two reader 

Writer Performance



X Axis - Writer/reader Threads
Y Axis - Time to 1 Milli entry









Case when there is "1 Writer & 1 Reader" shows wired number on my system, i will ignore that for comparison.

When there is no contention writer takes around 73 ms and as we add more reader threads contention increases, it starts to slow down, in my example when total thread count reaches to 4 performance drops by 10 times. That is big drop for "1Writer-3Reader" scenario, just think what will happen if more readers are added.

Writer Throughput
Lets have look at throughput



X Axis - Writer/reader Threads
Y Axis - Throughput/Sec








I will ignore case "1W-1R" 

Throughput also takes big hit under contention, it falls from sky to roof, for 1 writer throughput is 14 million and once total thread count reaches to 4 throughput drops to 1.4 million which is 90% less.

Writer is just one part of RW lock, lets look at  the performance of reader.

Reader Performance

X Axis - Writer/reader Threads
Y Axis - Time to add 1 Million entry








Performance degrade trend continues in readers also.

Reader Throughput

Reader throughput also takes hit under contention.








Readers without writer contention
This scenario is very realistic because 

  • Mostly writer thread will populate data at application start up time
  • Or based on some write event to update data incrementally

90% + times application only reads the data and it is the performance of readers threads that will matter most by the end of day. Lets have look at performance when only readers are active.




  
In this test data in hashmap was pre-populated and only readers threads were executing.
Numbers now looks real :-) . 

It takes more time to read data as number of readers increase, same thing is observed for throughput also, it drops.

Alternate Reader/Writer Lock
Lets compare alternate implementation of RW lock

Readers without writer contention

Fast Lock outperforms java ReadWriteLock in both the test. 

In the first test, time taken to read is almost flat with Fast RW Lock and for java based it is going up like stock price as we keep on adding more readers thread. 

In throughput test java rw lock starts from 28 million and it drops to 1.6 million, that is 95% drop.
Fast rw lock starts from 58 million and drops to 34 million, that is around 40% drop.

Under heavy contention(i.e 4 readers), java readerwriter lock is around 20X times slow, that looks like comparing apple vs orange, but test result shows that.

So for readers new lock looks great, lest measure with writer.

 Writer Performance - Java RW vs Fast RW

Fastlock performs better than java RW lock under contentions 

Reader Performance - Java RW vs Fast RW



What makes FastRW lock fast
Now lets understand why new read/write lock is fast.

RW lock that i have used for testing is an implementation of SeqLock. SeqLock is very common in linux world, it provide parallelism for readers & writer. Reader never blocks writer & reader uses optimistic approach to read the data.

Readers in seqlock never update any sate in ReadWrite lock and thus reduces cpu cache traffic,so just load operation is involved for reading data.

So in nutshell it provides 

  • Less cpu cache traffic  
  • Readers never block writer
  • Reader never changes state of lock, no expensive store operation
  • Readers can slow down if write frequency is very high
  • Writer always get priority 


With so much of benefit i am sure you will never use java RW lock!

Conclusion 
SeqLock are very good alternative for reader/writer lock and especially in multicore world where we are use new technique to get best out of processor. Someday java will add support for seqlock.
Code is available @ GitHub for your reference.

Sunday, 19 May 2013

Lock Less Java Object Pool

It is being while i wrote anything, i has been busy with my new job that involve doing some interesting work in performance tuning. One of the challenge is to reduce object creation during critical part of application.

Garbage Collection hiccups has been main pain point in java for some time, although java has improved over time with GC algorithmic. Azul is market leader developing pause less GC but Azul JVM are not free as speech!

Creating too many temporary/garbage object does't work too well because it create work for GC and it is going to have effect on latency, too much garbage also does't work well with multi core system because it causes cache pollution.

So how should we fix this ?

Garbage less coding
This is only possible if you know how many object you need upfront and pre-allocate them, but in reality that is very difficult to find that , but in-case if you still managed to do that then you have to worry about another issue

  • You might not have enough memory to hold all the object you need
  • You have to handle concurrency
So what is the solution for above problem
There is Object Pool design pattern that can address both of the above issue,it lets you to specify num of object that you need in pool and handles concurrent request to serve object request.

Object Pool has been base of many application that has low latency requirement, flavor of object pool is Flyweight design pattern.

Both of above pattern will help us in avoiding object creation, that is great so now GC work is reduced and in theory our application performance should improve but in practical does't happen that way because Object Pool/Flyweight has to handle concurrency and  whatever advantage you get by avoiding object creation is lost because of concurrency issue.

What are most common way to handle concurrency
Object pool is typical producer/consumer problem and it can be solved by using following techniques 

Synchronized - This was the only way to handle concurrency before JDK 1.5, apache has written wonder full object pool API based on synchronized 

Locks   - Java added excellent support for concurrent programming since JDK 1.5, there has been some work to use Locks to develop Object Pool for eg furious-objectpool

Lock Free - I could not find any implementation that is built using fully lock free technique, but furious-objectpool use mix of ArrayBlocking queue & ConcurrentLinked queue

Lets measure performance
In this test i have created pool of 1 Million object and those object are accessed by different pool implementation, objects are taken from pool and return back to pool.

This test first starts with 1 thread and then number of threads are increased to measure how different pool implementation perform under contention


.
X Axis - No Of Threads
Y Axis - Time in Ms - Lower time is better

This test include pool from Apache, Furious Pool & ArrayBlocking based Pool

Apache one is worst and as number of threads increase performance degrades further and reason for same is Apache pool is based on heavy use of "synchronized" 

Other two Furious & ArrayBlocking based pool performs better but both of them also slows down as contention increase. 

ArrayBlocking queue based pool takes around 1000 ms for 1 Million items when 12 threads are trying to access the pool, Furious pool which internally uses Arrayblocking queue takes around 1975 ms for same thing. 

I have to do some more detail investigation to find out why Furious is taking double time because it is also based on ArrayBlocking queue.

Performance of arrayblocking queue is decent but it is lock based approach, what type of performance we can get if we can implement lock free pool.

Lock free pool.
Implementing lock free pool is not impossible but bit difficult because you have to handle multiple producer & consumer.

I will implement hybrid pool which will use lock on the producer side & non blocking technique on the consumer side.

Lets have look some numbers 

I performed same test with new implementation (FastPool) and it is almost 30% faster than ArayBlocking queue.

30% improvement is not bad, it can definitely help is meeting latency goal.

What makes Fast Pool fast!
I used couple of technique to make it fast
  • Producer are lock based - Multiple producer are managed using locks, this is same as Array Blocking queue, so nothing great about this.


  • Immediate publication of released item - it publishes element before lock is released using cheap memory barrier. This gives some gain


  • Consumer are non blocking - CAS is used to achieve this, consumer are never blocked due to producer. Array Blocking queue blocks consumer because it use same lock for producer & consumer


  • Thread Local to maintain value locality -  Thread Local is used to acquire last value that was used, this reduces contention to great extent. 
If you are interested in having look at code then it is available @ FastObjectPool.java 

Friday, 30 November 2012

Lock Free bounded queue

Lock free bounded queue

JDK 1.5 was first step towards adding serious concurrency support to java, it changed  the way you write concurrent program and in other version 1.6, 1.7 etc more concurrency support. 

One of the important building block for concurrent programming is Queue, java has lot of them ArrayBlockingQueue, ArrayDeque, ConcurrentLinkedQueue, DelayQueue, LinkedBlockingDeque, LinkedBlockingQueue, PriorityBlockingQueue, PriorityQueue, SynchronousQueue.
I am sure you will find one Queue that will fit your need!

Locks are used in most of the queue for managing concurrent access to data structure because of obvious reason of simplicity to use, lock based strategy makes it difficult to develop low latency system, it could result in priority inversion, context switching, deadlock etc 
Alternate way is using Non Blocking Algorithm , which is bit difficult to develop, one of the non blocking algorithm is lock free algorithm

Java added support for look free via java.util.concurrent.atomic package, so now time for some action with lock free api.

Java 1.5+ has lock free queue ConcurrentLinkedQueue, which is very fast but it is unbounded and such type queue can be very risky to use real application because most of the time queue is not balanced, it is either full or empty and if it is unbounded then system can run out of resource.

I will implement lock free bounded queue and measure it performance against lock based bounded queue(ArrayBlocking), lock free unbounded queue(ConcurrentLinkedQueue). I chose these queue because they are fastest queue available in JDK.

Number Game




  Test Envinorment Details
  OS : Windows 7, 64 bit
  Processor : Intel (R) Core(TM) i7-2820QM CPU @ 2.30 GHz
  Processor Arch : Sandy Bridge , 4 physical cores/8 threads

  For this test 10 Million message was added to queue using single producer and consumer number are in range of 1 to 5, non blocking call(offer/poll) call is used to produce/consume message from queue, performance is best when there is 1 consumer and as number of consumer increase performance degrades, so it is evident that contention is the culprit. 

  Cpu Usage

Cpu usage for LockFreeBoundedQueue





Cpu usage for ArrayBlockingQueue

From CPU usage it is clear that contention is causing drop in cpu usage and with lock free queue CPU usage is in acceptable range.

Test Code of LockFreeBoundedQueue

Link To Code

Above code is just an idea to demonstrate that how powerful non blocking algorithm can be. This code can be enhanced to add more feature like multiple producer, support for blocking call put/take.
Before adding blocking support(put/take) to queue we have to think about blocking strategy, there are couple of option spinning, hybrid spinning etc.

  

Friday, 2 November 2012

Power of Java MemoryMapped File

Power of Java MemoryMapped File


In JDK 1.4 interesting feature of Memory mapped file was added to java, which allow to map any file to OS memory for efficient reading. Memory mapped file can be used to developed  IPC type of solution. This article is experiment with memory mapped file to create IPC.

Some details about Memory Mapped File, definition from WIKI


A memory-mapped file is a segment of virtual memory which has been assigned a direct byte-for-byte correlation with some portion of a file or file-like resource. This resource is typically a file that is physically present on-disk, but can also be a device, shared memory object, or other resource that the operating system can reference through a file descriptor. Once present, this correlation between the file and the memory space permits applications to treat the mapped portion as if it were primary memory.


Sample Program

There are two java program one is writer and other is reader. Writer is producer and tries to write to Memory Mapped file , reader is consumer and it reads message from memory mapped file. This is just a sample program to show to idea, it does't handle many edge case but good enough to build something on top of memory mapped file.

MemoryMapWriter


import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MemoryMapWriter {

public static void main(String[] args) throws FileNotFoundException, IOException, InterruptedException {
File f = new File("c:/tmp/mapped.txt");
f.delete();
FileChannel fc = new RandomAccessFile(f, "rw").getChannel();
long bufferSize=8*1000;
MappedByteBuffer mem =fc.map(FileChannel.MapMode.READ_WRITE, 0, bufferSize);
int start = 0;
long counter=1;
long HUNDREDK=100000;
long startT = System.currentTimeMillis();
long noOfMessage = HUNDREDK * 10 * 10;
for(;;)
{
if(!mem.hasRemaining())
{
start+=mem.position();
mem =fc.map(FileChannel.MapMode.READ_WRITE, start, bufferSize);
}
mem.putLong(counter);
counter++;
if(counter > noOfMessage )
break;
}
long endT = System.currentTimeMillis();
long tot = endT - startT;
System.out.println(String.format("No Of Message %s , Time(ms) %s ",noOfMessage, tot)) ;
}

}