
Tuesday, 15 March 2022

Concurrent Heap data structure

Let's Heapify!!!




A heap is a very popular data structure used for solving Top X types of problems.

For example, finding the top 10 popular items by sales volume, the top X users by activity, etc.

Java's PriorityQueue data structure is based on a heap and can help in answering any top X type of query. PriorityQueue is not thread safe, though, so it can't be used in a highly concurrent environment without adding a lock.
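
For example, a single-threaded top-3 by sales volume with a bounded PriorityQueue could look like this (a small illustrative snippet, not code from this post):

import java.util.PriorityQueue;

// Single-threaded baseline: keep the 3 largest values in a min-heap;
// the smallest of the current top 3 sits at the head and gets evicted.
public class Top3Demo {
    public static void main(String[] args) {
        PriorityQueue<Long> top3 = new PriorityQueue<>();
        for (long sales : new long[]{40, 900, 120, 700, 55, 300}) {
            top3.offer(sales);
            if (top3.size() > 3) {
                top3.poll(); // evict the smallest of the candidates
            }
        }
        System.out.println(top3); // contains 300, 700, 900
    }
}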

The underlying data structure of a heap is an array, and elements are shifted up and down to maintain the element order; for each swim/sink operation the full array must be locked down to avoid race conditions.

Even read operations like poll are mutating operations, which makes it hard to share a heap between multiple threads.

The underlying algorithm makes it very hard to use a heap in a concurrent or parallel environment.

Heap - Source: Wikipedia

Let's look at other options for achieving heap-like functionality without giving up on concurrency.

A concurrent heap data structure needs the following properties:

  • Highly concurrent ordered collection.
  • Parallel write/read support.
  • Top X type of API.
  • Multiple top operations supported concurrently on the same instance of the data structure.

The concurrent skip list from the JDK looks like a good candidate for this, but we need to add some missing functionality.

Let's recap how the skip list data structure looks.


SkipList - Source: Wikipedia

A skip list is an ordered, multi-level linked list: it has some fast lanes and slow lanes. The fast lanes allow finding an element at approximately log(n) cost.


- Unique Item identification 

The JDK has set and map implementations of the skip list. A map/set can only have unique keys, so we need a way to work around the unique-key requirement to make the set behave like a heap.

We can use a little trick: every value added to the skip list carries additional metadata, such as a running sequence number or timestamp, and this extra metadata is used to resolve the conflict when 2 items compare as equal.

Let's take the product-by-sales use case for the code samples.

SalesItem is Comparable and it compares items by sales volume. 

class SalesItem implements Comparable<SalesItem> {

    private final String product;
    private final long sales;

    SalesItem(String product, long sales) {
        this.product = product;
        this.sales = sales;
    }

    @Override
    public int compareTo(SalesItem o) {
        return Long.compare(sales, o.sales);
    }
}

We can't add SalesItem directly to the skip list because items with the same sales volume will be rejected as duplicates.

We can add a wrapper class that carries the extra metadata to handle this problem. As an inner class of the heap (shown in full below), it will look something like this:


class Item implements Comparable<Item> {
    private final T value;
    private final long index;

    @Override
    public int compareTo(Item o) {
        int r = this.value.compareTo(o.value);
        r = heapType.equals(HeapType.Max) ? -r : r;
        if (r != 0) {
            return r;
        }
        return Long.compare(index, o.index);
    }
}

index is the extra metadata added to handle items with the same sales volume; in case of a conflict it orders by index.

 - TopX API

For the TopX API, Stream.limit can be used; another benefit of exposing the Streams API is that the client application can use its other cool features.

Full Code for Concurrent Heap

import java.util.NavigableSet;
import java.util.concurrent.ConcurrentSkipListSet;
import java.util.concurrent.atomic.AtomicLong;
import java.util.stream.Stream;

public class ConcurrentHeap<T extends Comparable<T>> {

    private final AtomicLong id = new AtomicLong();
    private final NavigableSet<Item> data = new ConcurrentSkipListSet<>();
    private final HeapType heapType;

    public ConcurrentHeap(HeapType heapType) {
        this.heapType = heapType;
    }

    public void add(T value) {
        data.add(new Item(value, id.incrementAndGet()));
    }

    public Stream<T> stream() {
        return data
                .stream()
                .map(v -> v.value);
    }

    public Stream<T> top(int x) {
        return stream().limit(x);
    }

    class Item implements Comparable<Item> {
        private final T value;
        private final long index;

        Item(T value, long index) {
            this.value = value;
            this.index = index;
        }

        @Override
        public int compareTo(Item o) {
            int r = this.value.compareTo(o.value);
            r = heapType.equals(HeapType.Max) ? -r : r;
            if (r != 0) {
                return r;
            }
            return Long.compare(index, o.index);
        }
    }
}
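
A quick usage sketch (assuming a HeapType enum with Max/Min values, which the code references but this post does not show):

ConcurrentHeap<SalesItem> heap = new ConcurrentHeap<>(HeapType.Max);
heap.add(new SalesItem("laptop", 500));
heap.add(new SalesItem("phone", 900));
heap.add(new SalesItem("tablet", 500)); // an equal sales volume is fine: the index breaks the tie
heap.top(2).forEach(System.out::println); // streams the 2 highest-sales items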


The underlying data structure behaving like a heap is NavigableSet; the JDK has 2 implementations of this, the first one is TreeSet and the other is ConcurrentSkipListSet.

We can choose between TreeSet/ConcurrentSkipListSet based on need, to avoid paying the cost of concurrency in a single-threaded environment.

 

Full working code for this blog post is available @ github 

Saturday, 14 September 2013

Concurrent Counter With No False Sharing

This blog is a continuation of the Scalable Counter post.

One of the readers shared results from his system. He ran the test on an Intel XEON processor with 16 cores, and the total time taken by each type of counter was almost the same; although the atomic counter has CAS failures and the other types don't have any, it makes no difference in execution time.
A very strange result that needed further investigation.

Another reader pointed out that it could be due to false sharing, so it was worth taking that into account, and I created another class that handles FALSE SHARING.

Time taken by different counters
Y Axis - Time taken to increment 1 Million times

X Axis - Number of threads

PaddedAtomicCounter is a new type of counter that I have added to the test, and it outperforms all the other counters.
It uses cache line padding to avoid false sharing.
A cache line is 64 bytes on most of today's processors. PaddedCounter is an Integer-based counter, so it adds 16 slots per counter; by using this technique we avoid cache pollution, and as a result we see a 16X gain compared to AtomicCounter: without padding the gain was 5X, and with cache line padding it jumps to 16X.
With cache line padding you need some extra space, so it is a trade-off of memory for speed; you can choose what you want!
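
A minimal sketch of the padding idea (the class name and slot math are illustrative, not the actual PaddedAtomicCounter from the test):

import java.util.concurrent.atomic.AtomicIntegerArray;

public class PaddedCounterSketch {

    private static final int PADDING = 16; // 16 ints * 4 bytes = one 64-byte cache line
    private final AtomicIntegerArray slots;
    private final int mask;

    public PaddedCounterSketch(int nSlots) { // nSlots must be a power of two
        slots = new AtomicIntegerArray(nSlots * PADDING);
        mask = nSlots - 1;
    }

    public void increment() {
        // used slots are 16 ints apart, so two threads writing to different
        // slots do not touch the same cache line (makes false sharing unlikely)
        int slot = ((int) Thread.currentThread().getId() & mask) * PADDING;
        slots.incrementAndGet(slot);
    }

    public long get() {
        long sum = 0; // a read aggregates all the slots
        for (int i = 0; i <= mask; i++) {
            sum += slots.get(i * PADDING);
        }
        return sum;
    }
}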

CAS failure rate
Let's look at the CAS failures for the different counters & what they mean for performance.


Y Axis - CAS Failure in 100Ks

X Axis - Number of threads

PaddedAtomic has some CAS failures compared to the other counters, but it makes no difference in the execution time of the counter.
CAS failure is not the only factor that determines execution time; false sharing makes a significant contribution to it, so this gives a good explanation of the behavior seen on the XEON processor.

Conclusion
To get better performance you have to take care of a few things

 - Contention - There are many techniques to avoid it; this blog shows one of them.

 - False Sharing - You have to avoid false sharing to get the best out of the processor, and padding is required for that. One of the JDK classes using padding is ThreadLocalRandom; now we have the @Contended annotation from Java to achieve the same thing, and it is used in ForkJoinPool.

Code is available @ github

   

Tuesday, 10 September 2013

Scalable Counters For Multi Core

Counters are required everywhere, e.g. to find a key KPI of an application, the load on the application, the total number of requests served, KPIs for application throughput, and many more.

On top of these requirements comes the complexity of concurrency, and that makes this problem interesting.

How to implement a concurrent counter

 - Synchronized - This was the only option before JDK 1.5; now that we are waiting for the JDK 8 release, this is definitely not the option.

- Lock based - You should never attempt this for a counter; it will perform very badly.

- Wait Free - Java doesn't have support for fetch-and-add, so it is a bit difficult to implement.

- Lock free - With very good support for compare-and-swap, this looks like a good option to use; a sketch of the CAS retry loop follows below.
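
This is roughly the retry loop behind AtomicInteger.incrementAndGet (a simplified sketch, not the JDK's exact code):

import java.util.concurrent.atomic.AtomicInteger;

public class CasCounter {

    private final AtomicInteger value = new AtomicInteger();

    public int increment() {
        for (;;) {
            int current = value.get();
            int next = current + 1;
            if (value.compareAndSet(current, next)) {
                return next; // CAS succeeded
            }
            // CAS failed: another thread updated the value first, so retry
        }
    }
}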

How does a compare-and-swap based counter perform?

I used AtomicInteger for this test; the counter is incremented 1 million times by each thread, and to increase the contention the number of threads is increased gradually.

Test Machine Details
OS : Windows 8
JDK : 1.7.0.25
CPU : Intel i7-3632QM , 8 Core
RAM : 8 GB

Y Axis - Time taken to increment 1 Million times

X Axis - Number of threads

As the number of threads increases, the time taken to increment the counter increases, and that is due to contention.
For a CAS based counter, it is CAS failure that causes the slowdown.

Is this the best performance that we can get? Definitely not; there are better solutions for implementing a concurrent counter, so let's have a look at them.

Alternate Concurrent Counters
Let's look at some of the solutions for implementing a counter that handle contention in a better way; a sketch of the striping idea follows below.

- Core Based Counter - Maintain a counter for each logical core, so that you have less contention. The only issue with this type of counter is that if the number of threads is more than the number of logical cores, then you will start noticing contention.

- Thread Based Counter - Maintain counters for the total number of threads that will be using the system. This works well when the number of threads is more than the number of logical cores.
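
A minimal sketch of the core-based flavor (slot selection by thread id is the simple approach used here; the class name is illustrative):

import java.util.concurrent.atomic.AtomicLongArray;

public class StripedCounter {

    private final AtomicLongArray slots =
            new AtomicLongArray(Runtime.getRuntime().availableProcessors());

    public void increment() {
        // each thread hashes to a slot, spreading contention across the slots
        int slot = (int) (Thread.currentThread().getId() % slots.length());
        slots.incrementAndGet(slot);
    }

    public long sum() {
        long total = 0; // a read aggregates all the slots
        for (int i = 0; i < slots.length(); i++) {
            total += slots.get(i);
        }
        return total;
    }
}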


Let's test it

Time taken by different types of counters

Y Axis - Time taken to increment 1 Million times

X Axis - Number of threads

The concurrent counters perform much better than the atomic counter; for 16 threads they are around 5X better, and that is a huge difference!

CAS Failure Rate

Y Axis - CAS Failure in 100Ks

X Axis - Number of threads

Due to contention, the atomic counter sees a lot of failures, and they go up exponentially as I add more threads, while the other counters perform pretty well.

Observation
Multi core machines are becoming easily available, and we have to change the way we handle concurrency; the traditional way of doing concurrency is not going to scale at a time when having a 24 or 48 core server is very common.

 - To reduce the contention you have to use multiple counters and then aggregate them later.

 - A core based counter works well if the number of threads is less than or the same as the number of cores.

 - A thread based counter is good when the number of threads is much more than the available cores.

 - The key to reducing contention is identifying the counter to which a thread will write; I have used a simple approach based on thread id, but much better approaches are available, look at ThreadLocalRandom of JDK 8 for some ideas.

 - The thread based approach is used in LongAdder of JDK 8, which creates many slots to reduce contention.

Code for all the counters used in this test is available @ Github

Sunday, 16 June 2013

Fast ReadWrite Lock



Locks are bad for performance, but they are still a good fit for some scenarios.

In this post I will review the Java read/write lock and an alternate implementation of a fast read/write lock.

What is Readers-Writer lock
Let's do a quick recap of the read/write lock.

As the name suggests it is a shared lock: it is shared between readers and writers, so you can have either reader(s) or a writer. You can have multiple readers or one writer active at any given point in time, so it is a very good fit for cases where the number of read operations is higher than the number of writes.

What is the issue with Java RW lock
By design only readers or a writer can be active at any given point in time, so it will cause starvation when contention is high.

An advanced RW lock will have additional features like
  • Priority - e.g. the writer has more priority than the readers, or the other way around
  • Fairness - this can be based on waiting time
It is complex to build a lock with these sets of rules, and many things are in the control of the OS, e.g. thread scheduling.
With all these issues, RW locks are still used.

How does the RW lock perform?
Let's try to measure RW lock performance under different scenarios.

Details of the problem that will be used for testing:

Read/write to a HashMap: writer threads add 1 million entries to the map and readers try to read those values back; for the write operation the write lock is used, and for the read operation the read lock is used.
I will test the Java RW lock under no contention and gradually increase contention to see how it performs.
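
The guarded map under test looks roughly like this (a sketch assuming a ReentrantReadWriteLock around a plain HashMap, as described above):

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class GuardedMap<K, V> {

    private final Map<K, V> map = new HashMap<>();
    private final ReadWriteLock lock = new ReentrantReadWriteLock();

    public void put(K key, V value) {
        lock.writeLock().lock(); // exclusive: no readers, no other writers
        try {
            map.put(key, value);
        } finally {
            lock.writeLock().unlock();
        }
    }

    public V get(K key) {
        lock.readLock().lock(); // shared: many readers, but no writer
        try {
            return map.get(key);
        } finally {
            lock.readLock().unlock();
        }
    }
}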

Legend to read graph

1W - One writer
1W-1R - One writer & One reader
1W-2R - One writer & Two readers

Writer Performance



X Axis - Writer/Reader Threads
Y Axis - Time to add 1 Million entries
The "1 Writer & 1 Reader" case shows weird numbers on my system; I will ignore it for the comparison.

When there is no contention the writer takes around 73 ms, and as we add more reader threads contention increases and it starts to slow down; in my example, when the total thread count reaches 4, performance drops by 10 times. That is a big drop for the "1 Writer - 3 Readers" scenario; just think what will happen if more readers are added.

Writer Throughput
Let's have a look at the throughput.



X Axis - Writer/Reader Threads
Y Axis - Throughput/Sec
I will ignore the "1W-1R" case.

Throughput also takes a big hit under contention, falling off a cliff: for 1 writer the throughput is 14 million, and once the total thread count reaches 4 the throughput drops to 1.4 million, which is 90% less.

The writer is just one part of the RW lock; let's look at the performance of the readers.

Reader Performance

X Axis - Writer/Reader Threads
Y Axis - Time to add 1 Million entries
The performance degradation trend continues for the readers also.

Reader Throughput

Reader throughput also takes a hit under contention.
Readers without writer contention
This scenario is very realistic because

  • Mostly, writer threads populate the data at application start-up time
  • Or data is updated incrementally based on some write event

90%+ of the time the application only reads the data, and it is the performance of the reader threads that matters most at the end of the day. Let's have a look at the performance when only readers are active.




  
In this test the data in the HashMap was pre-populated and only reader threads were executing.
The numbers now look real :-).

It takes more time to read the data as the number of readers increases; the same thing is observed for throughput, it also drops.

Alternate Reader/Writer Lock
Let's compare the alternate implementation of the RW lock.

Readers without writer contention

Fast Lock outperforms the Java ReadWriteLock in both tests.

In the first test, the time taken to read is almost flat with the Fast RW Lock, while for the Java-based lock it goes up like a stock price as we keep adding more reader threads.

In the throughput test the Java RW lock starts at 28 million and drops to 1.6 million, which is a 95% drop.
The fast RW lock starts at 58 million and drops to 34 million, which is around a 40% drop.

Under heavy contention (i.e. 4 readers), the Java reader-writer lock is around 20X slower; that looks like comparing apples to oranges, but that is what the test results show.

So for readers the new lock looks great; let's measure it with a writer.

 Writer Performance - Java RW vs Fast RW

FastLock performs better than the Java RW lock under contention.

Reader Performance - Java RW vs Fast RW



What makes the FastRW lock fast
Now let's understand why the new read/write lock is fast.

The RW lock that I used for testing is an implementation of a SeqLock. SeqLocks are very common in the Linux world; they provide parallelism for readers & the writer. A reader never blocks the writer, and readers use an optimistic approach to read the data.

Readers in a seqlock never update any state in the read/write lock and thus reduce CPU cache traffic; only load operations are involved in reading the data.
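
To make that concrete, here is a minimal seqlock-style sketch (assuming a single writer; the class name and fields are illustrative, and the VarHandle fence used below is a Java 9+ convenience, newer than this post):

import java.lang.invoke.VarHandle;
import java.util.concurrent.atomic.AtomicLong;

public class SeqLockBox {

    private final AtomicLong seq = new AtomicLong(); // even = stable, odd = write in progress
    private long x, y; // the protected data

    // Writer: bump to odd, mutate, bump back to even. Never waits for readers.
    public void write(long newX, long newY) {
        seq.incrementAndGet(); // odd: tells readers a write is in flight
        x = newX;
        y = newY;
        seq.incrementAndGet(); // even: the data is stable again
    }

    // Reader: optimistic, pure loads, retries if a write overlapped.
    public long[] read() {
        while (true) {
            long s = seq.get();
            if ((s & 1) != 0) {
                continue; // a write is in progress, retry
            }
            long rx = x, ry = y;       // optimistic read
            VarHandle.loadLoadFence(); // keep the data loads before the re-check
            if (seq.get() == s) {
                return new long[]{rx, ry};
            }
        }
    }
}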

So in a nutshell it provides

  • Less CPU cache traffic
  • Readers never block the writer
  • A reader never changes the state of the lock, so no expensive store operation
  • Readers can slow down if the write frequency is very high
  • The writer always gets priority


With so many benefits, I am sure you will never use the Java RW lock!

Conclusion
SeqLocks are a very good alternative to the reader/writer lock, especially in a multicore world where we use new techniques to get the best out of the processor. Someday Java will add support for seqlocks.
Code is available @ GitHub for your reference.

Sunday, 19 May 2013

Lock Less Java Object Pool

It has been a while since I wrote anything; I have been busy with my new job, which involves doing some interesting work in performance tuning. One of the challenges is to reduce object creation in the critical parts of the application.

Garbage collection hiccups have been a main pain point in Java for some time, although Java has improved over time with its GC algorithms. Azul is the market leader in developing pauseless GC, but the Azul JVM is not free as in speech!

Creating too many temporary/garbage objects doesn't work too well, because it creates work for the GC and that has an effect on latency; too much garbage also doesn't work well with multi core systems because it causes cache pollution.

So how should we fix this ?

Garbage-less coding
This is only possible if you know how many objects you need upfront and pre-allocate them, but in reality that is very difficult to know; and in case you still manage to do it, then you have to worry about other issues:

  • You might not have enough memory to hold all the objects you need
  • You have to handle concurrency
So what is the solution to the above problems?
The Object Pool design pattern can address both of the above issues; it lets you specify the number of objects that you need in the pool and handles concurrent requests to serve objects.

Object pools have been the basis of many applications that have low latency requirements; a flavor of the object pool is the Flyweight design pattern.

Both of the above patterns help us avoid object creation; that is great, so now GC work is reduced and in theory our application performance should improve, but in practice it doesn't happen that way, because the object pool/flyweight has to handle concurrency, and whatever advantage you gain by avoiding object creation is lost to concurrency issues.

What are the most common ways to handle concurrency?
An object pool is a typical producer/consumer problem, and it can be solved by using the following techniques:

Synchronized - This was the only way to handle concurrency before JDK 1.5; Apache has written a wonderful object pool API based on synchronized.

Locks - Java has had excellent support for concurrent programming since JDK 1.5; there has been some work using locks to develop object pools, e.g. furious-objectpool.

Lock Free - I could not find any implementation built using a fully lock free technique, but furious-objectpool uses a mix of ArrayBlockingQueue & ConcurrentLinkedQueue.
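
For reference, a minimal ArrayBlockingQueue-based pool, the kind of lock-based baseline this test compares against (the class name is illustrative):

import java.util.concurrent.ArrayBlockingQueue;
import java.util.function.Supplier;

public class BlockingPool<T> {

    private final ArrayBlockingQueue<T> queue;

    public BlockingPool(int size, Supplier<T> factory) {
        queue = new ArrayBlockingQueue<>(size);
        for (int i = 0; i < size; i++) {
            queue.add(factory.get()); // pre-allocate every object up front
        }
    }

    public T acquire() throws InterruptedException {
        return queue.take(); // blocks while the pool is empty
    }

    public void release(T item) {
        queue.offer(item); // hand the object back
    }
}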

Let's measure performance
In this test I created a pool of 1 million objects, and those objects are accessed via the different pool implementations; objects are taken from the pool and returned back to the pool.

This test starts with 1 thread, and then the number of threads is increased to measure how the different pool implementations perform under contention.


X Axis - No Of Threads
Y Axis - Time in Ms - Lower time is better

This test includes the pool from Apache, Furious Pool, and an ArrayBlockingQueue-based pool.

The Apache one is the worst, and as the number of threads increases performance degrades further; the reason is that the Apache pool is based on heavy use of "synchronized".

The other two, Furious and the ArrayBlockingQueue-based pool, perform better, but both of them also slow down as contention increases.

The ArrayBlockingQueue-based pool takes around 1000 ms for 1 million items when 12 threads are trying to access the pool; Furious pool, which internally uses ArrayBlockingQueue, takes around 1975 ms for the same thing.

I have to do some more detailed investigation to find out why Furious takes double the time even though it is also based on ArrayBlockingQueue.

The performance of ArrayBlockingQueue is decent, but it is a lock based approach; what kind of performance could we get if we implemented a lock free pool?

Lock free pool
Implementing a lock free pool is not impossible, but it is a bit difficult because you have to handle multiple producers & consumers.

I will implement a hybrid pool which uses a lock on the producer side & a non blocking technique on the consumer side.

Let's have a look at some numbers.

I performed the same test with the new implementation (FastPool), and it is almost 30% faster than the ArrayBlockingQueue-based pool.

30% improvement is not bad; it can definitely help in meeting latency goals.

What makes the Fast Pool fast!
I used a couple of techniques to make it fast; a simplified sketch follows below.

  • Producers are lock based - Multiple producers are managed using locks; this is the same as ArrayBlockingQueue, so nothing special about this.

  • Immediate publication of released items - An element is published before the lock is released, using a cheap memory barrier. This gives some gain.

  • Consumers are non blocking - CAS is used to achieve this; consumers are never blocked by producers. ArrayBlockingQueue blocks consumers because it uses the same lock for producers & consumers.

  • Thread local to maintain value locality - A thread local is used to find the last value that was used; this reduces contention to a great extent.

If you are interested in having a look at the code, it is available @ FastObjectPool.java.
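
A simplified sketch of the hybrid idea (lock-based release, CAS-based acquire; the class name and slot layout are illustrative, not the actual FastObjectPool):

import java.util.concurrent.atomic.AtomicReferenceArray;
import java.util.concurrent.locks.ReentrantLock;

public class SimpleHybridPool<T> {

    private final AtomicReferenceArray<T> slots;
    private final ReentrantLock producerLock = new ReentrantLock();
    private int putIndex; // guarded by producerLock

    public SimpleHybridPool(int capacity) {
        slots = new AtomicReferenceArray<>(capacity);
    }

    // Producer side: lock based, like ArrayBlockingQueue's writer.
    public boolean release(T item) {
        producerLock.lock();
        try {
            for (int i = 0; i < slots.length(); i++) {
                int idx = (putIndex + i) % slots.length();
                if (slots.get(idx) == null) {
                    slots.set(idx, item); // a volatile write publishes the item immediately
                    putIndex = idx + 1;
                    return true;
                }
            }
            return false; // the pool is full
        } finally {
            producerLock.unlock();
        }
    }

    // Consumer side: non blocking; claims a slot with CAS, never waits on the producer lock.
    public T acquire() {
        for (int i = 0; i < slots.length(); i++) {
            T item = slots.get(i);
            if (item != null && slots.compareAndSet(i, item, null)) {
                return item;
            }
        }
        return null; // the pool is empty; the caller decides whether to retry
    }
}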

Monday, 18 February 2013

Experiment with ConcurrentHashmap

I am investigating a memory issue in one of my recent projects where data is kept in memory for fast access, but the memory footprint of the application is very high.
This application was heavily using CHM (i.e. ConcurrentHashMap), so it was a no-brainer guess that CHM was the issue. I did a memory profiling session to find how much CHM was really taking.
I was surprised by the result: CHM was taking around 50% of the memory. So it was confirmed that CHM was the issue; it is fast, but not memory efficient.

Why is CHM so fat?

  • Keys/values are wrapped in a Map.Entry object; this creates an extra object per entry.
  • Each segment is a ReentrantLock, so if you have a lot of small CHMs and the concurrency level is the default, then there will be a lot of lock objects, and they will show up near the top of the list.
  • And many more states for housekeeping activities.

All of the above objects make a very good contribution to memory.

How can we reduce the memory footprint?
It is difficult to reduce the memory footprint of CHM, and the possible reasons that I can think of are

  • It has to support the old Map interface.
  • The hash code collision technique used by Java maps is closed addressing (i.e. separate chaining), which resolves collisions by building a linked list per bucket. Chaining is very fast for resolution, but it is not CPU cache friendly, especially if the nodes turn into a big linked list. There is an interesting article about the linked list problem.

So we need an alternate CHM implementation which is memory efficient.

Version 2 of CHM

I started to create a version of CHM which has a low memory footprint; the target is to come as close as possible to an array. I also used alternate hash code collision techniques to check the performance; there are many options for open addressing.
I tried the options below.

  • Linear probing - The performance was not that great, although this is the most CPU cache friendly option. I need to spend some more time to get it right.
  • Double hashing - The performance was in an acceptable range; a sketch of the probe loop follows below.
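
A minimal sketch of a double-hashing probe loop over flat arrays (single-threaded and illustrative, not LCHM's actual code, which also has to handle concurrency):

public class DoubleHashProbe {

    private final Object[] keys;
    private final Object[] values;

    public DoubleHashProbe(int capacity) { // a prime capacity keeps the probe cycle full
        keys = new Object[capacity];
        values = new Object[capacity];
    }

    private int hash1(Object key) {
        return (key.hashCode() & 0x7fffffff) % keys.length;
    }

    private int hash2(Object key) {
        // the second hash is never zero, so every probe makes progress
        return 1 + (key.hashCode() & 0x7fffffff) % (keys.length - 1);
    }

    public void put(Object key, Object value) {
        int idx = hash1(key), step = hash2(key);
        for (int i = 0; i < keys.length; i++) {
            if (keys[idx] == null || keys[idx].equals(key)) {
                keys[idx] = key;
                values[idx] = value;
                return;
            }
            idx = (idx + step) % keys.length; // jump by the second hash, not by 1
        }
        throw new IllegalStateException("table full");
    }

    public Object get(Object key) {
        int idx = hash1(key), step = hash2(key);
        for (int i = 0; i < keys.length; i++) {
            if (keys[idx] == null) return null; // empty slot: the key is absent
            if (keys[idx].equals(key)) return values[idx];
            idx = (idx + step) % keys.length;
        }
        return null;
    }
}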

Let's measure CHM V2

  • Memory footprint

There is a big gain in terms of memory: CHM takes around 45%+ more than the raw data, while the new implementation, LCHM, is very close to the array type.

  • Single thread PUT performance

CHM outperforms in the PUT test; the new implementation is around 50 to 80 milliseconds slower for 1 million items. 50 to 80 ms is not a noticeable delay, and I think it is fine for applications where the latency requirement is in seconds; if the latency requirement is in milliseconds/nanoseconds, then CHM will not be a good choice anyway.
The reason for the slower performance of LCHM is the hash collision technique: double hashing is used for resolving hash code collisions.

  • Concurrent add performance

The new implementation performs slightly better when multiple threads are used to write to the map.
  • Get performance

The performance of get is slightly slower compared to CHM.

Conclusion

The new implementation outperforms in the memory test and is a bit slow in the get/put tests; there are a couple of things that can be done to improve the performance of get/put, and all the performance difference that we see is due to the probing technique used.
  • The probing technique can be improved; linear probing can be used to get cache friendly access.
  • It is very easy to parallelize the probing technique.