Are you ready: Concurrent Counter With No False Sharing

Saturday, 14 September 2013

Concurrent Counter With No False Sharing

This blog is continuation of Scalable Counter post.

One of the reader shared result from his system. He ran test on XEON intel processor with 16 core and total time taken for each type of counter is almost same, although Atomic counter has CAS failure & other type does't has any CAS failure but it does't make any difference in execution time.
Very strange result , needs further investigation.

Another reader pointed out that it could be due to false sharing, so it worth to take that into account and i created another class that handles FALSE SHARING

Time take by different counter

Y Axis - Time taken to increment 1 Million times

X Axis - Number of threads

PaddedAtomicCounter is new type of counter that i have added to test & it it outperforms all other counters.

It is using cache line padding to avoid false sharing.

Cacheline is of 64 byte on most of the today processor. PaddedCounter it Integer based counter so it is adding 16 slot per counter, by using this techniques we avoid cache pollution & as result we see 16X times gain as compared to AtomicCounter, without padding the gain was 5X and with cache line padding gain jumps to 16X.

With cacheline padding you need some extra space, so it is trade off of memory for speed, you can choose what you want!

CAS failure rate

Lets look at the CAS failure for different counter & what it means for performance.

Y Axis - CAS Failure in 100Ks

X Axis - Number of threads

PaddedAtomic has some CAS failure as compared to other counters , but it does't make any difference in execution time of the counter.

CAS failure is not the only factory that can determined execution time, false sharing make significant contribution to it, so this gives good explanation of behavior seen in XEON processor

Conclusion

To get better performance you have to take care of few things

- Contention - There are many techniques to avoid it , this blogs shows one of them

- False Sharing - You have to avoid false sharing to get best out of processor, padding is required for that. Some of the JDK classes are using padding are ThreadLocalRandom , now we have @Contended annotation from java to achive same thing, it is being used in ForkAndJoinPool

Code is available @ github

12 comments:

John12 February 2014 at 09:19
Thank you your article and code - It is very useful.

I have an observation when running CounterTest.java from your git url.

I am running on a mac with the following details:

$ system_profiler SPHardwareDataType
Hardware:

Hardware Overview:

Model Name: MacBook Pro
Model Identifier: MacBookPro10,1
Processor Name: Intel Core i7
Processor Speed: 2.3 GHz
Number of Processors: 1
Total Number of Cores: 4
L2 Cache (per Core): 256 KB
L3 Cache: 6 MB
Memory: 16 GB

I tried running CounterTest from your github code with 8 threads and got output:

[AtomicCounter] No Of Thread: 8 , Min:1034 ms , Max: 1104 ms , Avg: 1080 ms, Fail CAS: 27704531
[CoreBasedCounter] No Of Thread: 8 , Min:183 ms , Max: 213 ms , Avg: 206 ms, Fail CAS: 0
[CoreBasedExtraSlotCounter] No Of Thread: 8 , Min:263 ms , Max: 273 ms , Avg: 268 ms, Fail CAS: 0
[ThreadBasedCounter] No Of Thread: 8 , Min:229 ms , Max: 268 ms , Avg: 257 ms, Fail CAS: 0
[PaddedAtomicCounter] No Of Thread: 8 , Min:71 ms , Max: 94 ms , Avg: 84 ms, Fail CAS: 0

I then tried with 9 threads and got:

[AtomicCounter] No Of Thread: 9 , Min:1219 ms , Max: 1328 ms , Avg: 1298 ms, Fail CAS: 29879460
[CoreBasedCounter] No Of Thread: 9 , Min:233 ms , Max: 263 ms , Avg: 249 ms, Fail CAS: 722602
[CoreBasedExtraSlotCounter] No Of Thread: 9 , Min:298 ms , Max: 315 ms , Avg: 307 ms, Fail CAS: 479829
[ThreadBasedCounter] No Of Thread: 9 , Min:188 ms , Max: 214 ms , Avg: 201 ms, Fail CAS: 0
[PaddedAtomicCounter] No Of Thread: 9 , Min:27 ms , Max: 86 ms , Avg: 43 ms, Fail CAS: 948986

I think the core I7 is hyper threaded so the 4 cores behaves similar to 8 cores.

What is a bit surprising to me is that PaddedAtomicCounter performs better when it has 9 threads instead of 8. The average is less even though there is more variance between Min and Max. Note that when I specified 9 threads (ie one more than the 8 hyperthreads of the 4 cores) that I got a lot of Failed CAS (948986) but the average was still better than 8 threads. When I specified 8 threads I got 0 failed CAS for PaddedAtomicCounter.
ReplyDelete
Replies
Ashkrit13 February 2014 at 07:50
Thanks for sharing performance result from Mac. I never tested this on Mac.
Are you getting same types of result for multiple runs or stats are all over place ?

CAS failure indicator is little bit miss leading , if you see cas failure it does't not mean that it is slow, i noticed this on XEON processor where CAS failure was high for AtomicLong but performance was decent.

ReplyDelete
Replies
John17 February 2014 at 02:27
Please correct me if I am wrong:

The way I understand the PaddedAtomicCounter is that a cache line for level 3 is 64 bits long so the counter we are updating atomically should also be 64 bits long to stop the cache line being shared between threads running on diffierent cores. Level 3 cache is shared across cores while level 1 and level 2 are just for the core they are next to. If the counter was less than 64 bits then it could mean that more than one cores could be using the same cache line. Sharing the cache line between cores is undesirable for counters as it results in false sharing. The way I understand false sharing is that the ordering of instructions is not guaranteed unless certain write barrier instructions are used. These write barrier instructions include instructions which mark a cache with a flag so that its contents can be written to different locations including buffers, other cache lines or main memory. It is these write barrier instructions which ensure that threads running on different cores can read the same most up to date value without any reordering of instructions occurring in the critical synchronization areas. I think MOB buffers can ensure that ordering is maintained in critical synchronization areas. If a cache line is shared between more than one thread (ie false sharing) then in some way I don't understand there is a performance hit as in some way contexts have to be unloaded and loaded. By padding out the counter it somehow prevents the cache line having to be loaded and unloaded between threads. I think something like that is the mechanism but I really don't know and would like if someone can correct my understanding. Note my understanding in this comment is very likely flawed so please wait for a comment from someone with authority like Ashkrit or Martin Thompson before drawing any conclusions.
ReplyDelete
Replies
Ashkrit17 February 2014 at 04:22
That is long description , you are right. I would like to add some more info to it.

Cache line size for each of level(i.e all L1/L2/L3) is same, for all modern processor it is 64 bytes.
CPU brings data in unit of cache line, so if you request 1 byte, you will still get 64 byte of data. These 65byte of data is near by data, cpu brings up near by data thinking that your process might need it. Now if 2 variable that are updated very frequently are part of same cache line, it will result in cache line invalidation.
To avoid such situation you pad your variable.

Below is description of "False Sharing" from intel manual .
False sharing occurs when threads on different processors modify variables that reside on the same cache line. This invalidates the cache line and forces an update, which hurts performance

You can also go through below link to get more details about the types of issue you get with cache line sharing.

http://software.intel.com/en-us/articles/avoiding-and-identifying-false-sharing-among-threads
ReplyDelete
Replies
John18 February 2014 at 05:19
The following is how I understand the PaddedAtomicInteger from Ashkrit, although I am not sure if I understand correctly:

I read that the object size requirements on a 64 bit operating system are more than a 32 bit operating system and are as follows:

16 bytes for an object header
24 bytes set up cost for an int array + 4*n where n is the length of an array and 4 is how many bytes an int requires.

The initialization code of PaddedAtomicInteger looks like:

private int padding = 16;
private int slots = Runtime.getRuntime().availableProcessors();

private AtomicIntegerArray atomicIntegerArray =
new AtomicIntegerArray( slots * padding );

The maths is different depending on whether a core i5 or core i7 is used. For Core i5 it looks like:

The size of the AtomicIntegerArray in Ashkrit's PaddedAtomicInteger is initialized with:

16 x Runtime.getRuntime().availableProcessors()

On a core i5 the available processes will return 4, so the array length is 16 x 4 = 64

I think the size requirements on 64 bit x86 architecture is:

16 bytes object header
24 bytes set up cost of int array + 4*n where n is length of array = 24 + 64*4 = 280

The total of this is 16 + 280 = 296 bytes

296 bytes is more than the 64 bytes size of a level 3 cache line on x86 architecture.

However, according to http://mechanical-sympathy.blogspot.co.uk/2011/08/false-sharing-java-7.html Martin Thompson has made the following comment:

"It does not matter if too much padding it used. It is important that we use enough."

Does this sound correct?
ReplyDelete
Replies
John18 February 2014 at 05:20
Trying to ensure what ends up going in the level 3 cache is in no way guaranteed. I get this impression from reading: http://mechanical-sympathy.blogspot.co.uk/2011/07/false-sharing.html. The article says:

"However experience shows that objects allocated at the same time tend to be co-located."

The word "tend" suggests that it is up to the implementation of the JVM and hardware how stuff gets done. In the case of an array which is bigger than a cache line size I dont know what happens? Maybe this all depends on the JVM and hardware? Who knows?

Looking at Ashkrit's PaddedAtomicInteger code it seems that each thread updates a seperate index of the array which is determined by taking the thread id and doing a mod using the array size. Thread has a static counter where each time you create a new thread a counter is incremented. As the number of threads the JVM is creating increases we will see that more than one thread could update the same index of the array. However this is not a problem as the while loop using the CAS operation will loop until it suceeds. Only when there are many threads and there is heavy contention will CAS be less efficient than using wait and notify.

The get implementation for Ashkrit's PaddedAtomicInteger loops through the array and adds all the values in the array. I suppose that getting threads to use different indexes of the array and having a method that adds the values of all slots up together is a clever trick to stop the JVM or JIT compiler doing optimisations which removes padding from the cache lines. Does this assumption seem correct?

As a result of the adding feature, it is likely that the cache line will be populated with data from adjacent indexes of the array. This ensures that threads running on different cores will not populate the same cache line.

When the same cache line is populated by another thread I think that this causes the cache to be marked with a flag to indicate that the cache is invalid. I think the idea is that the Buffers and/or Cache are the master and that when the cache is marked invalid that this means that the data has to be writen out to other areas which are used by other threads running on other cores. These other areas are places like main memory. This invalidation process ensures that all threads running on different cores see a consistent view of the world. I guess that invalidation might also mark Level 1 caches on other cores to be invalid so that they load up more "up to date" data. I dont know if this guess is correct?

This invalidation process effectively becomes a bottle neck and the bottle neck is the opportunity cost of lost computer cycles - i.e. in the time wasted doing the invalidation, the core could have been using its cycles to do something useful. False Sharing therefore increases latency for a method to complete. I got this understanding from:

http://mechanical-sympathy.blogspot.co.uk/2011/07/false-sharing.html

In the above link Martin wrote:

"A thread running on core 1 wants to update variable X while a thread on core 2 wants to update variable Y. Unfortunately these two hot variables reside in the same cache line. Each thread will race for ownership of the cache line so they can update it. If core 1 gets ownership then the cache sub-system will need to invalidate the corresponding cache line for core 2. When Core 2 gets ownership and performs its update, then core 1 will be told to invalidate its copy of the cache line. This will ping pong back and forth via the L3 cache greatly impacting performance."
ReplyDelete
Replies
John18 February 2014 at 05:21
Ashkrit's PaddedAtomicInteger is using an AtomicIntegerArray internally and the AtomicIntegerArray is using an array internally. AtomicIntegerArray is using Unsafe.compareAndSwapInt(..) to update the data in the array. It seems that one of Disruptor's counters (from lmax) is using Unsafe.putOrderedLong(..) instead (see com.lmax.disruptor.Sequence). The way I understand putOrderedLong(.,) is that it is used where you might want to use volatile but it is infact 3 times faster than a volatile which is why it is used. It generally becomes visible to other threads within very few nanoseconds but this is not 100% guaranteed. However Unsafe.compareAndSwapInt(..) of AtomicIntegerArray is supposed to be a bit slower but has the advantage of being useful where volatile semantics are not enough. Is this correct?
ReplyDelete
Replies
Ashkrit21 February 2014 at 06:34
This comment has been removed by the author.
ReplyDelete
Replies

Add comment