
Monday, 24 July 2017

Java Intrinsic magic

I got a question on the atomicinteger-java-7-vs-java-8 post about the implementation of the Unsafe.getAndAddInt function:



It looks like JDK 8 is still using the CAS version.
If you just look at the code in the IDE then that is correct, but the JVM's HotSpot compiler is more intelligent than that.

Code goes through various optimizations before it is executed; the list of optimizations is huge, and it would take a whole blog series just to touch on it.

In this blog I will discuss intrinsic functions.

As per Wikipedia:


In compiler theory, an intrinsic function is a function available for use in a given programming language whose implementation is handled specially by the compiler. Typically, it substitutes a sequence of automatically generated instructions for the original function call, similar to an inline function. Unlike an inline function though, the compiler has an intimate knowledge of the intrinsic function and can therefore better integrate it and optimize it for the situation. This is also called builtin function in many languages.
Compilers that implement intrinsic functions generally enable them only when the user has requested optimization, falling back to a default implementation provided by the language runtime environment otherwise.


So in other words, some functions are treated like native functions, but they are different from the native functions we write ourselves.


The compiler decides to use the native/special implementation if it is available, otherwise it falls back to the default implementation.

Unsafe.getAndAddInt is a special function that gets the advantage of this optimization, but the normal Java code is
still required for the case when the optimization is not requested or not available.

To verify this we have to look at the generated assembly code.

The how-to-print-dissasembly-from-jit-code article has all the details you need for this.

You can also download the required dll/lib for Windows/Linux from GitHub.

For the code snippet below, the optimization kicks in.
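The original snippet is embedded as a gist and is not reproduced here; a minimal sketch of the kind of code that triggers the intrinsic (class name and loop count are my own, not the exact code from the post) would be:

import java.util.concurrent.atomic.AtomicInteger;

public class IntrinsicDemo {

    private static final AtomicInteger counter = new AtomicInteger();

    public static void main(String[] args) {
        // Hot loop: once HotSpot compiles it, the unsafe.getAndAddInt call
        // inside incrementAndGet is replaced by a single lock xadd instruction.
        for (int i = 0; i < 10_000_000; i++) {
            counter.incrementAndGet();
        }
        System.out.println(counter.get());
    }
}

Run it with -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly (with the hsdis library on the path, as described in the article above) to see the generated code.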

The assembly code for the above snippet looks like the listing below (look at line number 7: it is xadd, not a CAS loop).



The JVM maintains the list of all intrinsic functions in the code @ vmSymbols.hpp.

The HotSpot compiler is full of magic :-)

Friday, 14 March 2014

Off Heap concurrent counter

Concurrent counters are part of almost every system; they are used to collect metrics, coordinate threads, etc.
Java has good support for heap-based counters.

There are use cases where you need a counter that can be shared between processes.

How to build inter-process counters
 - Database
This is the first option that comes to mind: a database sequence is a counter that can be used by multiple processes, and all concurrency is handled by the database. It is a good option to start with, but we know the kinds of overhead (network, locks, etc.) you get with a database. Only Larry Ellison will be happy about it, not you!
 - Some server
You could develop a server/middleware that provides this type of service. This option still has network latency and marshal/unmarshal overhead.

 - Memory mapped file
You could use a memory mapped file to do this. I got the idea from the thread-safe-interprocess-shared-memory-in-java presentation by Peter Lawrey.

Challenges involved in a multi-process counter
- Data visibility
    Changes done by one process should be visible to all the processes. This can be solved by using a memory mapped file; the operating system guarantees it, and the Java memory model makes it possible.

- Thread safety
A counter is all about multiple writers, so thread safety becomes a big problem. Compare-and-swap is one option for handling multiple writers.
Is it possible to use CAS for an off-heap operation? Yes, it is: welcome to Unsafe.
By combining a memory mapped file and Unsafe it is possible to perform CAS on off-heap memory.

In this blog I will share my experiment of using CAS on a memory mapped file.

How?
- How to get the memory address
MappedByteBuffer is backed by a DirectByteBuffer, which is off-heap memory, so it is possible to get the virtual address of that memory and use Unsafe to perform CAS operations on it. Let's look at the code.


The above code creates a memory mapped file of 8 bytes and gets its virtual address. This address can be used to read/write the content of the memory mapped file.
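The gist with the actual code is not reproduced here; a minimal sketch of that step, assuming the usual trick of casting the buffer to sun.nio.ch.DirectBuffer to obtain its native address (class and field names are illustrative), looks like this:

import java.io.RandomAccessFile;
import java.lang.reflect.Field;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

import sun.misc.Unsafe;
import sun.nio.ch.DirectBuffer;

public class OffHeapCounter {

    private static final Unsafe UNSAFE = loadUnsafe();

    private final long address;

    public OffHeapCounter(String file) throws Exception {
        // Map an 8 byte region of the file; the mapping is shared across processes.
        RandomAccessFile raf = new RandomAccessFile(file, "rw");
        MappedByteBuffer buffer = raf.getChannel()
                .map(FileChannel.MapMode.READ_WRITE, 0, 8);
        // DirectBuffer exposes the virtual address of the mapped memory.
        this.address = ((DirectBuffer) buffer).address();
    }

    private static Unsafe loadUnsafe() {
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            return (Unsafe) f.get(null);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}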

- How to read/write in a thread safe manner

The important functions to look at are readVolatile and increment:
readVolatile reads directly from memory, and increment uses Unsafe to perform a CAS on the address obtained from the MappedByteBuffer.
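Again, the original gist is not shown here; a sketch of those two methods, continuing the OffHeapCounter class sketched above (the method names match the post, the bodies are my assumption):

// Volatile read straight from the mapped memory.
public int readVolatile() {
    return UNSAFE.getIntVolatile(null, address);
}

// Lock-free increment: a CAS retry loop on the mapped address.
public int increment() {
    for (;;) {
        int current = UNSAFE.getIntVolatile(null, address);
        if (UNSAFE.compareAndSwapInt(null, address, current, current + 1)) {
            return current + 1;
        }
    }
}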

- Performance
Some performance numbers from my system. Each thread increments the counter 1 million times.

Performance of the counter is decent; as the number of threads increases, CAS failures start to happen and performance starts to degrade.
Performance of this counter can be improved by having multiple segments to reduce write contention.
I will write about that in the next blog.

Conclusion
 - Memory mapped files are very powerful; they can be used to develop a lot of things like off-heap collections, IPC, off-heap thread coordination, etc.
 - Memory mapped files open the gates for GC-less programming.

All the code used in this blog is available on github.

Monday, 17 February 2014

AtomicInteger Java 7 vs Java 8

AtomicInteger is an interesting class; it is used for building many lock-free algorithms. In fact, JDK locks are also built using ideas from the atomic datatypes.

As the name suggests, it is used for doing atomic increments/decrements, so you don't have to use locks; it uses processor-level instructions to do so.
It is based on the compare-and-swap (CAS) instruction.

Issue with CAS
CAS takes an optimistic approach: it expects some failures and retries the operation, so in theory, if there is no contention, it should work pretty fast.

There is an alternate way of doing the same thing: fetch-and-add.
Fetch-and-add is very different from CAS; it is not based on retry loops.

Dave Dice compares CAS vs fetch-and-add in his atomic_fetch_and_add_vs blog; I can't explain it better than this, so I will copy the content from his blog:


  1. CAS is "optimistic" and admits failure, whereas XADD does not. With XADD there's no explicit window of vulnerability to remote interference, and thus no need for a retry loop. Arguably, XADD has better progress properties, assuming the underlying XADD implementation doesn't have an implicit loop, but even in that case the window would be narrower than with Load;Φ;CAS.
  2. Lets say you were trying to increment a variable with the usual Load;INC;CAS loop. When the CAS starts failing with sufficient frequency you can find that the branch to exit the loop (normally taken under no or light contention) starts to predict toward the failure path. So when the CAS ultimately succeeds, you'll incur a branch mispredict, which can be quite painful on processors with deep pipelines and lots of out-of-order speculative machinery. Typically, this is in a piece of code where you don't want a long stall. There's no loop and no such issues with XADD.
Since fetch-and-add has predictable progress properties, it is used for developing wait-free algorithms.
Unfortunately JDK 7 does not have support for fetch-and-add, one more reason for C++ people to be happy that C++ is great!


As we all know, things do change, and the Java community decided to add support for fetch-and-add in JDK 8, one more good reason to migrate to JDK 8.

In this blog I will compare the performance of AtomicInteger from JDK 7 and JDK 8.

Atomic Integer - JDK 7 vs JDK 8

In this test I increment the counter 10 million times with different numbers of threads. The thread count is increased to see how the counter performs under contention.



X Axis - No of Threads
Y Axis - Ops/Second - Higher is better

The JDK 8 counter is the winner in this case. The best performance is when there is no contention: for JDK 7 it is 80 MOPS, but for JDK 8 it is close to 130 MOPS.
For a single thread the difference is not dramatic, JDK 8 is around 60% faster, but as contention increases the JDK 7 counter's performance starts falling.

Here is another graph with the single-thread number removed, so we can clearly see how these counters perform under contention.


This gives a better idea of how slow the JDK 7 AtomicInteger is: for 8 threads the JDK 8 counter is around 3.5X faster.

Dive Into Code

JDK 8 - AtomicInteger
public final int getAndIncrement() {
    return unsafe.getAndAddInt(this, valueOffset, 1);
}

JDK7 - AtomicInteger
public final int getAndIncrement() {
    for (;;) {
        int current = get();
        int next = current + 1;
        if (compareAndSet(current, next))
            return current;
    }
}

JDK 8 is using a new function (getAndAddInt) from Unsafe to do the magic. Unsafe has become even more useful!
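For reference, getAndAddInt in the JDK 8 sources is itself written as a CAS retry loop (paraphrased below); HotSpot treats it as an intrinsic and turns the call into a single lock xadd on x86, as shown in the assembly further down.

// Paraphrased from JDK 8 sun.misc.Unsafe
public final int getAndAddInt(Object o, long offset, int delta) {
    int v;
    do {
        v = getIntVolatile(o, offset);
    } while (!compareAndSwapInt(o, offset, v, v + delta));
    return v;
}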

Dive Into Assembly
Just to confirm that all the performance gain is coming from fetch-and-add, I had a look at the generated assembly.

JDK 8 
0x0000000002cf49c7: mov    %rbp,0x10(%rsp)
  0x0000000002cf49cc: mov    $0x1,%eax
  0x0000000002cf49d1: lock xadd %eax,0xc(%rdx)  ;*invokevirtual getAndAddInt
                                                ; - java.util.concurrent.atomic.AtomicInteger::incrementAndGet@8 (line 186)


JDK 7

0x0000000002c207f5: lock cmpxchg %r8d,0xc(%rdx)
  0x0000000002c207fb: sete   %r11b
  0x0000000002c207ff: movzbl %r11b,%r11d        ;*invokevirtual compareAndSwapInt
                                                ; - java.util.concurrent.atomic.AtomicInteger::compareAndSet@9 (line 135)
                                                ; - java.util.concurrent.atomic.AtomicInteger::incrementAndGet@12 (line 206)

  
Conclusion
The introduction of fetch-and-add type features in Java will make it more suitable for high performance computing; we will see more wait-free algorithms in Java.

Code used for testing is available @ AtomicCounterTest.
Just compile it for JDK 7/8 and execute it.


Saturday, 14 September 2013

Concurrent Counter With No False Sharing

This blog is a continuation of the Scalable Counter post.

One of the readers shared results from his system. He ran the test on an Intel Xeon processor with 16 cores, and the total time taken for each type of counter was almost the same; although the atomic counter had CAS failures and the other types didn't, it made no difference in execution time.
A very strange result that needed further investigation.

Another reader pointed out that it could be due to false sharing, so it was worth taking that into account, and I created another class that handles false sharing.

Time taken by different counters
Y Axis - Time taken to increment 1 million times

X Axis - Number of threads

PaddedAtomicCounter is a new type of counter that I have added to the test, and it outperforms all the other counters.
It uses cache line padding to avoid false sharing.
A cache line is 64 bytes on most of today's processors. PaddedCounter is an int-based counter, so it adds 16 slots (16 x 4 bytes = one cache line) per counter. By using this technique we avoid cache pollution, and as a result we see a 16X gain compared to AtomicCounter; without padding the gain was 5X, and with cache line padding the gain jumps to 16X.
With cache line padding you need some extra space, so it is a trade-off of memory for speed, you can choose what you want!
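A minimal sketch of the padding idea (illustrative, not the exact PaddedAtomicCounter from the repo): space the int slots 16 apart inside an AtomicIntegerArray so each writer owns a full 64 byte cache line.

import java.util.concurrent.atomic.AtomicIntegerArray;

public class PaddedCounterSketch {

    private static final int PADDING = 16;   // 16 ints = 64 bytes = one cache line
    private final AtomicIntegerArray slots;
    private final int mask;

    // concurrency should be a power of two so a mask can replace the modulo
    public PaddedCounterSketch(int concurrency) {
        this.slots = new AtomicIntegerArray(concurrency * PADDING);
        this.mask = concurrency - 1;
    }

    public void increment() {
        // Each thread hits its own slot, and the padding keeps the slots on
        // different cache lines, so writers do not invalidate each other's lines.
        int index = ((int) Thread.currentThread().getId() & mask) * PADDING;
        slots.incrementAndGet(index);
    }

    public long get() {
        long sum = 0;
        for (int i = 0; i <= mask; i++) {
            sum += slots.get(i * PADDING);
        }
        return sum;
    }
}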

CAS failure rate
Let's look at the CAS failures for the different counters and what they mean for performance.


Y Axis - CAS Failure in 100Ks

X Axis - Number of threads

PaddedAtomic has some CAS failures compared to the other counters, but that does not make any difference in the execution time of the counter.
CAS failures are not the only factor that determines execution time; false sharing makes a significant contribution to it, so this gives a good explanation of the behavior seen on the Xeon processor.

Conclusion
To get better performance you have to take care of a few things:

 - Contention - There are many techniques to avoid it; this blog shows one of them.

 - False sharing - You have to avoid false sharing to get the best out of the processor, and padding is required for that. Some JDK classes, such as ThreadLocalRandom, use padding; now we also have the @Contended annotation in Java to achieve the same thing, and it is used in ForkJoinPool.

Code is available @ github

   

Tuesday, 10 September 2013

Scalable Counters For Multi Core

Counters are required everywhere, e.g. to track key KPIs of an application, load on the application, total number of requests served, throughput, and many more.

With all these requirements, the complexity of concurrency is also added, and that makes this problem interesting.

How to implement a concurrent counter

 - Synchronized - This was the only option before JDK 1.5; since we are now waiting for the JDK 8 release, this is definitely not an option.

 - Lock based - You should never attempt this for a counter, it will perform very badly.

 - Wait free - Java doesn't have support for fetch-and-add, so it is a bit difficult to implement.

 - Lock free - With very good support for compare-and-swap, this looks like a good option to use.

How does a compare-and-swap based counter perform?

I used AtomicInteger for this test; the counter is incremented 1 million times by each thread, and to increase contention the number of threads is increased gradually.
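The actual harness is in the GitHub repo; a stripped-down, JDK 7 compatible sketch of the measurement (structure assumed, not the exact test code) looks like this:

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

public class AtomicCounterBench {

    private static final int INCREMENTS = 1000000;

    public static void main(String[] args) throws InterruptedException {
        final int threads = Integer.parseInt(args[0]);
        final AtomicInteger counter = new AtomicInteger();
        final CountDownLatch done = new CountDownLatch(threads);

        long start = System.nanoTime();
        for (int t = 0; t < threads; t++) {
            new Thread(new Runnable() {
                public void run() {
                    // Each thread performs 1 million CAS-based increments.
                    for (int i = 0; i < INCREMENTS; i++) {
                        counter.incrementAndGet();
                    }
                    done.countDown();
                }
            }).start();
        }
        done.await();
        long elapsedMs = (System.nanoTime() - start) / 1000000;
        System.out.println(threads + " threads took " + elapsedMs + " ms");
    }
}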

Test Machine Details
OS : Windows 8
JDK : 1.7.0.25
CPU : Intel i7-3632QM , 8 Core
RAM : 8 GB

Y Axis - Time taken to increment 1 Million times

X Axis - Number of threads

As the number of threads increases, the time taken to increment the counter increases, and that is due to contention.
For a CAS based counter, it is CAS failures that cause the slowdown.

Is this the best performance we can get? No, there are definitely better ways to implement a concurrent counter; let's have a look at them.

Alternate Concurrent Counters
Let's look at some counter implementations that handle contention in a better way.

- Core based counter - Maintain a counter for each logical core; that way you will have less contention. The only issue with this type of counter is that if the number of threads is more than the number of logical cores, you will start noticing contention again.

- Thread based counter - Maintain a counter for each thread that will be using the system. This works well when the number of threads is more than the number of logical cores. A sketch of this striping idea is shown below.
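A minimal sketch of the striping idea (illustrative, not the exact classes from the repo): spread the increments across several AtomicIntegers chosen by thread id, and sum them on read.

import java.util.concurrent.atomic.AtomicInteger;

public class StripedCounter {

    private final AtomicInteger[] slots;
    private final int mask;

    // slotCount ~ number of cores (core based) or number of threads (thread based),
    // rounded up to a power of two so a mask can replace the modulo.
    public StripedCounter(int slotCount) {
        slots = new AtomicInteger[slotCount];
        for (int i = 0; i < slotCount; i++) {
            slots[i] = new AtomicInteger();
        }
        mask = slotCount - 1;
    }

    public void increment() {
        // Each thread mostly lands on its own slot, so CAS contention drops sharply.
        slots[(int) Thread.currentThread().getId() & mask].incrementAndGet();
    }

    public long get() {
        long sum = 0;
        for (AtomicInteger slot : slots) {
            sum += slot.get();
        }
        return sum;
    }
}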


Let's test it.

Time taken by different types of counter

Y Axis - Time taken to increment 1 Million times

X Axis - Number of threads

The concurrent counters perform much better than the atomic based counter; for 16 threads they are around 5X better, which is a huge difference!

CAS Failure Rate

Y Axis - CAS Failure in 100Ks

X Axis - Number of threads

Due to contention, the atomic based counter sees a lot of failures, and the failure count goes up exponentially as I add more threads, while the other counters perform pretty well.

Observation
Multi core machines are becoming easily available, and we have to change the way we handle concurrency; the traditional way of doing concurrency is not going to scale at a time when 24 or 48 core servers are very common.

 - To reduce contention you have to use multiple counters and then aggregate them later.

 - A core based counter works well if the number of threads is less than or equal to the number of cores.

 - A thread based counter is good when the number of threads is much more than the available cores.

 - The key to reducing contention is identifying the counter to which a thread will write. I have used a simple approach based on thread id, but much better approaches are available; look at ThreadLocalRandom of JDK 8 for some ideas.

 - The thread based approach is used in LongAdder of JDK 8, which creates many slots to reduce contention.

Code for all the counters used in this test is available @ Github