Thursday, 25 July 2013

Which memory is faster Heap or ByteBuffer or Direct ?

Java is becoming new C/C++ , it is extensively used in developing High Performance System.
Good for millions of Java developer like me:-)

In this blog i will share my experiment with different types of memory allocation that can be done in java and what type of benefit you get with that.


Memory Allocation In Java
What type of support Java provide for memory allocation

 - Heap Memory
I don't i have to explain this, all java application starts with this.  All object allocated using "new" keyword goes under Heap Memory

- Non Direct ByteBuffer
It is wrapper over byte array, just flavor of Heap Memory.
ByteBuffer.allocate() can be used to create this type of object, very useful if you want to deal in terms of bytes not Object.

 - Direct ByteBuffer
This is the real stuff that java added since JDK 1.4.
Description of Direct ByteBuffer based on Java Doc

"A direct byte buffer may be created by invoking the allocateDirect factory method of this class. The buffers returned by this method typically have somewhat higher allocation and deallocation costs than non-direct buffers. The contents of direct buffers may reside outside of the normal garbage-collected heap, and so their impact upon the memory footprint of an application might not be obvious. It is therefore recommended that direct buffers be allocated primarily for large, long-lived buffers that are subject to the underlying system's native I/O operations. In general it is best to allocate direct buffers only when they yield a measureable gain in program performance."

Important thing to note about Direct Buffer is 
 - It is Outside of JVM
 - Free from Garbage Collector reach.

These are very important thing if you care about performance.
MemoryMapped file are also flavor of Direct byte buffer, i shared some of my finding with that in below blogs

arraylist-using-memory-mapped-file
power-of-java-memorymapped-file

- Off Heap or Direct Memory
This is almost same as Direct ByteBuffer but with little different, it can be allocated by unsafe.allocateMemory, as it is direct memory so it creates no GC overhead. Such type of memory must be manually released.

In theory Java programmer are not allowed to do such allocation and i think reason could be
 - It is complex to manipulate such type of memory because you are only dealing with bytes not object
 - C/C++ community will not like it :-)

Lets take deep dive into memory allocation

For memory allocation test i will use 13 byte of message & it is broken down into
 - int - 4 byte
 - long - 8 byte
 - byte - 1 byte

I will only test write/read performance, i am not testing memory consumption/allocation speed.

Write Performance



X Axis - No Of Reading
Y Axis - Op/Second in Millions











5 Million 13 bytes object are written using 4 types of allocation.
Direct ByteBuffer & Off Heap are best in this case, throughput is close to 350 Million/Sec
Normal ByteBuffer is very slow, TP is just 85 Million/Sec
Direct/Off Heap is around 1.5X times faster than heap


I did same test with 50 Million object to check how does it scale, below is graph for same.


X Axis - No Of Reading
Y Axis - Op/Second in Millions










Numbers are almost same as 5 Million.

Read Performance

Lets look at read performance


X Axis - No Of Reading
Y Axis - Op/Second in Millions











This number is interesting, OFF heap is blazing fast throughput for 12,000 Millions/Sec :-)
Only close one is HEAP read which is around 6X times slower than OFF Heap.

Look at Direct ByteBuffer , it is tanked at just 400 Million/Sec, not sure why it is so

Lets have look at number for 50 Million Object

X Axis - No Of Reading
Y Axis - Op/Second in Millions












Not much different.

Conclusion
Off heap via Unsafe is blazing fast with 330/11200 Million/Sec.
Performance for all other types of allocation is either good for read or write, none of the allocation is good for both.
Special note about ByteBuffer, it is pathetic , i am sure you will not use this after seeing such number.
DirectBytebuffer sucks in read speed, i am not sure why it is so slow

So if memory read/write is becoming bottle neck in your system then definitely Off-heap is the way to go, remember it is highway, so drive with care.

Code is available @ git hub

Update
Fixing broken code link - code available at github

Saturday, 20 July 2013

ArrayList Using Memory Mapped File

Introduction
In-Memory computing is picking up due to affordable hardware, most of the data is kept in RAM to meet latency and throughput goal, but keeping data in RAM create Garbage Collector overhead especially if you don't pre allocate.
So effectively we need garbage less/free approach to avoid GC hiccups

Garbage free/less data structure
There are couple of option to achieve it
 - Object Pool 
Object pool pattern is very good solution, i wrote about that in Lock Less Object Pool blog

- Off Heap Objects
JVM has very good support for creating off-heap objects. You can get rid of GC pause if you take this highway and highway has its own risk!

-MemoryMapped File
This is mix of Heap & Off Heap, like best of world.

Memory mapped file will allow to map part of the data in memory and that memory will be managed by OS, so it will create very less memory overhead in JVM process that is mapping file.
This can help in managing data in garbage free way and you can have JVM managing large data.
Memory Mapped file can be used to develop IPC, i wrote about that in power-of-java-memorymapped-file blog

In this blog i will create ArrayList that is backed up by MemoryMapped File, this array list can store millions of object and with almost no GC overhead. It sounds crazy but it is possible.

Lets gets in action
In this test i use Instrument object that has below attribute
 - int id
 - double price

So each object is of 12 byte.
This new Array List holds 10 Million Object and i will try to measure writer/read performance

Writer Performance



X Axis - No Of Reading
Y Axis - Time taken to add 10 Million in Ms









Adding 10 Million element is taking around 70 Ms, it is pretty fast.

Writer Throughput
Lets look at another aspect of performance which is throughput


X Axis - No Of Reading
Y Axis - Throughput /Second , in Millions









Writer throughput is very impressive, i ranges between 138 Million to 142 Million

Reader Performance

X Axis - No Of Reading
Y Axis - Time taken to read 10 Million in Ms









It is taking around 44 Ms to read 10 Million entry, very fast. With such type of performance you definitely challenge database.

 Reader Throughput

X Axis - No Of Reading
Y Axis - Throughput /Second , in Millions









Wow Throughput is great it is 220+ million per second

It looks very promising with 138 Million/Sec writer throughput & 220 Million/Sec reader throughput.

Comparison With Array List
Lets compare performance of BigArrayList with ArrayList,

Writer Throughput - BigArrayList Vs ArrayList




 Throughput of BigArrayList is almost constant at around 138 Million/Sec, ArrayList starts with 50 Million and drops under 5 million.

ArrayList has lot of hiccups and it is due to 
 - Array Allocation
 - Array Copy
 - Garbage Collection overhead

BigArrayList is winner in this case, it is 7X times faster than arraylist.

Reader Throughput - BigArrayList Vs ArrayList

ArrayList performs better than BigArrayList, it is around 1X time faster.

BigArrayList is slower in this case because
 - It has to keep mapping file in memory as more data is requested
 - There is cost of un-marshaling

Reader Throughput for BigArrayList is 220+ Million/Sec, it is still very fast and only few application want to process message faster than that.
So for most of the use-case this should work.

Reader performance can be improved by using below techniques 
 - Read message in batch from mapped stream
 - Pre-fetch message by using Index, like what CPU does

By doing above changes we can improve performance by few million, but i think for most of the case current performance is pretty good

Conclusion
Memory mapped file is interesting area to do research, it can solve many performance problem.
Java is now being used for developing trading application and GC is one question that you have to answer from day one, you need to find a way to keep GC happy and MemoryMapped is one thing that GC will love it.

Code used for this blog is available @ GitHub , i ran test with 2gb memory.
Code does't handle some edge case , but good enough to prove the point that that MemoryMapped file can be winner in many case.

Sunday, 7 July 2013

How To Write Micro benchmark In Java

So many article has been written on how to write micro-bench mark in java, this blog is my attempt to explain the topic.

What is Micro benchmark
Micro benchmark try to measure actual speed of small piece of code, measuring performance is very complex because JIT does magic to make code fast and you have to measure the code after all the JIT optimization is done.

What factors should be considered for benchmark 
Some of the factors that you should consider for correct bench-marking are

  • Warm-up phase - This is to start measuring only after JIT has compiled hot code, JIT will compile functions only after 10K invocation.
  • Client vs Server - Running JVM in client or server mode will give different performance result
  • GC  - Garbage Collector makes measurement really difficult 
  • Re-compilation effect - Code executing first time will be slow
More details are available @ MicroBenchmarks Rules

How to write micro benchmark
This can be very difficult to write but not to worry, now there are some framework available to do that
 - Caliper  - It is Google solution for writing benchmark
 - JMH   - This one is from java guys, in this blog i will explain some of example using JMH

How to use JMH
JMH is maven based project but don't know the reason why it is not hosted on maven central.
You need TortoiseHg to download code, once you have the code then it is cake walk to setup/build the JMH project.
Refer to Setup page for detail instruction on setup.

Lets look at some examples
As every programming language starts with hello world, so does JHM .
JMH has used annotation based approach to specify which function you want to run as part of benchmark,  GenerateMicroBenchmark annotation is used for that.

-Empty function 
JVM is complex piece of software and it has some overhead, having empty function checks that overhead
Numbers that come outs with this test are 

Run result "wellHelloThere": 2893313.782 ▒(95%) 118441.448 ▒(99%) 161902.046 ops
/msec
Run statistics "wellHelloThere": min = 2388762.991, avg = 2893313.782, max = 310
8116.404, stdev = 253075.135
Run confidence intervals "wellHelloThere": 95% [2774872.334, 3011755.229], 99% [
2731411.736, 3055215.827]

This is interesting number, if you have some empty function then it can be called 2.8 million times per mili second, that is very fast.

JHM does all the magic and shows number that we are interested in. JMH is based on code generation, it generate lots of code to measure performance, all the generated code is available under "generated-sources"

Have look at generated file

That is lot of code!

-Benchmark modes
JHM has benchmark modes, very useful, it has most of the mode that you can think of for eg
 - Throughput
- AverageTime
 - SampleTime - Time distribution, percentile estimation
- SingleShotTime

I must say very well thought of by java team.

-States
This is very common use case , this test measure overhead of shared object vs per thread object. Result are interesting.
Run result "measureUnshared": 1264160.208 (95%) 20349.174 (99%)
Run result "measureShared": 965178.728 (95%) 20847.118 (99%)

Unshared test is around 30% fast as compared to shared & it is due to contention on memory .

-DeadCode
This benchmark measure effect of dead code, very common case.
Lets look at sample code
Results are interesting in this case, JVM is very smart enough to remove dead code from call and time take for function as good as there is no call to log function.

-Constant Folding
This is another interesting optimization that is done by compiler,Constant folding techniques will optimize call to functions that takes constant parameter,sometime this can result in tricky defects.
Way to avoid trap is read inputs via some state. Lets have look at sample code

measureWrong function is faster because JVM performs optimization based on constant folding rules, but be very cautious this an cause some defect in you code!

-Multiple Implementation
This is very interesting test, this test has multiple implementation of counter interface, code is same but just 2 implementation and result are random, you can't conclude why one implementation fast as compared to other.

JVM will mix profile of two test and due to which result are random. JMH has option to measure this type of test properly, there is annotation Fork that will execute benchmark in new JVM.

- Compile Control
JMH has support to measure effect of inline, no inline, exclude compile. These things are really useful to find performance number under different scenario. Sample benhmark for compile control

- Asymmetric
JMH has very good support for asymmetric execution for eg producer/consumer, in traditional test we have to write thread related code to make this happen, but JMH has declarative way of specifying such behavior, multiple benchmark methods can be grouped by using Group annotation.
Sample code for grouping.

Conclusion
JHM is very useful tool for micro bench marking, it is worth spending time to move some of the performance test of application to JHM.
Writing benchmark will give very good understanding of how JVM works.