Showing posts with label GarbageCollection. Show all posts
Showing posts with label GarbageCollection. Show all posts

Saturday, 17 August 2019

JVM with no garbage collection

JVM community keeps on adding new GC and recently new one was added and it is called Epsilon and is very special one. Epsilon only allocates memory but will not reclaim any memory.

Image result for garbage collection

It might look like what is use of GC that does not perform any garbage collection. This type of Garbage Collector has special use and we will look into some.

Where this shinny GC can be used ?
Performance Testing

If you are developing solution that has tight latency requirement and limited memory budget then this GC can be used to test limit of program.

Memory Pressure Testing
Want to know extract transient memory requirement by your application. I find this useful if you are building some pure In-Memory solution.

Bench marking Algorithm.
Many time we want to test the real performance of new cool algorithm based on our understanding of BIG (O) notion but garbage collector adds noise during testing.

Low Garbage
Many times we do some optimization in algorithm to reduce garbage produced and GC like epsilon helps in scientific verification of optimization.

How to enable epsilon GC

JVM engineers have taken special care that this GC should not enabled by default in production , so to use this GC we have to use below JVM options

-XX:+UnlockExperimentalVMOptions -XX:+UseEpsilonGC -Xlog:gc


One question that might be coming in your mind what happens when memory is exhausted ? JVM will stop with OutofMemory Error.

Lets look at some code to test GC

How to know if epsilon is used in JVM process?

Java has good management API that allows to query current GC being used, this can also be used to verify what is the default GC in different version of java.

Run above code with below options
-XX:+UnlockExperimentalVMOptions -XX:+UseEpsilonGC VerifyCurrentGC

How does code behave when memory is exhausted. 

I will use below code to show how new GC works.

Running above code with default GC and requesting 5GB allocation causes no issue (java -Xlog:gc -Dmb=5024 MemoryAllocator) and it produces below output

[0.016s][info][gc] Using G1
[0.041s][info][gc] Periodic GC disabled
Start allocation of 5024 MBs
[0.197s][info][gc] GC(0) Pause Young (Concurrent Start) (G1 Humongous Allocation) 116M->0M(254M) 3.286ms
[0.197s][info][gc] GC(1) Concurrent Cycle
[0.203s][info][gc] GC(1) Pause Remark 20M->20M(70M) 4.387ms
[0.203s][info][gc] GC(1) Pause Cleanup 22M->22M(70M) 0.043ms
[1.600s][info][gc] GC(397) Concurrent Cycle 6.612ms
[1.601s][info][gc] GC(398) Pause Young (Concurrent Start) (G1 Humongous Allocation) 52M->0M(117M) 1.073ms
[1.601s][info][gc] GC(399) Concurrent Cycle
I was Alive after allocation
[1.606s][info][gc] GC(399) Pause Remark 35M->35M(117M) 0.382ms

[1.607s][info][gc] GC(399) Pause Cleanup 35M->35M(117M) 0.093ms
[1.607s][info][gc] GC(399) Concurrent Cycle 6.062ms

Lets add some memory limit ( java -XX:+UnlockExperimentalVMOptions -XX:+UseEpsilonGC -Xlog:gc -Xmx1g -Dmb=5024 MemoryAllocator)

[0.011s][info][gc] Resizeable heap; starting at 253M, max: 1024M, step: 128M
[0.011s][info][gc] Using TLAB allocation; max: 4096K
[0.011s][info][gc] Elastic TLABs enabled; elasticity: 1.10x
[0.011s][info][gc] Elastic TLABs decay enabled; decay time: 1000ms
[0.011s][info][gc] Using Epsilon
Start allocation of 5024 MBs
[0.147s][info][gc] Heap: 1024M reserved, 253M (24.77%) committed, 52640K (5.02%) used
[0.171s][info][gc] Heap: 1024M reserved, 253M (24.77%) committed, 103M (10.10%) used
[0.579s][info][gc] Heap: 1024M reserved, 1021M (99.77%) committed, 935M (91.35%) used
[0.605s][info][gc] Heap: 1024M reserved, 1021M (99.77%) committed, 987M (96.43%) used

Terminating due to java.lang.OutOfMemoryError: Java heap space

This particular run caused OOM error and is good confirmation that after 1GB this program will crashed.

Same behavior is true multi thread program also, refer to MultiThreadMemoryAllocator.java for sample.

Unit Tests are available to test features of this special GC.

I think Epsilon will find more use case and adoption in future and this is definitely a good step to increase reach of JVM.

All the code samples are available Github repo

If you like the post then you can follow me on twitter .


Sunday, 18 November 2018

Insights from Spark UI

As continuation of anatomy-of-apache-spark-job post i will share how you can use Spark UI for tuning job

I will continue with same example that was used in earlier post, new spark application will do below things

 - Read new york city parking ticket
 - Aggregation by "Plate ID" and calculate offence dates
 - Save result

DAG for this code looks like this


























This is multi stage job, so some data shuffle is required, for this sample shuffle write is 564mb and output is 461 MB.

Lets see what we can do to reduce this ?
lets take top down approach from "Stage2". First thing that comes to mind is explore compression.

Current code

New Code


New code is only enabling gzip on write, lets see what we see on spark UI

Save with Gzip






With just write encoder write went down by 70%. Now it 135Mb and it speed up the job.

Lets see what else is possible before we dive in more internals tuning

Final output looks some like below

1RA32   1       05/07/2014
92062KA 2       07/29/2013,07/18/2013
GJJ1410 3       12/07/2016,03/04/2017,04/25/2015
FJZ3486 3       10/21/2013,01/25/2014
FDV7798 7       03/09/2014,01/14/2014,07/25/2014,11/21/2015,12/04/2015,01/16/2015

Offence date is stored in raw format, it is possible to apply little encoding on this to get some more speed.

Java 8 added LocalDate to make date manipulation easy and this class comes with some handy functions, one of that is toEpocDay.
This function convert date to day from 1970 and so it means that in 4 bytes(Int) we can store upto 5K years, this seems big saving as compared to current format which is taking 10 bytes.

Code snippet with epocDay

Spark UI after this change. I have also done one more change to use KryoSerializer






This is huge improvement , Shuffle write changed from 564Mb to 409MB ( 27% better) and output from 134Mb to 124 Mb( 8% better)

Now lets go to another section on Spark UI that shows logs from executor side.
GC logs for above run shows below thing

2018-10-28T17:13:35.332+0800: 130.281: [GC (Allocation Failure) [PSYoungGen: 306176K->20608K(327168K)] 456383K->170815K(992768K), 0.0222440 secs] [Times: user=0.09 sys=0.00, real=0.03 secs]
2018-10-28T17:13:35.941+0800: 130.889: [GC (Allocation Failure) [PSYoungGen: 326784K->19408K(327168K)] 476991K->186180K(992768K), 0.0152300 secs] [Times: user=0.09 sys=0.00, real=0.02 secs]
2018-10-28T17:13:36.367+0800: 131.315: [GC (GCLocker Initiated GC) [PSYoungGen: 324560K->18592K(324096K)] 491332K->199904K(989696K), 0.0130390 secs] [Times: user=0.11 sys=0.00, real=0.01 secs]
2018-10-28T17:13:36.771+0800: 131.720: [GC (GCLocker Initiated GC) [PSYoungGen: 323744K->18304K(326656K)] 505058K->215325K(992256K), 0.0152620 secs] [Times: user=0.09 sys=0.00, real=0.02 secs]
2018-10-28T17:13:37.201+0800: 132.149: [GC (Allocation Failure) [PSYoungGen: 323456K->20864K(326656K)] 520481K->233017K(992256K), 0.0199460 secs] [Times: user=0.12 sys=0.00, real=0.02 secs]
2018-10-28T17:13:37.672+0800: 132.620: [GC (Allocation Failure) [PSYoungGen: 326016K->18864K(327168K)] 538169K->245181K(992768K), 0.0237590 secs] [Times: user=0.17 sys=0.00, real=0.03 secs]
2018-10-28T17:13:38.057+0800: 133.005: [GC (GCLocker Initiated GC) [PSYoungGen: 324016K->17728K(327168K)] 550336K->259147K(992768K), 0.0153710 secs] [Times: user=0.09 sys=0.00, real=0.01 secs]
2018-10-28T17:13:38.478+0800: 133.426: [GC (Allocation Failure) [PSYoungGen: 322880K->18656K(326144K)] 564301K->277690K(991744K), 0.0156780 secs] [Times: user=0.00 sys=0.00, real=0.01 secs]
2018-10-28T17:13:38.951+0800: 133.899: [GC (Allocation Failure) [PSYoungGen: 323808K->21472K(326656K)] 582842K->294338K(992256K), 0.0157690 secs] [Times: user=0.09 sys=0.00, real=0.02 secs]
2018-10-28T17:13:39.384+0800: 134.332: [GC (Allocation Failure) [PSYoungGen: 326624K->18912K(317440K)] 599490K->305610K(983040K), 0.0126610 secs] [Times: user=0.11 sys=0.00, real=0.02 secs]
2018-10-28T17:13:39.993+0800: 134.941: [GC (Allocation Failure) [PSYoungGen: 313824K->17664K(322048K)] 600522K->320486K(987648K), 0.0111380 secs] [Times: user=0.00 sys=0.00, real=0.02 secs]

Lets focus on one the line

2018-10-28T17:13:39.993+0800: 134.941: [GC (Allocation Failure) [PSYoungGen: 313824K->17664K(322048K)] 600522K->320486K(987648K), 0.0111380 secs] [Times: user=0.00 sys=0.00, real=0.02 secs]

Heap before minor GC was 600MB and after that 320MB and total heap size is 987 MB.
Executor is allocated 2gb and this Spark application is not using all the memory, we can put more load on executor by send more task or bigger task.

I will reduce input partition from 270 to 100


With 270 input partition 


With 100 input partition













100 input partition looks better with around 10+% less data to shuffle.

Other tricks
Now i will share some of things that will make big difference in GC!

Code before optimization

Code after optimization

New code is doing optimized merge of set, it is adding small set to the big one and also introduced Case class.
Another optimization is in save function where it is using mapPartitions to reduce object allocation by using StringBuffer.

I used http://gceasy.io to get some GC stats.

Before code change


After code change

New code is producing less garbage for eg. 
 Total GC 126 gb vs 122 gb ( around 4% better)
 Max GC time 720ms vs 520 ms ( around 25% better)

Optimization looks promising.

All the code used in this blog is available on github repo sparkperformance

Stay tuned up for more on this.

Sunday, 22 September 2013

Linked List Revisited for In Memory computing

Linked list is one of the first data structure that you learn when you start reading about data structure.
Linked List has its own advantage over other data structure like array, just to recap few of them

Some benefit of Linked List
 - Insert cost is low, just update address of new node
 - Memory Allocation cost is very low because it has to just allocate one node and attach it to last node
 - Very good data structure when you don't know how much data you will be storing, it grow on need basis, no wastage of space
 - Very low overhead when element is removed.

Things that are not good about Linked List
 - Access by Index is bad and not recommended .

  - Creates lot of work for Garbage Collector, can result in frequent Minor GC due which young objects will be promoted early to old generation

  - Extra Node Object is required for each element , so total memory required is (Node+Value) * No Of Objects. This can become problem if there is lot of objects.

 - Random memory access because of the way nodes are linked, this is major draw back because random memory walk is slow as compared to linear walk, CPU does lot of hard work to optimized memory access but Linked List or Tree based Data Structure can't use all those optimization due to random memory walk.
On NUMA based machine it become worse.


Do we need Linked List now ?
That is really tough question to answer, but the way Linked List performs it looks like it has no space in Data Structure Library.

With the current In-Memory trend, where application is keeping all the data in memory and data structure like Linked List are not friendly for that.

Lets Tweak Linked List for today requirement.
So today we need Linked List that has advantage of both Linked List & Array, that way we can get best of both world.

For testing i will create MultiValueLinkedList that has feature of both array & linked list. So it is like each Node contains small Array List and all these nodes are linked.
So you can say Array List linked via Linked List.

For performance test i add some millions of items in Linked List/MultiValueLinkedList/Array List and iterate over it. I start with 10 Million element and increase it by 5 Million.


Add Performance 
Lets look at some numbers for add operation in different types of list


X Axis - Time taken for add in Ms
Y Axis - No of Element added in Millions

MultiValueLinkedList is clear winner in this test, it is around 22X times faster than Linked List.

Linked List performance is so bad that graph scale is screwed up and it is difficult to make out difference between MultiValueLinkedList & Array List, so i will put another graph for MultiValueLinkedList & Array List

ArrayList Vs MultiValueLinkedList

This graph gives better understanding of performance of Array List & MultiValueLinkedList, new linked list is 2.7X times faster than Array List.

 Iterate Performance
Look at iterate performance before i try to explain reason of performance boost



X Axis - Time taken for iterate/access in Ms
Y Axis - No of Element in Millions

As i said earlier that new Linked List will be hybrid one , it will take best from Linked List & Array List and this graph confirms that , it is somewhere in between Linked List & Array List.
MultiValueLinkedList is around 1X times faster than linked list

Garbage Collections
This test is all about allocation and will not be completed without GC activity graph, i want to put more GC observation but that will make this blog big, so i will put just one Visual GC graph

Visual GC Graph for Linked List

Visual GC Graph for MultiValueLinkedList


I don't have to explain much in these GC behavior, for Linked List minor GC activity is so much that object are promoted to old generation, all though it is not required.
GC behaves as expected for MultiValueLinkedList.

Observation
 - Performance of add is best for MultiValueLinkedList & it is due to less number of allocation & almost no need to copy the Array as it is required for Array List whenever it grows

 - Garbage produced by MultiValueLinkedList is very less. Memory usage of MultiValueLinkedList is same as ArrayList & for LinkedList it is double, that explains why we see object promotion to old generation.

- Iteration of MultiValueArrayList is slow as compared to Array List because memory access pattern is like Page by Page, it is possible to improve it further but adjusting the size of element in nodes.

Code is available @ github

Thursday, 25 July 2013

Which memory is faster Heap or ByteBuffer or Direct ?

Java is becoming new C/C++ , it is extensively used in developing High Performance System.
Good for millions of Java developer like me:-)

In this blog i will share my experiment with different types of memory allocation that can be done in java and what type of benefit you get with that.


Memory Allocation In Java
What type of support Java provide for memory allocation

 - Heap Memory
I don't i have to explain this, all java application starts with this.  All object allocated using "new" keyword goes under Heap Memory

- Non Direct ByteBuffer
It is wrapper over byte array, just flavor of Heap Memory.
ByteBuffer.allocate() can be used to create this type of object, very useful if you want to deal in terms of bytes not Object.

 - Direct ByteBuffer
This is the real stuff that java added since JDK 1.4.
Description of Direct ByteBuffer based on Java Doc

"A direct byte buffer may be created by invoking the allocateDirect factory method of this class. The buffers returned by this method typically have somewhat higher allocation and deallocation costs than non-direct buffers. The contents of direct buffers may reside outside of the normal garbage-collected heap, and so their impact upon the memory footprint of an application might not be obvious. It is therefore recommended that direct buffers be allocated primarily for large, long-lived buffers that are subject to the underlying system's native I/O operations. In general it is best to allocate direct buffers only when they yield a measureable gain in program performance."

Important thing to note about Direct Buffer is 
 - It is Outside of JVM
 - Free from Garbage Collector reach.

These are very important thing if you care about performance.
MemoryMapped file are also flavor of Direct byte buffer, i shared some of my finding with that in below blogs

arraylist-using-memory-mapped-file
power-of-java-memorymapped-file

- Off Heap or Direct Memory
This is almost same as Direct ByteBuffer but with little different, it can be allocated by unsafe.allocateMemory, as it is direct memory so it creates no GC overhead. Such type of memory must be manually released.

In theory Java programmer are not allowed to do such allocation and i think reason could be
 - It is complex to manipulate such type of memory because you are only dealing with bytes not object
 - C/C++ community will not like it :-)

Lets take deep dive into memory allocation

For memory allocation test i will use 13 byte of message & it is broken down into
 - int - 4 byte
 - long - 8 byte
 - byte - 1 byte

I will only test write/read performance, i am not testing memory consumption/allocation speed.

Write Performance



X Axis - No Of Reading
Y Axis - Op/Second in Millions











5 Million 13 bytes object are written using 4 types of allocation.
Direct ByteBuffer & Off Heap are best in this case, throughput is close to 350 Million/Sec
Normal ByteBuffer is very slow, TP is just 85 Million/Sec
Direct/Off Heap is around 1.5X times faster than heap


I did same test with 50 Million object to check how does it scale, below is graph for same.


X Axis - No Of Reading
Y Axis - Op/Second in Millions










Numbers are almost same as 5 Million.

Read Performance

Lets look at read performance


X Axis - No Of Reading
Y Axis - Op/Second in Millions











This number is interesting, OFF heap is blazing fast throughput for 12,000 Millions/Sec :-)
Only close one is HEAP read which is around 6X times slower than OFF Heap.

Look at Direct ByteBuffer , it is tanked at just 400 Million/Sec, not sure why it is so

Lets have look at number for 50 Million Object

X Axis - No Of Reading
Y Axis - Op/Second in Millions












Not much different.

Conclusion
Off heap via Unsafe is blazing fast with 330/11200 Million/Sec.
Performance for all other types of allocation is either good for read or write, none of the allocation is good for both.
Special note about ByteBuffer, it is pathetic , i am sure you will not use this after seeing such number.
DirectBytebuffer sucks in read speed, i am not sure why it is so slow

So if memory read/write is becoming bottle neck in your system then definitely Off-heap is the way to go, remember it is highway, so drive with care.

Code is available @ git hub

Update
Fixing broken code link - code available at github