Sunday, 9 June 2013

CPU cache access pattern

Things to know about low latency

If you are developing low latency application then you have to remove bottleneck that can cause your latency to increase, some of the bottleneck that comes to my mind

 - Input/Output - I/O has big effect in latency it will cause CPU to stall. SSD disk may be one option, but still you want to keep this out of core processing logic.

 - Network - Same effect as I/O, so you have to reduce Network read/write in your core logic, there are many software and hardware solution to improve this.

- Locks in code - Using non-blocking techniques.

- Keeping data in RAM - I will explore this option in blog.

So we can keep all data in RAM to get best latency possible. RAM is very cheap, so lets put every thing in RAM. In theory you should be able to keep CPU busy if all the data that you need for computing is available in RAM, but in practical it doesn't happen that way because of underlying datastructure used.
CPU stalls for some time due to slow access of RAM.

Experiment
Lets try simple experiment to prove that keeping all data in RAM is just not enough to achieve low latency.

Problem:
25 million instrument object are stored in memory and each object has 2 attribute
 - Instrument Id
 - Price 
Sets of input price is passed as input and all the instrument matching that price will be returned. It is a simple search based on price attribute.

Structure Of Instrument Object
Two types of instrument object is created for this test.

 - As simple java object
Instrument object is simple java class with 2 attribute

 class Instrument
{
public int id;
private int price;
public Instrument(int id, int price) {
super();
this.id = id;
this.price = price;
}

public int getId() {
return id;
}

public int getPrice() {
return price;
}
}

- Column type object
For each attribute array is created, this structure is more CPU cache friendly, we will why is is so later.
Sample ColumnInstrument Object

 class ColumnInstrumentStore
{

private int[] ids;
private int[] prices;

public ArrayInstrumentStore(int size)
{
ids = new int[size];
prices = new int[size];
}


public int getId(int index) {
return ids[index];
}


public int getPrice(int index) {
return prices[index];
}
public void setValue(int index, int id, int price) {
ids[index] = id;
prices[index] = price;
}

}

Lets measure performance
Lets try to measure search cost for 2 types of instrument object,

 X axis - No of reading
 Y axsis - Time taken in Ms

Linear search is performed on 25 million instrument object.
Instrument object defined as column is around 5% to 25% faster as compared to normal java based instrument object, on average it is 15% faster as compared to java based object.

Lets look at another number - Time take for each iteration. Time taken for 1 iteration is very less so i have used nano second to measure that.



 X axis - No of reading
 Y axis - Time taken in NanoSecond

This explains why search on column based objects is fast, time spent per iteration is around 15% less .

Why Column based object is fast
Lets look at the reason why search performs better on Column layout type of object as compared to normal object.

 - CPU Pre-Fetching - Today almost all the processor supports hardware pre-fetch. CPU pre-fetch data that is required by application to reduce memory latency

 - Better use of CPU cache - CPU moves data from main memory to cache via cache lines, in current generation of CPU, cache line is of 64 byte, so if you ask for 1 byte of data, processor will still bring 64 byte of data from ram to CPU cache line. This is called - Spatial locality, definition from wikipedia -

Spatial locality: if a particular memory location is referenced at a particular time, then it is likely that nearby memory locations will be referenced in the near future. In this case it is common to attempt to guess the size and shape of the area around the current reference for which it is worthwhile to prepare faster access

This CPU behavior can be used to reduce round trip to ram by co-locating you data.

Lets take Instrument object to understand layout

- Object Layout


- Column Layout


In case of column based instrument, data is stored in integer array, if program request for single value of array , it will get next 15 element of array due to cache line transfer, but in case of normal object it will get only 7 instrument price.
Normal object layout requires double round trip to memory and that is the reason why it is slow.
If you co locate your objects then you can use CPU cache much more efficiently.

Most of the real object are more complex than Instrument object that i have used and you can imagine number of round trip it require to RAM.
You can get more details of memory latency from latency-number-that-you-should-know
Just to put those number in context -

L1 Cache : 0.5 ns
Main Memory  : 100 ns ( 20x times more than L1)


How it performs in multi-threaded world.
In today's time it is impossible to get machine with single core, we live in multi core world!
Each logical core has its own private cache, so if you data structure is cache friendly then it will scale linearly as you add more threads/core to your processing.

Lets look at some number for multi threaded search.

Search latency


Loop Iteration

Both search & iteration per second is improving as more number of threads are added, this will become more effective for NUMA machines because memory access is not uniform.

Just by having CPU cache friendly layout you can improve application performance
Column layout have more benefit, more on next blog.

1 comment: