Are you ready: Performance

Showing posts with label Performance. Show all posts

Tuesday, 23 June 2020

Bit fiddling every programmer should know

Bit fiddling looks like magic, it allows to do so many things in very efficient way.
In this post i will share some of the real world example where bit operation can be used to gain good performance.

Technology Basics: Bits and Bytes - Business Technology, Gadgets ...

Technology Basics: Bits and Bytes - Business Technology, Gadgets ...

Bit wise operation bootcamp
Bit operator include.
- AND ( &)
- OR ( | )
- Not ( ~)
- XOR( ^)
- Shifts ( <<, >>)

Wikipedia has good high level overview of Bitwise_operation. While preparing for this post i wrote learning test and it is available learningtest github project. Learning test is good way to explore anything before you start deep dive. I plan to write detail post on Learning Test later.

In these examples i will be using below bits tricks as building block for solving more complex problem.

countBits - Count number of 1 bits in binary
bitParity - Check bit added to binary code
set/clear/toggle - Manipulating single bit
pow2 - Find next power of 2 and using it as mask.

Code for these function is available @ Bits.java on github and unit test is available @ BitsTest.java

Lets look at some real world problems now.

Customer daily active tracking
E-commerce company keep important metrics like which days customer was active or did some business. This metrics becomes very important for building models that can be used to improve customer engagement. Such type of metrics is also useful for fraud or risk related usecase.
Investment banks also use such metrics for Stocks/Currency for building trading models etc.

Using simple bit manipulation tricks 30 days of data can be packed in only 4 bytes, so to store whole year of info only 48 bytes are required.

Code snippet

Apart from compact storage this pattern have good data locality because whole thing can be read by processor using single load operation.

Transmission errors
This is another area where bit manipulation shines. Think you are building distributed storage block management software or building some file transfer service, one of the thing required for such service is to make sure transfer was done properly and no data was lost during transmission. This can be done using bit parity(odd or even) technique, it involves keeping number of '1' bits to odd or even.

Another way to do such type of verification is Hamming_distance. Code snippet for hamming distance for integer values.

Very useful way to keep data integrity with no extra overhead.
Locks
Lets get into concurrency now. Locks are generally not good for performance but some time we have to use it. Many lock implementation are very heavy weight and also hard to share between programs .In this example we will try to build lock and this will be memory efficient lock, 32 locks can be managed using single Integer.

Code snippet

This example is using single bit setting trick along with AtomicInteger to make this code threadsafe.
This is very lightweight lock. As this example is related to concurrency so this will have some issues due to false sharing and it is possible to address this by using some of the technique mention in scalable-counters-for-multi-core post.

Fault tolerant disk
Lets get into some serious stuff. Assume we have 2 disk and we want to make keep copy of data so that we can restore data incase one of the disk fails, naive way of doing this is to keep backup copy of every disk, so if you have 1 TB then additional 1 TB is required. Cloud provider like Amazon will be very happy if you use such approach.
Just by using XOR(^) operator we can keep backup for pair of disk on single disk, we get 50% gain.
50% saving on storage expense.

Code snippet testing restore logic.

Disk code is available @ RaidDisk.java

Ring buffer
Ring buffer is very popular data structure when doing async processing , buffering events before writing to slow device. Ring buffer is bounded buffer and that helps in having zero allocation buffer in critical execution path, very good fit for low latency programming.
One of the common operation is finding slot in buffer for write/read and it is done by using Mod(%) operator, mod or divide operator is not good for performance because it stalls execution because CPU has only 1 or 2 ports for processing divide but it has many ports for bit wise operation.

In this example we will use bit wise operator to find mod and it is only possible if mod number is powof2. I think it is one of the trick that everyone should know.

n & (n-1)

If n is power of 2 then 'x & (n-1)' can be used to find mod in single instruction. This is so popular that it is used in many places, JDK hashmap was also using this to find slot in map.

Conclusion
I have just shared at very high level on what is possible with simple bit manipulation techniques.
Bit manipulation enable many innovative ways of solving problem. It is always good to have extra tools in programmer kit and many things are timeless applicable to every programming language.

All the code used in post is available @ bits repo.

Thursday, 25 August 2016

Lazy evaluation

Recently i was writing log4j appender and wanted to use logger in it to log some diagnostic details during custom appender creation, but log4j initialization completes only after appender instance are created, so message logged during this phase are ignored.

I felt the need for lazy initialization in custom appender and started to look at options.

In this blog i will share things that i tried.

One of the thing that came to my mind was Singleton approach but now it is known fact that singleton causes problem with testing and make it impossible to extend it, so approach of mixing concurrency & object construction is not that good.

Incase if singleton is required then it is better to use Dependency Injection framework rather than spoiling your application code.

Lets get back to lazy initialization/eval.

Some programming language like scala/swift etc has support for lazy, so no custom code is required to do this but in java space we still have to write thread safe code to get it right.

Lets look at some options we have in java and what type of performance we get.

- Brute force using Synchronized
This is the most simple and inefficient one, scala is using this approach. Scala one is available @ ScalaLazy.java

- Double lock
This is little complex to write and gives good performance.

- Using Future task
This approach is simple to write and gives good performance.

Double lock approach gives the best performance and brute force one is worst. I did quick bench mark for 1 Million calls using different number of thread.

Single lock performance is very bad, lets have look at the number by removing single lock to see how Double Lock & Future Task performed.

These benchmark are done very quickly but detailed benchmark numbers should be close.

Code for this blog post is available @ github

Thursday, 23 July 2015

Efficiency with Algorithms

Recently had look at excellent talk on Efficiency with Algorithms, Performance with Data Structures , this talk has really some good content on performance.

In this blog i will share some of ideas from above talk and few things that i have learned.

Pre processing
This is very common trick, this is trade off between processing required when actual requests comes vs time taken to pre compute some thing, some of the good examples are.

IndexOf
This is very common string operation, almost all application needs this algorithm, java have brute force algorithm to solve this but this can become bottle neck very soon , so it is better to use Knuth–Morris–Pratt_algorithm algorithm, which pre computes table to get better performance.

Search
Sequential search vs binary search is trade off between search time or pre processing (i.e sorting) time, it starts to give return if number of compare required goes over 0(log n)

Binary search is heart of so many algorithm, this reduces expensive search operation.

Many String permutation search problems are solve using this.

Hashmap in java has nice optimization when many keys with same hashcode are added, to ameliorate impact, when keys are Comparable, this class may use comparison order among keys to help break ties

Indexes
Lookup is expensive operation and binary search can't answer all types of query,so you should build specialized index for fast search, some of the options are key/value, prefix/suffix , bitmap index, inverted index etc

Pre Allocate
This is classic one, so if you know how big your data structure will be then it is better to pre allocate it rather than keep it expanding multiple times and incur cost of allocation & copy.

Array based data structure doubles capacity every time re-allocation is required and this is done to amortized allocation cost, so they do big and less frequent allocation, in java world ArrayList, HashMap etc are good example of that.

Many design pattern like Object Pooling/Fly Weight etc are based on this.

Reduce Expensive/Duplicate operation
Each algorithm has some expensive operation and to make it efficient those expensive operation should be reduced, some of the common example are

HashCode
Hash code computation is expensive operation, so this should be not be computed multiple times for given key. One way is compute it once and pass it to other functions , so recomputation is not required or pre-compute it for eg String class in java pre compute hash code to save some time.

Modulus /Divide
This is one of the expensive arithmetic operation, bit map operation can be used to do same thing but at lower cost.
If data structure is circular for eg Circular array and want to put value in free slot then Modulus operation is required and if capacity of array is power of 2 then bit wise operation ( i.e index & ( capacity-1) ) can be used to get Modulus value and this will give significant performance gain at the cost of extra memory. This technique is used by HashMap in java

Similarly right shift ( >>) operation can be used for divide to get better performance, but now a days compiler are smart, so you get this one for free, no need to write it.

Reduce copy overhead
Array based data structure amortized re-sizing cost by increasing capacity by 2 times , this is good but overhead of copy also comes along with it, another approach is chain of arrays, so this way you only allocate one big chunk but don't have to do copy of old value, just add this new block of memory to chain of allocated blocks.
gs-collection has CompositeFastList which is build using this approach.

Batching
Expensive operation like write to file/network/database etc should be batched to get best performance.

Co-locate data
Keeping all required data together gives very good performance based on how processor works, but this is one thing that is broken in many application.
Mike Acton talk about this in Data-Oriented Design and C++ talk in details.

Column based storage are very good analytic/aggregation use case because most of the time single column data for all rows are required to answer analytic/aggregation request.

Linear probing hash tables are also good example of data co-location.

Bit Packing is another option to keep required data very close. but should be used with extra caution because this can make code complex.

Unnecessary Work
Some of algorithm suffer from this problem most common are recursive algorithm like factorial or Fibonacci etc, both of this has duplicate work problem, which can be fixed by Memoization technique.

Simplicity & readability of code should not be lost during efficiency ride because next guy has to maintain it :-)

Thursday, 7 May 2015

Experiment with String Split

Splitting string based on some token is very common operation in application, in this blog i will share some options that are mostly used and types of overhead involved in it.

String.split
This is the most common approach and it looks harmless unless you look at the code!
First it creates ArrayList for storing values and then this arraylist is converted to String array

This function produces too much of garbage and it provides only one way to extract values.

String Tokenizer
This is much better than String.split because it does not need intermediate buffer like ArrayList/String Array to hold the values, but it creates String objects for each token which adds to garbage.

One of the good thing about StringTokenizer is that it does not force caller to use fixed data structure , so caller is free to decide on what to do with values, it is just like Streaming operation.

String.split vs StringTokenizer
Lets look at some performance numbers. In this test i take below sample string

String[] values = {
        "this,is,simple,test",

        "lets,see,how,it,works",

        "this,is,very,simple,test"};

Each line is split 10 million times

So definitely StringTokenizer is winner in this case and it is because it produces less garbage as compared to String.split but it still produces string objects.

It is possible to avoid creating those String object also by using Recycle Charsequence which is just like String but gives lots of flexibility.

Lets look at another implementation using recycle charsequence approach.

It is very simple technique to avoid intermediate string object, lets look at performance number using this approach.

Recycle charSequence shines in this test, it is around 3X times faster than String.split an 2X times faster than StringTokenizer.

Code available @ github

Saturday, 20 July 2013

ArrayList Using Memory Mapped File

Introduction
In-Memory computing is picking up due to affordable hardware, most of the data is kept in RAM to meet latency and throughput goal, but keeping data in RAM create Garbage Collector overhead especially if you don't pre allocate.
So effectively we need garbage less/free approach to avoid GC hiccups

Garbage free/less data structure
There are couple of option to achieve it
- Object Pool
Object pool pattern is very good solution, i wrote about that in Lock Less Object Pool blog

- Off Heap Objects
JVM has very good support for creating off-heap objects. You can get rid of GC pause if you take this highway and highway has its own risk!

-MemoryMapped File
This is mix of Heap & Off Heap, like best of world.

Memory mapped file will allow to map part of the data in memory and that memory will be managed by OS, so it will create very less memory overhead in JVM process that is mapping file.
This can help in managing data in garbage free way and you can have JVM managing large data.
Memory Mapped file can be used to develop IPC, i wrote about that in power-of-java-memorymapped-file blog

In this blog i will create ArrayList that is backed up by MemoryMapped File, this array list can store millions of object and with almost no GC overhead. It sounds crazy but it is possible.

Lets gets in action
In this test i use Instrument object that has below attribute
- int id
- double price

So each object is of 12 byte.
This new Array List holds 10 Million Object and i will try to measure writer/read performance

Writer Performance

X Axis - No Of Reading
Y Axis - Time taken to add 10 Million in Ms

Adding 10 Million element is taking around 70 Ms, it is pretty fast.

Writer Throughput

Lets look at another aspect of performance which is throughput

X Axis - No Of Reading
Y Axis - Throughput /Second , in Millions

Writer throughput is very impressive, i ranges between 138 Million to 142 Million

Reader Performance

X Axis - No Of Reading
Y Axis - Time taken to read 10 Million in Ms

It is taking around 44 Ms to read 10 Million entry, very fast. With such type of performance you definitely challenge database.

Reader Throughput

X Axis - No Of Reading
Y Axis - Throughput /Second , in Millions

Wow Throughput is great it is 220+ million per second

It looks very promising with 138 Million/Sec writer throughput & 220 Million/Sec reader throughput.

Comparison With Array List
Lets compare performance of BigArrayList with ArrayList,

Writer Throughput - BigArrayList Vs ArrayList

Throughput of BigArrayList is almost constant at around 138 Million/Sec, ArrayList starts with 50 Million and drops under 5 million.

ArrayList has lot of hiccups and it is due to
- Array Allocation
- Array Copy
- Garbage Collection overhead

BigArrayList is winner in this case, it is 7X times faster than arraylist.

Reader Throughput - BigArrayList Vs ArrayList

ArrayList performs better than BigArrayList, it is around 1X time faster.

BigArrayList is slower in this case because
- It has to keep mapping file in memory as more data is requested
- There is cost of un-marshaling

Reader Throughput for BigArrayList is 220+ Million/Sec, it is still very fast and only few application want to process message faster than that.
So for most of the use-case this should work.

Reader performance can be improved by using below techniques
- Read message in batch from mapped stream
- Pre-fetch message by using Index, like what CPU does

By doing above changes we can improve performance by few million, but i think for most of the case current performance is pretty good

Conclusion
Memory mapped file is interesting area to do research, it can solve many performance problem.
Java is now being used for developing trading application and GC is one question that you have to answer from day one, you need to find a way to keep GC happy and MemoryMapped is one thing that GC will love it.

Code used for this blog is available @ GitHub , i ran test with 2gb memory.
Code does't handle some edge case , but good enough to prove the point that that MemoryMapped file can be winner in many case.

Sunday, 19 May 2013

Lock Less Java Object Pool

It is being while i wrote anything, i has been busy with my new job that involve doing some interesting work in performance tuning. One of the challenge is to reduce object creation during critical part of application.

Garbage Collection hiccups has been main pain point in java for some time, although java has improved over time with GC algorithmic. Azul is market leader developing pause less GC but Azul JVM are not free as speech!

Creating too many temporary/garbage object does't work too well because it create work for GC and it is going to have effect on latency, too much garbage also does't work well with multi core system because it causes cache pollution.

So how should we fix this ?

Garbage less coding
This is only possible if you know how many object you need upfront and pre-allocate them, but in reality that is very difficult to find that , but in-case if you still managed to do that then you have to worry about another issue

You might not have enough memory to hold all the object you need
You have to handle concurrency

So what is the solution for above problem

There is Object Pool design pattern that can address both of the above issue,it lets you to specify num of object that you need in pool and handles concurrent request to serve object request.

Object Pool has been base of many application that has low latency requirement, flavor of object pool is Flyweight design pattern.

Both of above pattern will help us in avoiding object creation, that is great so now GC work is reduced and in theory our application performance should improve but in practical does't happen that way because Object Pool/Flyweight has to handle concurrency and whatever advantage you get by avoiding object creation is lost because of concurrency issue.

What are most common way to handle concurrency

Object pool is typical producer/consumer problem and it can be solved by using following techniques

Synchronized - This was the only way to handle concurrency before JDK 1.5, apache has written wonder full object pool API based on synchronized

Locks - Java added excellent support for concurrent programming since JDK 1.5, there has been some work to use Locks to develop Object Pool for eg furious-objectpool

Lock Free - I could not find any implementation that is built using fully lock free technique, but furious-objectpool use mix of ArrayBlocking queue & ConcurrentLinked queue

Lets measure performance

In this test i have created pool of 1 Million object and those object are accessed by different pool implementation, objects are taken from pool and return back to pool.

This test first starts with 1 thread and then number of threads are increased to measure how different pool implementation perform under contention

X Axis - No Of Threads

Y Axis - Time in Ms - Lower time is better

This test include pool from Apache, Furious Pool & ArrayBlocking based Pool

Apache one is worst and as number of threads increase performance degrades further and reason for same is Apache pool is based on heavy use of "synchronized"

Other two Furious & ArrayBlocking based pool performs better but both of them also slows down as contention increase.

ArrayBlocking queue based pool takes around 1000 ms for 1 Million items when 12 threads are trying to access the pool, Furious pool which internally uses Arrayblocking queue takes around 1975 ms for same thing.

I have to do some more detail investigation to find out why Furious is taking double time because it is also based on ArrayBlocking queue.

Performance of arrayblocking queue is decent but it is lock based approach, what type of performance we can get if we can implement lock free pool.

Lock free pool.

Implementing lock free pool is not impossible but bit difficult because you have to handle multiple producer & consumer.

I will implement hybrid pool which will use lock on the producer side & non blocking technique on the consumer side.

Lets have look some numbers

I performed same test with new implementation (FastPool) and it is almost 30% faster than ArayBlocking queue.

30% improvement is not bad, it can definitely help is meeting latency goal.

What makes Fast Pool fast!

I used couple of technique to make it fast

Producer are lock based - Multiple producer are managed using locks, this is same as Array Blocking queue, so nothing great about this.

Immediate publication of released item - it publishes element before lock is released using cheap memory barrier. This gives some gain

Consumer are non blocking - CAS is used to achieve this, consumer are never blocked due to producer. Array Blocking queue blocks consumer because it use same lock for producer & consumer

Thread Local to maintain value locality - Thread Local is used to acquire last value that was used, this reduces contention to great extent.

If you are interested in having look at code then it is available @ FastObjectPool.java

Resources
java-object-tutorial

Wednesday, 16 January 2013

Java Performance MindMap

Found interesting thing about java performance

Reference - http://igormaznitsa.com/mindmaps/JavaPerformanceMindMap.png

Friday, 11 January 2013

Java Reflection Facts

Java has wonderful feature that allow to inspect any object at run time and extract useful information about it for e.g constructor, methods, fields etc.

Reflection provides flexibility at run time to make decision without writing IF/ELSE, SWITH-CASE, it is used extensively in java and there are lot of framework developed based on it for eg ORM, Rule engine etc

We have to trade off performance for the cost of flexibility that reflections give to us specially if you execute any method via reflection, lets try to measure cost and see what can be done to improve it.

Simple Value Object
Take a simple class Order which has 2 property ( orderId,orderType)

public class Order {
private int orderId;
private String orderType;

public int getOrderId() {
return orderId;
}

public void setOrderId(int orderId) {
this.orderId = orderId;
}

public void setOrderType(String orderType) {
this.orderType = orderType;
}
public String getOrderType() {
return orderType;
}
}

Method Invocation

getOrderId function is called 10 Million time in loop and reflection is around 17X times slow for this test
That is very high cost paid for flexibility!

So what can be done, JDK 1.7 added more feature related to reflection. It is suppose to make life easy for developer by moving lot of plumbing code to java API. MethodHandles class is supposed to do all the magic related to reflection use case, Lets try to measure it for our simple case.

Wow it is very very slow 106X times slow compared reflection, i don't want to compare with with normal method call. MethodHandles gives good abstraction for reflection, but on performance side it is very slow, so i am sure you want to re think before you use it.

What are the other options
What are the options to make it fast
- Go back to normal method call by writing big IF-ELSEIF-ELSE for each method
- Use some native call to perform reflection.

First option looks such a naive thing, you don't want to write all that and if your class has lot of function then it become nightmare to maintain it. There are some byte code manipulation API that we can use to generate such class. There are couple of options ASM, javassist for that.

May java developer don't want to use second option because you want to stay away from native code due to portability issue, but it is worth trying it to see what type of performance we get. java has Unsafe class which is used internally by java for many things, i will also try to use this to find alternate ways of reflection.

Below is the chart for call of getOrderType 10 Million times

Wow this is some thing, using compiled class ( via javaassit) & Unsafe we can come close to normal method call. I will remove JDK 7 from graph, so that we have some proper scale in graph.

Now it is better, Unsafe is almost same as Normal method. Compiled is better than reflection, it amount 1.8X times faster than reflection.

There are some options to improve performance of "Compiled" class, the current implementation has below code

if("getOrderType".equals(s))

return (String)((com.atomic.RefTest.Order)obj).getOrderType();

if("getOrderId".equals(s))

return Integer.valueOf(((com.atomic.RefTest.Order)obj).getOrderId());

There are if condition for making decision which method to call, if we can remove that we might get some performance improvement, so now it will look like

((com.atomic.ReflectionTest.Order)obj).getOrderType();

Below is graph with that change - CompiledSingleProp is one with no IF condition

Performance of CompiledSingleProp is similar to Unsafe and close to Normal function call.

So we can definitely use "Option 1" for reflection without significant performance impact on application.

Performance compare Object vs primitive property

While doing this test i notice performance of primitive property is slow as compared to object property and reason is because we have to create Wrapper class to represent primitive type and it adds overhead.

Below is graph for comparing primitive vs object property

Compiled class performs better in case of object and it is slow for primitive type(int/float/long), but Unsafe performance has not effect on type of property, so based on use case proper alternate solution can be used to get best performance.

Conclusion

There are many alternate of reflection available, this post talks about few. Compiled class & Unsafe are the most efficient way to perform property get/set calls. These options should be consider if your application is using reflections heavily. Lot of professional tools uses's compiled property.

About Sample Code

Code available @ github

For unsafe i am not generating any code, it can easily done using logic similar to PropertyGenerator

Java Reflection Example