
Wednesday, 23 September 2015

Collection resize and merge policy

Collections like ArrayList/Vector are extensions of a plain array with no size restriction; these collections have dynamic grow/shrink behavior.

Growing/shrinking an array is an expensive operation, and different types of policy are used to amortize that cost.

In this blog I will share some of the policies that can be used if you are building such a collection.

Growing
To decide the best policy for extending, we need to answer two questions:
 - When to extend the collection
 - By how much

The "when" part is easy to answer: definitely when an element is added. The main trick is in the "how much".
Java collections (i.e. ArrayList) grow by 50% every time, which is a good option because it reduces frequent allocation & array copies; the only problem with this approach is that most of the time the collection is not fully filled.

There are some options to avoid wastage of space:
 - If you know how many elements will be required, then create the collection with that initial size, so that no re-sizing is required.
 - Have control over how much it grows; this is easy to implement by making the growth factor a parameter of the collection.

Below is a code snippet for a growing list.
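A minimal sketch of such a list, assuming the growth factor is passed in at construction; the GrowingList name and API are illustrative, not the exact code from the blog's repository:

```java
import java.util.Arrays;

/** Minimal array-backed list with a configurable growth factor. */
public class GrowingList<E> {
    private Object[] elements;
    private int size;
    private final double growthFactor; // e.g. 1.5 grows by 50%, like ArrayList

    public GrowingList(int initialCapacity, double growthFactor) {
        this.elements = new Object[initialCapacity];
        this.growthFactor = growthFactor;
    }

    public void add(E element) {
        if (size == elements.length) {
            // Grow only when full; the factor controls how much space may be wasted.
            int newCapacity = Math.max(size + 1, (int) (size * growthFactor));
            elements = Arrays.copyOf(elements, newCapacity);
        }
        elements[size++] = element;
    }

    @SuppressWarnings("unchecked")
    public E get(int index) {
        return (E) elements[index];
    }

    public int size() {
        return size;
    }
}
```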


Shrink
A remove operation on the collection is a good hook for deciding when to reduce the size and by how much. Some factor can be used, e.g. 25%: if after a remove the collection is 25% free, then trim the backing array down to the current size.

Java's ArrayList does not shrink automatically when elements are removed, so this should be watched when a collection is reused, because it will never shrink, and in many production systems collections are only half filled. It is nice to have control over shrinking when elements are deleted.

Code snippet for auto shrinking:
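A minimal sketch of the shrink hook, assuming the 25%-free threshold described above; ShrinkingList and its API are illustrative:

```java
import java.util.Arrays;

/** Array-backed list that shrinks when a quarter of its capacity is unused. */
public class ShrinkingList<E> {
    private static final double FREE_THRESHOLD = 0.25; // shrink when 25% is free
    private Object[] elements;
    private int size;

    public ShrinkingList(int initialCapacity) {
        elements = new Object[initialCapacity];
    }

    public void add(E element) {
        if (size == elements.length) {
            elements = Arrays.copyOf(elements, size + (size >> 1) + 1);
        }
        elements[size++] = element;
    }

    @SuppressWarnings("unchecked")
    public E remove(int index) {
        E removed = (E) elements[index];
        System.arraycopy(elements, index + 1, elements, index, size - index - 1);
        elements[--size] = null;
        // Shrink hook: if at least 25% of the capacity is now free,
        // trim the backing array down to the current size.
        if (elements.length - size >= elements.length * FREE_THRESHOLD) {
            elements = Arrays.copyOf(elements, Math.max(size, 1));
        }
        return removed;
    }

    public int size() { return size; }
}
```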



Hybrid Growth/Shrink
The above allocation/de-allocation has a couple of issues which can have a big impact on performance, e.g.:
 - Big array allocation
 - Copying elements

A custom allocation policy can be used to overcome this problem: have one big container that holds multiple slots, with elements stored in the slots.

Elements are always appended to the last pre-allocated slot, and when it has no capacity left a new slot is created and used. This custom allocation has a few nice benefits, e.g.:

 - No big allocation is required
 - Zero copy overhead
 - A really big list (i.e. more than Integer.MAX_VALUE elements) can be created
 - This type of collection is good for JDK8 collectors, because the merge cost is very low

Nothing is free; this comes with a trade-off:
 - Accessing an element by index can be a little slow, because it has to work out which slot holds the item

Code snippet for a slot-based collection:
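A minimal sketch of the slot idea, assuming a fixed power-of-2 slot size so the slot can be found with a shift and a mask; SlottedList and its API are illustrative:

```java
import java.util.ArrayList;
import java.util.List;

/** List built from fixed-size slots: growth allocates a new slot, never copies. */
public class SlottedList<E> {
    private static final int SLOT_BITS = 10;            // 1024 elements per slot
    private static final int SLOT_SIZE = 1 << SLOT_BITS;
    private static final int SLOT_MASK = SLOT_SIZE - 1;

    private final List<Object[]> slots = new ArrayList<>();
    private long size; // long, so the list can exceed Integer.MAX_VALUE elements

    public void add(E element) {
        int offset = (int) (size & SLOT_MASK);
        if (offset == 0) {
            slots.add(new Object[SLOT_SIZE]); // last slot is full: allocate a new one
        }
        slots.get(slots.size() - 1)[offset] = element;
        size++;
    }

    @SuppressWarnings("unchecked")
    public E get(long index) {
        // Index access must first work out which slot holds the element.
        return (E) slots.get((int) (index >>> SLOT_BITS))[(int) (index & SLOT_MASK)];
    }

    public long size() { return size; }
}
```

Merging two such lists, as a JDK8 collector's combiner would, can concatenate their slot lists instead of copying elements; that is why the merge cost stays low.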


Code used in this blog is available @ github

Thursday, 23 July 2015

Efficiency with Algorithms

I recently watched the excellent talk Efficiency with Algorithms, Performance with Data Structures; it has some really good content on performance.

In this blog I will share some ideas from the above talk and a few things that I have learned.

Pre processing
This is a very common trick: a trade-off between the processing required when the actual request comes vs the time taken to pre-compute something. Some good examples are:

IndexOf
This is a very common string operation; almost all applications need this algorithm. Java uses a brute-force algorithm to solve this, but it can become a bottleneck very quickly, so it is better to use the Knuth–Morris–Pratt algorithm, which pre-computes a table to get better performance.
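A sketch of the idea (not Java's built-in indexOf): the pre-computed prefix table records how far the pattern can fall back on a mismatch, so the text is never re-scanned:

```java
/** Knuth–Morris–Pratt substring search: pre-computes a prefix table. */
public class Kmp {
    /** prefix[i] = length of the longest proper prefix of pattern[0..i] that is also a suffix. */
    static int[] prefixTable(String pattern) {
        int[] prefix = new int[pattern.length()];
        int k = 0;
        for (int i = 1; i < pattern.length(); i++) {
            while (k > 0 && pattern.charAt(k) != pattern.charAt(i)) {
                k = prefix[k - 1];
            }
            if (pattern.charAt(k) == pattern.charAt(i)) {
                k++;
            }
            prefix[i] = k;
        }
        return prefix;
    }

    /** Returns the index of the first occurrence of pattern in text, or -1. */
    static int indexOf(String text, String pattern) {
        if (pattern.isEmpty()) return 0;
        int[] prefix = prefixTable(pattern);
        int matched = 0;
        for (int i = 0; i < text.length(); i++) {
            while (matched > 0 && pattern.charAt(matched) != text.charAt(i)) {
                matched = prefix[matched - 1]; // fall back instead of re-scanning text
            }
            if (pattern.charAt(matched) == text.charAt(i)) {
                matched++;
            }
            if (matched == pattern.length()) {
                return i - matched + 1;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(indexOf("abcabcabd", "abcabd")); // prints 3
    }
}
```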

Search
Sequential search vs binary search is a trade-off between search time and pre-processing (i.e. sorting) time; it starts to pay off once enough searches are done that the compares saved (O(n) vs O(log n) per search) outweigh the one-time sorting cost.

Binary search is at the heart of so many algorithms; it reduces expensive search operations.

Many string permutation search problems are solved using it.
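A small example of the trade-off, using the JDK's Arrays.sort and Arrays.binarySearch:

```java
import java.util.Arrays;

public class SearchTradeOff {
    public static void main(String[] args) {
        int[] data = {42, 7, 19, 3, 88, 56, 23};

        // One-time pre-processing cost: O(n log n) sort.
        Arrays.sort(data);

        // Each subsequent lookup is O(log n) instead of O(n);
        // the sort pays for itself once lookups dominate.
        System.out.println(Arrays.binarySearch(data, 56) >= 0); // true
        System.out.println(Arrays.binarySearch(data, 60) >= 0); // false
    }
}
```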

HashMap in Java has a nice optimization for when many keys with the same hash code are added: to ameliorate the impact, when keys are Comparable, the class may use the comparison order among keys to help break ties.

Indexes
Lookup is an expensive operation, and binary search can't answer all types of queries, so you should build a specialized index for fast search. Some of the options are key/value, prefix/suffix, bitmap index, inverted index, etc.

Pre Allocate
This is a classic one: if you know how big your data structure will be, then it is better to pre-allocate it rather than keep expanding it multiple times and incurring the cost of allocation & copy.

Array-based data structures grow capacity by a constant factor every time re-allocation is required, to amortize the allocation cost with big, infrequent allocations; in the Java world ArrayList (which grows by 1.5x) and HashMap (which doubles) are good examples of that.
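A small example of sizing both up front; the expected count and the load-factor arithmetic below are illustrative:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PreAllocate {
    public static void main(String[] args) {
        int expected = 1_000_000;

        // Sized up front: no intermediate grow-and-copy cycles while filling.
        List<Integer> list = new ArrayList<>(expected);

        // HashMap resizes when size exceeds capacity * loadFactor,
        // so size the table for the expected element count.
        Map<Integer, Integer> map = new HashMap<>((int) (expected / 0.75f) + 1);

        for (int i = 0; i < expected; i++) {
            list.add(i);
            map.put(i, i);
        }
        System.out.println(list.size() + " " + map.size());
    }
}
```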

Many design patterns like Object Pooling/Flyweight etc. are based on this.

Reduce Expensive/Duplicate operation
Each algorithm has some expensive operations, and to make it efficient those expensive operations should be reduced. Some common examples are:

HashCode
Hash code computation is an expensive operation, so it should not be computed multiple times for a given key. One way is to compute it once and pass it to other functions, so recomputation is not required; another is to cache it, e.g. the String class in Java caches its hash code to save some time.
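A sketch of the caching pattern for an immutable key class; CachedKey is illustrative, but the lazy-cache trick mirrors what java.lang.String does:

```java
import java.util.Objects;

/** Immutable key that computes its hash code once and caches it,
    the same trick java.lang.String uses. */
public final class CachedKey {
    private final String name;
    private final int id;
    private int hash; // 0 means "not computed yet", like String's cached hash

    public CachedKey(String name, int id) {
        this.name = name;
        this.id = id;
    }

    @Override
    public int hashCode() {
        int h = hash;
        if (h == 0) {
            h = Objects.hash(name, id); // compute once, reuse afterwards
            hash = h;
        }
        return h;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof CachedKey)) return false;
        CachedKey other = (CachedKey) o;
        return id == other.id && name.equals(other.name);
    }
}
```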

Modulus/Divide
These are among the most expensive arithmetic operations; bitwise operations can be used to do the same thing at lower cost.
If a data structure is circular, e.g. a circular array, and you want to put a value in a free slot, then a modulus operation is required. If the capacity of the array is a power of 2, then a bitwise operation (i.e. index & (capacity-1)) can be used to get the modulus value, and this will give a significant performance gain at the cost of extra memory. This technique is used by HashMap in Java.

Similarly a right shift (>>) operation can be used for divide to get better performance, but nowadays compilers are smart, so you get this one for free; no need to write it.
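A small demonstration of the mask trick for a power-of-2 capacity:

```java
public class MaskModulo {
    public static void main(String[] args) {
        int capacity = 16;              // must be a power of 2 for the mask trick
        int mask = capacity - 1;        // 0b1111

        for (int index = 13; index < 20; index++) {
            // index % capacity and index & mask agree for non-negative index,
            // but the bitwise AND avoids the costlier division.
            System.out.println(index % capacity + " == " + (index & mask));
        }
    }
}
```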

Reduce copy overhead
Array-based data structures amortize re-sizing cost by growing capacity by a constant factor. This is good, but the overhead of copying comes along with it. Another approach is a chain of arrays: you still allocate one big chunk, but you don't have to copy the old values, just add the new block of memory to the chain of allocated blocks (the slot-based collection sketched in the previous post works this way).
gs-collections has CompositeFastList, which is built using this approach.

Batching
Expensive operations like writes to a file/network/database etc. should be batched to get the best performance.
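A small example using the JDK's BufferedWriter; the file name is illustrative:

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class BatchedWrite {
    public static void main(String[] args) throws IOException {
        // BufferedWriter batches many small writes into fewer large ones,
        // so each line does not pay the cost of a separate system call.
        try (BufferedWriter out = Files.newBufferedWriter(Paths.get("out.txt"))) {
            for (int i = 0; i < 100_000; i++) {
                out.write("line " + i);
                out.newLine();
            }
        } // close() flushes the final partial batch
    }
}
```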

Co-locate data
Keeping all required data together gives very good performance, because of how the processor works, but this is one thing that is broken in many applications.
Mike Acton talks about this in detail in his Data-Oriented Design and C++ talk.

Column-based storage is very good for analytic/aggregation use cases, because most of the time a single column's data across all rows is required to answer an analytic/aggregation request.
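A sketch of a column layout in Java, using one array per field instead of one object per row; TradeColumns and its fields are illustrative:

```java
/** Column-oriented storage: one array per field, instead of one object per row. */
public class TradeColumns {
    private final long[] timestamps;
    private final double[] prices;
    private final int[] quantities;

    public TradeColumns(int rows) {
        timestamps = new long[rows];
        prices = new double[rows];
        quantities = new int[rows];
    }

    public void set(int row, long timestamp, double price, int quantity) {
        timestamps[row] = timestamp;
        prices[row] = price;
        quantities[row] = quantity;
    }

    /** Aggregation touches only the price and quantity columns: contiguous
        array scans, which are friendly to CPU caches and prefetching. */
    public double totalNotional() {
        double total = 0;
        for (int i = 0; i < prices.length; i++) {
            total += prices[i] * quantities[i];
        }
        return total;
    }
}
```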

Linear-probing hash tables are also a good example of data co-location.

Bit packing is another option to keep required data very close, but it should be used with extra caution because it can make code complex.

Unnecessary Work
Some algorithms suffer from this problem; the most common are recursive algorithms like factorial or Fibonacci etc. Both of these have a duplicate-work problem, which can be fixed by the memoization technique.
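A sketch of memoized Fibonacci; the cache makes each value computed exactly once instead of exponentially many times:

```java
import java.util.HashMap;
import java.util.Map;

public class MemoFib {
    private static final Map<Integer, Long> cache = new HashMap<>();

    /** Naive recursion recomputes fib(n-2) inside every fib(n-1) call;
        memoization computes each value exactly once. */
    static long fib(int n) {
        if (n <= 1) return n;
        Long cached = cache.get(n);
        if (cached != null) return cached;
        long result = fib(n - 1) + fib(n - 2);
        cache.put(n, result);
        return result;
    }

    public static void main(String[] args) {
        System.out.println(fib(50)); // 12586269025, instant with memoization
    }
}
```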


Simplicity & readability of code should not be lost during the efficiency ride, because the next guy has to maintain it :-)