Showing posts with label Low Memory. Show all posts

Thursday, 24 December 2020

Ordered data structure using B+Tree and Skip List

This post is based on the RUM Conjecture paper for data systems. The paper talks about the read, update & memory overheads of data structures and gives some nice examples of how balancing 2 of the factors leaves the 3rd one in bad shape.

 
Source : http://daslab.seas.harvard.edu/rum-conjecture/



Some of the popular read-optimized ordered key-value data structures are the BST, Skip List, Prefix Tree, B+ Tree, etc.

To make reads fast, they trade off some space.

In this post I will share a space (i.e. memory) efficient data structure that takes ideas from the B+ Tree and Skip List to create a new ordered data structure.

Quick recap before we build our own data structure. 

B+ Tree

B+ Tree is a variant of the BST where every node can contain n children. The tree contains 2 types of nodes: root & internal nodes, and leaf (i.e. external) nodes.

Leaf nodes contain all the entries. Root and internal nodes contain just keys and pointers to leaf nodes.

Every node has some capacity, and when it gets full a split process is used to create another node that takes half of the values from the original node. This splitting helps keep allocation under control and also reduces rebalancing overhead compared to a BST.


B+Tree. Source:Wikipedia

This is a very popular data structure for databases; all relational databases and many non-relational databases provide indexes that are backed by a B+ Tree.

Its popularity has only increased over time.

Now some databases are built using Log Structured Merge trees, which solve some of the issues with B+ Trees.

One of the trade-offs of the B+ Tree is that keys are stored in multiple places (leaf & internal nodes), and node splits can cause cascading effects.

Every value is wrapped in some Node container, which adds an extra 32 bits of overhead per value and also an extra indirection to get to the value.

Extra indirection does not work well with modern CPUs that have big cache lines.

Skip List

Skip list takes ideas from the ordered linked list; to make reads and writes efficient it adds upper layers (or lanes) of sorted linked lists that contain fewer values and act as fast lanes to get to a value.

 

Skip List. Source:Wikipedia

This type of structure helps with concurrency because installing a new value can be done with a lock-free operation like CAS.

One of the big trade-offs in skip lists is that many keys are duplicated, and the level for a key is decided randomly, so the skip levels can come out very different each time even though the input is the same.

Like the B+ Tree, it also has memory overhead due to the extra container objects required to hold the values.

Let's look at a new data structure that takes some ideas from these 2 data structures to reduce memory overhead while still maintaining performance.

B+Skip List

I don't have a good name for this DS, so let's call it B+SkipList.

I will take a few ideas:

- Ordered memory block

B+ Tree allocates nodes that can hold n values in an array. These values are kept in order using simple array manipulation. This gives the nice property that related values are close together and have a good chance of being in the same cache line.

- Split node when full

B+ Tree splits a full node into 2 to keep the tree balanced.

- Skip Lanes

Although SkipList is just an ordered linked list, the skip-lane idea makes the linked list behave like an array: lower-level nodes can be reached quickly via the skip lanes.


Let's create something !

An ordered memory block is straightforward to implement. It will look something like below.

Ordered Array



Code snippet for the ordered collection. I will use int values for simplicity, but the same applies to any custom object.

public OrderedIntCollection(int maxElement) {
    this.maxElement = maxElement;
    this.members = new int[1 + maxElement];
    this.members[0] = Integer.MAX_VALUE;
    this.length = 0;
}

public void insert(int value) {
    this.min = Math.min(min, value);
    this.max = Math.max(max, value);

    int offSet = offSet(value, members, length);

    if (members[offSet] == value) {
        return; //This will skip copy overhead
    }

    shiftFrom(offSet);
    members[offSet] = value;
    length++;
}


Our first building block is ready; let's move on to the split logic now.

- Capacity check

This is required to decide whether a split is needed.

- Split logic

This will do the actual split and return a new node.

- Additional metadata

Now that nodes can be split, we need to track some metadata like min & max values.

The split process might look like below.



Split is one way to handle a full node and it has some advantages, but other options are also available, like extending the current node; that can be a useful strategy in some cases to reduce memory fragmentation. We will talk about this later.


Code snippet for split. It is simple array manipulation that makes sure each node stays independent and a value lives in only 1 node.


public boolean hasCapacity(int count) {
    return length + count < members.length;
}

public OrderedIntCollection split() {
    int half = members.length / 2;

    //Split and allocate new
    int[] copy = new int[1 + maxElement];
    int newLength = this.length - half;
    System.arraycopy(members, half, copy, 0, newLength);
    copy[newLength] = Integer.MAX_VALUE;
    OrderedIntCollection collection = new OrderedIntCollection(maxElement, copy, newLength);

    //Update current
    this.members[half] = Integer.MAX_VALUE;
    this.length = half;
    this.max = members[half - 1];
    Arrays.fill(members, half + 1, members.length, 0);

    return collection;
}


Let's move to the final piece: skip lanes.

With the split process we can now have multiple nodes, so we have to keep track of all the nodes and maintain their order.

A few things are required for the lane data structure:

- A dynamic list that allows adding elements at any position.

- Ordering info for comparing nodes, to identify which node comes first.

- Some way of finding which node will contain a value.

- An efficient algorithm to search values across all the available nodes.


Let's enhance OrderedIntCollection to add the ordering info.

public int max() {
    return max;
}

public int min() {
    return min;
}

@Override
public int compareTo(Integer value) {
    if (value < min) return 1;
    else if (value > max) return -1;
    return 0;
}

The min/max and compareTo functions provide all the ordering info.

The compareTo function will later be used to find which node can contain a value.

With the help of compareTo we can now store all the nodes in some array-based data structure.

At a high level it might look something like below.

B+ Skip List




This layout has some nice properties:

- Keys are not duplicated.

This is a big gain and helps in reducing memory requirements. The only additional memory allocation is for the list that maintains references to the nodes.

- Child nodes are just at the 2nd level.

Child nodes can always be reached in just 2 hops, compared to n (tree depth) in a B+ Tree or SkipList.

- Fanout of a node is good.

It gets the same fanout as a B+ Tree without the overhead of additional internal nodes, and fanout plays an important role in reads/writes.

- Data locality within a single node.

Some of the discussion after the RUM paper was published was about adding Cache as another dimension along which to look at data structures, making it CRUM.

Packing everything together helps with the "C" part, as it provides better cache locality; this is one of the big gains because it can benefit from the processor's cache lines.

Since node references are stored in a list-based structure and are in order, binary search can be used to identify the nodes that are the target of a read or write.

Reads/writes can be done in O(log N) time. Once a node is identified, a sequential scan locates the value within it, which adds some cost, but binary search can also be used inside the node to reduce that overhead.

Trade-offs

Some of the trade-offs:

Binary search can't always be used for writing.

Read requests can always be served by binary search, but for writes we need a hybrid approach: for example, in the above sample, if a value is inserted that does not fall in any node's range, then a sequential search is required to find the right node. Some sample values that would trigger a sequential search are 1 (i.e. below the min), 300 (i.e. above the max), and 105 (i.e. between nodes 2 and 3).

Fragmentation

Due to the splitting rule we might see many nodes that are half filled, but that issue can be solved by relaxing the rule, e.g. allowing a choice between split or extend based on data distribution.

Nodes can also be compacted by merging content.

Code snippet of Skip Lane.
public class OrderedInts {
    private final List<OrderedIntCollection> values = new ArrayList<>();
    private final int blockSize;
    private final AtomicLong bsHit = new AtomicLong();
    private final AtomicLong seqHit = new AtomicLong();

    public OrderedInts(int blockSize) {
        this.blockSize = blockSize;
    }

    public void insert(int value) {
        OrderedIntCollection e = search(value);
        e.insert(value);
    }

    private OrderedIntCollection search(int value) {
        if (values.isEmpty()) {
            OrderedIntCollection last = new OrderedIntCollection(blockSize);
            values.add(last);
            return last;
        } else {
            int position = binarySearch(values, value);
            if (position >= 0) {
                bsHit.incrementAndGet();
                return splitIfRequired(value, position, values.get(position));
            } else {
                seqHit.incrementAndGet();
                return sequentialSearch(value);
            }
        }
    }
}
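The binarySearch, splitIfRequired and sequentialSearch helpers are not shown in the snippet. A sketch of what binarySearch might look like, built on the compareTo contract defined earlier (a node returns 0 when the value falls inside its [min, max] range; helper name and signature are assumptions):

```java
import java.util.List;

/** Hypothetical helper for OrderedInts; a sketch, not the original code. */
public class NodeSearch {

    // Returns the position of the node whose [min, max] range contains value,
    // or -1 when the value falls outside every node's range (hybrid/sequential path).
    public static int binarySearch(List<? extends Comparable<Integer>> nodes, int value) {
        int low = 0, high = nodes.size() - 1;
        while (low <= high) {
            int mid = (low + high) >>> 1;
            int cmp = nodes.get(mid).compareTo(value);
            if (cmp < 0) low = mid + 1;       // node's range is below the value
            else if (cmp > 0) high = mid - 1; // node's range is above the value
            else return mid;                  // value falls inside this node
        }
        return -1;
    }
}
```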

Conclusion

In computer science we only have trade-offs, and nothing comes for free.
The RUM way of looking at data structures provides key insights into the trade-offs and also gives some ideas about possible new data structures.
 

Wednesday, 16 December 2020

Sparse Matrix

Vectors are a very powerful data structure that allows writing/reading data using an index. Vectors can be combined to create a matrix. Matrices are used in numerical analysis.

One of the problems with matrices is that most real-world matrices are sparse, so the traditional way of allocating a 2-dimensional array to represent the matrix can waste lots of space and also needs lots of upfront memory.

In this post I will share some ways of storing a sparse matrix.

The examples use int values, but these techniques can be used for any data type.

Dense Matrix

This is the most straightforward and default way. It is also the most efficient for read and write access, but it needs the full matrix to be allocated before use.




public class IntDenseVectorValue {

    final int[][] values;

    public IntDenseVectorValue(int xSize, int ySize) {
        this.values = new int[xSize][ySize];
    }

    public void set(int x, int y, int value) {
        values[x][y] = value;
    }

    public int get(int x, int y) {
        return values[x][y];
    }
}

Many real world data sets are sparse, so lots of space gets wasted in such situations. 

Block Dense Matrix

This technique is based on creating small blocks of width W, where each block is a small 2-dimensional array.

During set/get, the X coordinate is used to identify the block number.



This approach avoids upfront allocation and is a very good fit for incremental data loads. An indirection layer is added: with the help of a Map, the actual block is located. Performance-wise this is the same as the dense vector, but with less memory overhead.

Code snippet for block dense matrix.

public class IntDenseBlockVectorValue {
    final int blockSize;
    final int width;
    final int length;
    final Map<Integer, int[][]> values = new HashMap<>();

    public IntDenseBlockVectorValue(int x, int y, int blockSize) {
        this.length = x;
        this.width = y;
        this.blockSize = blockSize;
    }

    public void set(int x, int y, int value) {
        check(x, y);
        int blockIndex = x / blockSize;
        int[][] block = values.computeIfAbsent(blockIndex, index -> new int[blockSize][width]);
        int rowIndex = x % blockSize;
        block[rowIndex][y] = value;
    }

    public int get(int x, int y) {
        check(x, y);
        int blockIndex = x / blockSize;
        int[][] block = values.get(blockIndex);
        int rowIndex = x % blockSize;
        return block == null ? 0 : block[rowIndex][y]; // unset blocks read as 0
    }

    private void check(int x, int y) {
        if (x >= length || y >= width) {
            throw new ArrayIndexOutOfBoundsException(String.format("Index [%s,%s] does not exist", x, y));
        }
    }
}

Sparse Row Matrix

This technique takes block allocation to the next level by allocating at the row level.

The map contains an entry for every row, and a single-dimension vector is used as the value.


This helps reduce wastage to a great extent. If a row is never touched, it never gets allocated.

Code snippet


public class IntSparseRowVectorValue {
    private final Map<Integer, int[]> values = new HashMap<>();
    private final int width;

    public IntSparseRowVectorValue(int width) {
        this.width = width;
    }

    public void set(int x, int y, int value) {
        int[] row = values.computeIfAbsent(x, index -> new int[width]);
        row[y] = value;
    }

    public int get(int x, int y) {
        int[] row = values.get(x);
        return row == null ? 0 : row[y]; // missing rows read as 0
    }
}


Sparse Row/Column Matrix

This decomposes the row vector into individual columns and gives the ultimate memory gain, with some trade-off in read/write access.

This is modeled as a Map of Maps.



Code snippet

public class SparseRowColVectorValue {

    private final Map<Integer, Map<Integer, Integer>> values = new HashMap<>();

    public void set(int x, int y, int value) {
        Map<Integer, Integer> row = values.computeIfAbsent(x, index -> new HashMap<>());
        row.put(y, value);
    }

    public int get(int x, int y) {
        Map<Integer, Integer> row = values.get(x);
        return row == null ? 0 : row.getOrDefault(y, 0); // missing cells read as 0
    }
}

One thing to note in this approach is that the default JDK Map needs a primitive-type wrapper to store values, which can result in some performance overhead due to boxing/unboxing. This issue can be solved by using a primitive map.
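Libraries such as Eclipse Collections, fastutil, or Trove ship ready-made primitive maps. To show the idea, here is a minimal open-addressing int-to-int map sketch (all names are illustrative, not a library API); keys and values live in plain int arrays, so no boxing happens:

```java
import java.util.Arrays;

/** Minimal open-addressing int->int map sketch; Integer.MIN_VALUE is assumed unused as a key. */
public class IntIntMap {
    private static final int EMPTY = Integer.MIN_VALUE; // sentinel for a free slot
    private int[] keys;
    private int[] vals;
    private int size;

    public IntIntMap(int capacity) {
        int cap = Integer.highestOneBit(Math.max(4, capacity - 1)) << 1; // next power of 2
        keys = new int[cap];
        vals = new int[cap];
        Arrays.fill(keys, EMPTY);
    }

    public void put(int key, int value) {
        if (size * 2 >= keys.length) resize(); // keep load factor under 0.5
        int slot = slot(key);
        if (keys[slot] == EMPTY) size++;
        keys[slot] = key;
        vals[slot] = value;
    }

    // Returns 0 when the key is absent, matching sparse-cell semantics.
    public int get(int key) {
        int slot = slot(key);
        return keys[slot] == key ? vals[slot] : 0;
    }

    private int slot(int key) {
        int mask = keys.length - 1;
        int i = (key * 0x9E3779B9) & mask; // cheap hash + power-of-2 mask
        while (keys[i] != EMPTY && keys[i] != key) i = (i + 1) & mask; // linear probing
        return i;
    }

    private void resize() {
        int[] oldKeys = keys, oldVals = vals;
        keys = new int[oldKeys.length * 2];
        vals = new int[oldKeys.length * 2];
        Arrays.fill(keys, EMPTY);
        size = 0;
        for (int i = 0; i < oldKeys.length; i++)
            if (oldKeys[i] != EMPTY) put(oldKeys[i], oldVals[i]);
    }
}
```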

 

Conclusion

Each sparse vector representation is special-purpose and should be chosen based on the use case. All of these types can also be mixed, with an abstraction on top that dynamically picks the right matrix implementation based on the underlying data.



 







Tuesday, 23 June 2020

Bit fiddling every programmer should know

Bit fiddling looks like magic: it allows doing so many things in a very efficient way.
In this post I will share some real-world examples where bit operations can be used to gain good performance.

Bit-wise operation bootcamp
Bit operators include:
 - AND (&)
 - OR (|)
 - NOT (~)
 - XOR (^)
 - Shifts (<<, >>)

Wikipedia has a good high-level overview of Bitwise_operation. While preparing for this post I wrote learning tests, available in the learningtest github project. Learning tests are a good way to explore anything before you deep dive. I plan to write a detailed post on learning tests later.

In these examples I will use the below bit tricks as building blocks for solving more complex problems.
  • countBits - count the number of 1 bits in a binary value
  • bitParity - check bit added to a binary code
  • set/clear/toggle - manipulate a single bit
  • pow2 - find the next power of 2 and use it as a mask

Code for these functions is available @ Bits.java on github and unit tests are available @ BitsTest.java

Let's look at some real-world problems now.

Customer daily active tracking
 E-commerce companies keep important metrics like the days on which a customer was active or did some business. Such metrics become very important for building models that can improve customer engagement, and are also useful for fraud- or risk-related use cases.
Investment banks use similar metrics on stocks/currencies for building trading models, etc.

Using simple bit manipulation tricks, 30 days of data can be packed in only 4 bytes, so storing a whole year of info requires only 48 bytes.

Code snippet
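The original snippet image is not available; a sketch of the idea, one int per month where bit i marks day i as active (class and method names are illustrative):

```java
/** Daily-activity bitmap sketch: 12 ints cover a year, 4 bytes per month. */
public class ActivityTracker {
    private final int[] months = new int[12]; // bit i of months[m] = day i of month m

    public void markActive(int month, int day) {
        months[month] |= 1 << day; // set the day's bit
    }

    public boolean wasActive(int month, int day) {
        return (months[month] & (1 << day)) != 0; // test the day's bit
    }

    public int activeDays(int month) {
        return Integer.bitCount(months[month]); // the countBits building block
    }
}
```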


Apart from compact storage, this pattern has good data locality because the whole thing can be read by the processor using a single load operation.

Transmission errors
This is another area where bit manipulation shines. Say you are building distributed storage block management software or some file transfer service; one of the things required for such a service is to make sure the transfer was done properly and no data was lost during transmission. This can be done using the bit parity technique (odd or even), which involves keeping the number of '1' bits odd or even.
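A sketch of the even-parity scheme (illustrative names): the sender transmits a parity bit along with the value, and the receiver recomputes it to validate the transfer.

```java
/** Even-parity sketch: the parity bit makes the total count of 1-bits even. */
public class Parity {

    // 1 when value has an odd number of 1-bits, 0 otherwise.
    public static int parityBit(int value) {
        return Integer.bitCount(value) & 1;
    }

    // Receiver-side check; any single flipped bit during transfer breaks it.
    public static boolean isValid(int value, int parity) {
        return parityBit(value) == parity;
    }
}
```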


Another way to do this type of verification is Hamming_distance. Code snippet for the Hamming distance of integer values:
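A sketch of Hamming distance for int values: XOR keeps only the differing bits, and countBits does the rest.

```java
/** Hamming distance between two ints: number of bit positions where they differ. */
public class Hamming {
    public static int distance(int x, int y) {
        return Integer.bitCount(x ^ y); // XOR leaves 1s only where the bits differ
    }
}
```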



A very useful way to keep data integrity with no extra overhead.
Locks
Let's get into concurrency now. Locks are generally not good for performance, but sometimes we have to use them. Many lock implementations are very heavyweight and also hard to share between programs. In this example we will build a memory-efficient lock: 32 locks can be managed using a single integer.

Code snippet
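The original snippet image is not available; a sketch of the 32-in-1 lock (illustrative names): each bit of an AtomicInteger is one lock, and CAS installs or clears a bit atomically.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** 32 locks packed into one AtomicInteger; each bit is a lock. A sketch, not a production lock. */
public class BitLock {
    private final AtomicInteger locks = new AtomicInteger();

    public boolean tryLock(int index) {
        int mask = 1 << index;
        for (;;) {
            int current = locks.get();
            if ((current & mask) != 0) return false; // this bit is already held
            if (locks.compareAndSet(current, current | mask)) return true; // CAS install
        }
    }

    public void unlock(int index) {
        int mask = ~(1 << index);
        for (;;) {
            int current = locks.get();
            if (locks.compareAndSet(current, current & mask)) return; // clear the bit
        }
    }
}
```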

This example uses the single-bit-setting trick along with AtomicInteger to make the code thread-safe.
This is a very lightweight lock. As this example is related to concurrency, it will have some issues due to false sharing; it is possible to address that by using some of the techniques mentioned in the scalable-counters-for-multi-core post.

Fault tolerant disk
Let's get into some serious stuff. Assume we have 2 disks and we want to keep a copy of the data so that we can restore it in case one of the disks fails. The naive way is to keep a backup copy of every disk, so 1 TB of data needs an additional 1 TB. Cloud providers like Amazon will be very happy if you use such an approach.
Just by using the XOR (^) operator we can keep the backup for a pair of disks on a single disk: a 50% gain.
50% saving on storage expense.

Code snippet testing restore logic.
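The restore logic can be sketched like this (illustrative names; the actual code is in RaidDisk.java): the parity disk stores d1 ^ d2, and XOR-ing the surviving disk with parity rebuilds the lost one.

```java
/** XOR parity sketch: parity = d1 ^ d2, so d1 ^ parity = d2 and d2 ^ parity = d1. */
public class XorBackup {

    public static int[] parity(int[] disk1, int[] disk2) {
        int[] p = new int[disk1.length];
        for (int i = 0; i < p.length; i++) p[i] = disk1[i] ^ disk2[i];
        return p;
    }

    // Rebuilds the lost disk from the surviving disk and the parity disk.
    public static int[] restore(int[] survivor, int[] parity) {
        int[] lost = new int[survivor.length];
        for (int i = 0; i < lost.length; i++) lost[i] = survivor[i] ^ parity[i];
        return lost;
    }
}
```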

Disk code is available @ RaidDisk.java

Ring buffer
A ring buffer is a very popular data structure for async processing and for buffering events before writing to a slow device. A ring buffer is a bounded buffer, which helps in having a zero-allocation buffer in the critical execution path; a very good fit for low-latency programming.
One of the common operations is finding a slot in the buffer for write/read, and it is done using the mod (%) operator. Mod or divide operators are not good for performance because they stall execution: a CPU has only 1 or 2 ports for processing divides, but it has many ports for bit-wise operations.

In this example we will use a bit-wise operator to find the mod, which is only possible if the modulus is a power of 2. I think this is a trick everyone should know.

x & (n - 1)

If n is a power of 2 then 'x & (n - 1)' can be used to compute x mod n in a single instruction. This is so popular that it is used in many places; the JDK HashMap also uses it to find the slot in the map.
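A sketch of the trick applied to a ring buffer slot calculation (illustrative names):

```java
/** Power-of-2 ring buffer index sketch: a mask replaces the % operator. */
public class RingIndex {
    private final int mask;
    private long sequence;

    public RingIndex(int capacity) {
        if (Integer.bitCount(capacity) != 1)
            throw new IllegalArgumentException("capacity must be a power of 2");
        this.mask = capacity - 1;
    }

    public int next() {
        return (int) (sequence++ & mask); // same as sequence % capacity, single AND
    }
}
```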



Conclusion
I have shared, at a very high level, what is possible with simple bit manipulation techniques.
Bit manipulation enables many innovative ways of solving problems. It is always good to have extra tools in the programmer's kit, and many of these tricks are timeless, applicable in every programming language.

All the code used in post is available @ bits repo.

Thursday, 7 May 2015

Experiment with String Split

Splitting a string on some token is a very common operation in applications; in this blog I will share some of the options that are commonly used and the types of overhead involved.

String.split
This is the most common approach, and it looks harmless until you look at the code!
First it creates an ArrayList for storing the values, and then this ArrayList is converted to a String array.

This function produces too much garbage, and it provides only one way to extract the values.

String Tokenizer 
This is much better than String.split because it does not need an intermediate buffer like an ArrayList/String array to hold the values, but it still creates a String object for each token, which adds to the garbage.

One of the good things about StringTokenizer is that it does not force the caller into a fixed data structure; the caller is free to decide what to do with the values, just like a streaming operation.

String.split vs StringTokenizer
Let's look at some performance numbers. In this test I use the below sample strings:

String[] values = {
        "this,is,simple,test",
        "lets,see,how,it,works",
        "this,is,very,simple,test"};

Each line is split 10 million times.

So StringTokenizer is definitely the winner in this case, because it produces less garbage than String.split, but it still produces String objects.

It is possible to avoid creating those String objects as well, by using a recycled CharSequence, which is just like String but gives lots of flexibility.

Let's look at another implementation using the recycled-CharSequence approach.
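The original snippet image is not available; a sketch of the streaming idea (illustrative names): instead of materializing String tokens, the splitter hands the caller the backing sequence plus each token's start/end offsets, so nothing is allocated per token.

```java
/** Streaming split sketch: walks the string once and reports (start, end) token ranges. */
public class RecyclingSplitter {

    public interface TokenConsumer {
        void accept(CharSequence source, int start, int end); // token = source[start, end)
    }

    public static void split(String source, char separator, TokenConsumer consumer) {
        int start = 0;
        for (int i = 0; i < source.length(); i++) {
            if (source.charAt(i) == separator) {
                consumer.accept(source, start, i);
                start = i + 1;
            }
        }
        consumer.accept(source, start, source.length()); // the last token
    }
}
```

The caller can read the token's chars via charAt on the source, so no String or array is created unless the caller asks for one.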


It is a very simple technique to avoid intermediate String objects; let's look at the performance numbers with this approach.

The recycled CharSequence shines in this test: it is around 3X faster than String.split and 2X faster than StringTokenizer.

Code available @ github

Wednesday, 6 August 2014

Compact String List

Whenever an application's memory profile is analyzed, String is one of the most common objects that comes right on top.

Java has a String pool that solves the problem to some extent, and a lot of interesting optimization has been done in the String pool for JDK 6/7/8.

What the String pool provides can be easily implemented with a WeakHashMap or ConcurrentHashMap, but the JVM implementation is very good, so there is no point in reinventing it. One of the overheads associated with a String object is the header of the char array; each array has a basic header cost plus an extra 4 bytes for the array length.

On 32-bit, for a char array: 8 (header) + 4 (length) = 12 bytes
On 64-bit, for a char array: 12 (header) + 4 (length) = 16 bytes

For each String value, 12 to 16 bytes are wasted.
A quick memory optimization is to allocate one big array and store the values of multiple strings in that array.

Just to visualize how it will look:


With the above approach we save the array header cost, but another overhead is introduced: we need another set of variables to know which part of the array belongs to value1 or value2.
An int array can be used to maintain the index of each value in the big char array, so we save 12 bytes per string, and that is a significant saving when you have lots of strings.

In this blog I will share an experiment with such an approach.
Let's get into the code.

Storage
The first thing is storage of multiple string values in a single char array.



Pretty straightforward code: two arrays are required, one char[] and one int[].
The add function expands the char & int arrays if required and adds values to them.
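The storage snippet image is not available; a minimal sketch of the structure described above, one shared char[] plus an int[] of end offsets (illustrative names):

```java
import java.util.Arrays;

/** Compact storage sketch: all values share one char[]; offsets[i] marks where value i ends. */
public class CompactStringList {
    private char[] chars = new char[16];
    private int[] offsets = new int[4];
    private int size;
    private int charLength;

    public void add(CharSequence value) {
        ensureCapacity(value.length());
        for (int i = 0; i < value.length(); i++) chars[charLength++] = value.charAt(i);
        offsets[size++] = charLength; // record where this value ends
    }

    public int size() { return size; }

    // Materializes a value; used here only for verification, iteration can avoid it.
    public String get(int index) {
        int start = index == 0 ? 0 : offsets[index - 1];
        return new String(chars, start, offsets[index] - start);
    }

    private void ensureCapacity(int extra) {
        if (charLength + extra > chars.length)
            chars = Arrays.copyOf(chars, Math.max(chars.length * 2, charLength + extra));
        if (size == offsets.length)
            offsets = Arrays.copyOf(offsets, offsets.length * 2);
    }
}
```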

Iteration
Iteration over elements is another tricky thing that needs handling in such a compact structure; a Trove-style forEach looks like a good fit for this.


Iteration code looks like

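The iteration snippet image is not available; a Trove-style forEach over the char[] + int[] layout described above might look like this (illustrative names). The consumer receives the backing array and each value's range, so iteration allocates nothing:

```java
/** Allocation-free iteration sketch over the shared char[] + end-offset int[] layout. */
public class CompactIteration {

    public interface CharRangeConsumer {
        void accept(char[] chars, int start, int end); // value = chars[start, end)
    }

    public static void forEach(char[] chars, int[] offsets, int size, CharRangeConsumer consumer) {
        int start = 0;
        for (int i = 0; i < size; i++) {
            consumer.accept(chars, start, offsets[i]);
            start = offsets[i]; // next value begins where this one ended
        }
    }
}
```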

Memory Usage
The compact list trades off add speed for memory/search; let's have a look at the memory gain.
In this test, the text from the ALICE'S ADVENTURES IN WONDERLAND url is split by space and added to an ArrayList<String> and a CompactStringList.

"Alice Adventure" book has 32.5K words.

For the memory test I used Heinz Kabutz's Determining Memory Usage in Java approach; it gave me consistent output, so I stuck with it for this test.

The ArrayList takes around 1755 KB; the CompactList takes around 355 KB.
So for this particular example the CompactList takes around 80% less memory; the gain is very significant.



Detail memory usage
Lets have look at detail memory usage. I used jmap to get top contributor for this test.

This gives a better understanding of the gain.
The char[] in the compact list takes around 60% less memory, and the String objects are down to almost nothing, with a minor overhead for the int[].

So it seems good trade off for memory!

What's next
- One usage is building a String pool using the compact list.
- CompactList is an append-only structure; any change to an existing element like delete/update will require rebuilding the CompactList.
- Iteration in the traditional style will result in garbage creation because it has to build a CharSequence, but that can be overcome by using the forEach approach that gives access to the chars of an element.

Code is available @ github