This post is based on rum-conjecture paper for data systems. This paper talks about read, update & memory overhead of data structure and gives some nice examples on how balancing 2 factors leaves 3rd one in bad shape.
Source : http://daslab.seas.harvard.edu/rum-conjecture/ |
Some of the popular read optimized ordered Key-value data structure are BST , Skip List , Prefix Tree , B+ tree etc.
To make read fast it does some trade off on space.
In this post I will share space(i.e memory) efficient data structure that takes ideas from B+Tree and Skip List to create a new ordered data structure.
Quick recap before we build our own data structure.
B+ Tree
B+Tree is variant BST, every node contains n children. Tree contains 2 types of nodes root, internal & leaves node(i.e external).
Leaves node contains all the entries. Root and internal nodes will contain just keys and pointers to the leaf node.
Every node has some capacity and when it gets full then split process is used to create another node that will have half of the values from the original node. This splitting helps in keeping allocation in control and also reduces rebalancing overhead as compared to BST.
B+Tree. Source:Wikipedia |
This is a very popular data structure for databases, all relational databases and many non relational databases provide indexes that are backed up by B+Tree.
Its popularity has only increased over time.
Now some databases are built using Log structured merge tree that solves some of the issues with B+Tree.
One of the trade off with B+Tree is that keys are stored at multiple places(leaf & internal nodes) and splits of nodes can cause cascade effects.
Every value is wrapped up in some Node container and that causes extra 32 bit overhead per value and also extra indirection to get to value.
Extra indirection does not work well with current CPU that have big cache lines.
Skip List
Skip list takes ideas from the ordered link list and to make read and write efficient it adds upper layer or lane of sorted link list that has fewer values and acts like a fast lane to get to value.
Skip List. Source:Wikipedia |
This type of structure helps with concurrency because installing new value can be done with lock free operation like CAS.
One of the big trade-offs in skip lists is many keys are duplicated and level for key is decided randomly due to which every time skip levels could be very different although input is the same.
This also has other memory overhead due to extra container objects required to hold the values like B+Tree.
Let's look at new data structures that can take some ideas from these 2 data structures and reduce memory overhead and still maintain performance.
B+Skip List
I don't have any good name for this DS, so let's call it B+SkipList.
I will take few ideas
- Ordered memory block
B+Tree allocate nodes that can hold n values in an array. These values are stored in order by using simple array manipulation. This gives the nice property that related values are close together and have good chances to be in the same cache line.
- Split node when full
- Skip Lanes
Although SkipList is just an ordered link list but skip lane idea makes linked list like array, lower level nodes can be accessed quickly by skip lanes.
Let's create something !
Ordered memory block is straightforward to implement. It will look something like below
Ordered Array |
public OrderedIntCollection(int maxElement) {
this.maxElement = maxElement;
this.members = new int[1 + maxElement];
this.members[0] = Integer.MAX_VALUE;
this.length = 0;
}
public void insert(int value) {
this.min = Math.min(min, value);
this.max = Math.max(max, value);
int offSet = offSet(value, members, length);
if (members[offSet] == value) {
return; //This will skip copy overhead
}
shiftFrom(offSet);
members[offSet] = value;
length++;
}
Our first building block is ready, let's move to split logic now.
- Capacity check.
This is required to decide if split is required or not.
- Split logic.
This will do an actual split and return a new node.
- Additional metadata
Now nodes can be split so we need to track some meta data like min, max value.
Split process might look like below
Split is one way to handle when a node is full and it has some advantage but other options are also available like extending current nodes and it could be a useful strategy in some cases to reduce memory fragmentation. We will talk about this later.
Code snippet for split. It is simple array manipulation and makes sure that each node is independent and value can only contain 1 node.
public boolean hasCapacity(int count) {
return length + count < members.length;
}
public OrderedIntCollection split() {
int half = members.length / 2;
//Split and allocate new
int[] copy = new int[1 + maxElement];
int newLength = this.length - half;
System.arraycopy(members, half, copy, 0, newLength);
copy[newLength] = Integer.MAX_VALUE;
OrderedIntCollection collection = new OrderedIntCollection(maxElement, copy, newLength);
//Update current
this.members[half] = Integer.MAX_VALUE;
this.length = half;
this.max = members[half - 1];
Arrays.fill(members, half + 1, members.length, 0);
return collection;
}
Lets move to the final piece that is skip lanes.
With the split process now we can have multiple nodes and we have to keep track of all the nodes and maintain order of nodes.
Few things are required for lane data structure
- Dynamic list that allows you to add elements at any position.
- Ordering info for comparing nodes to identify which node comes first.
- Some way of finding which node will contain value.
- Some efficient algorithm to search values in all the available nodes.
Let's enhance OrderedIntCollection to add ordering info
public int max() {
return max;
}
public int min() {
return min;
}
@Override
public int compareTo(Integer value) {
if (value < min) return 1;
else if (value > max) return -1;
return 0;
}
Min/Max and compareTo function will help in having all the ordering info.
compareTo function will be later used to find which node can contain value.
With help of compareTo now we can store all the nodes in some array based data structure.
At high level it might look at something like below
This layout has some nice property like
- Keys are not duplicated.
- Child node are just at 2nd level
- Fanout of node is good.
- Data locality in single node.
Since nodes reference is stored in List based structure and are in order, so binary search can be used to identify nodes that are target for read or write.
Read/Write can be done in Log N time. Once a node is identified then it is sequential scan to locate value and that can add some cost but it is possible to use binary search within the node also to reduce overhead.
Trade off
Binary search can't be used 100% for writing.
Read request can be always served by binary search but for writing we need a hybrid approach, for eg in above sample if value is inserted that does not fall in any range then sequential search is required to find right node. Some of the sample value that can trigger sequential search are 1 (i.e min value), 300(i.e max value),105 (i.e comes in between node 2 and 3)
Fragmentation
public class OrderedInts {
private final List<OrderedIntCollection> values = new ArrayList<>();
private final int blockSize;
private final AtomicLong bsHit = new AtomicLong();
private final AtomicLong seqHit = new AtomicLong();
public OrderedInts(int blockSize) {
this.blockSize = blockSize;
}
public void insert(int value) {
OrderedIntCollection e = search(value);
e.insert(value);
}
private OrderedIntCollection search(int value) {
if (values.isEmpty()) {
OrderedIntCollection last = new OrderedIntCollection(blockSize);
values.add(last);
return last;
} else {
int position = binarySearch(values, value);
if (position >= 0) {
bsHit.incrementAndGet();
return splitIfRequired(value, position, values.get(position));
} else {
seqHit.incrementAndGet();
return sequentialSearch(value);
}
}
}
}