Sunday 13 December 2020

Sorting algorithms under constraints

In the compiler world, code inlining is the most important optimization because it enables other optimizations like dead code elimination, prefetching, out-of-order execution, etc.

Similarly, sorting enables many optimizations like binary search, streaming aggregation, range search operations, prefix compression, run length encoding, delta encoding, trend analysis, posting lists, data partitioning, etc.



Sorting is memory- and compute-intensive work, and many times we don't have enough compute/memory to do it.

In this post I will share 2 sorting algorithms that can be used when a system has memory or CPU constraints.

Disk based sorting

We have a file containing 1 TB of data and we want to sort it. The data is too big to fit in memory, so it is not possible to use a standard in-memory sorting algorithm.

One way to handle sorting of such data is to split it into chunks, sort each chunk in memory, persist the sorted chunk to disk, and finally merge the sorted chunks using a k-way merge algorithm.


At a high level the sort pipeline has two stages: a split stage that chunks, sorts, and spills the data to disk, and a merge stage that combines the sorted chunks into the final output.



A nice thing about this algorithm is that it is embarrassingly parallel.

This algorithm is also a good example of divide and conquer, and the technique can be applied to both stages.

This algorithm has 2 stages in the pipeline that can be executed in parallel to take advantage of multiple cores.

Let's assume the input file contains 10 million records; the split stage decomposes it into smaller chunks, each of which is sorted independently and spilled to disk.
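A minimal sketch of the split stage, assuming line-oriented text input; the class name, chunk size, and temp-file naming are illustrative assumptions:

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class SplitStage {

    // Read the input in fixed-size chunks, sort each chunk in memory
    // and spill it to its own temporary file.
    public static List<Path> split(Path input, int chunkSize) throws IOException {
        List<Path> chunks = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(input)) {
            List<String> chunk = new ArrayList<>(chunkSize);
            String line;
            while ((line = reader.readLine()) != null) {
                chunk.add(line);
                if (chunk.size() == chunkSize) {
                    chunks.add(writeSortedChunk(chunk)); // sort + spill to disk
                    chunk.clear();
                }
            }
            if (!chunk.isEmpty()) {
                chunks.add(writeSortedChunk(chunk)); // last partial chunk
            }
        }
        return chunks;
    }

    private static Path writeSortedChunk(List<String> chunk) throws IOException {
        Collections.sort(chunk); // in-memory sort of a single chunk
        Path tmp = Files.createTempFile("sort-chunk-", ".txt");
        return Files.write(tmp, chunk);
    }
}

Because each chunk is independent, multiple chunks can be sorted and spilled concurrently.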


In the merge stage we have to do the reverse operation: take multiple sorted files and create a single one.




Split & Merge has different types of compute/disk requirement and it is possible to make both the stage parallel or just one based on constraint.

The overall sort pipeline will look something like the sketch below.
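A minimal sketch of the end-to-end pipeline; it reuses the split() helper sketched above, while kWayMerge() is a hypothetical placeholder for the heap-based merge loop shown later in the K Way merge section:

import java.io.IOException;
import java.nio.file.Path;
import java.util.List;

public class ExternalSortPipeline {

    // Stage 1: chunk, sort and spill the input; Stage 2: k-way merge the sorted runs.
    // Either stage can fan out to multiple threads, since chunks are independent.
    public static void sort(Path input, Path output, int chunkSize) throws IOException {
        List<Path> sortedChunks = SplitStage.split(input, chunkSize); // split stage from the sketch above
        kWayMerge(sortedChunks, output);                              // merge stage, see K Way merge below
    }

    // Placeholder for the merge stage: open an iterator per sorted chunk file and
    // drain them through a heap, as shown in the K Way merge section below.
    private static void kWayMerge(List<Path> sortedChunks, Path output) {
        throw new UnsupportedOperationException("sketch only - see the K Way merge section");
    }
}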



This algorithm is used by many databases to manage result sets that can't fit into memory.

The important logic in this algorithm is the k-way merge of sorted results. If K is 2 then it is straightforward.

2 Way merge

The merge process is: look at the head of both iterators, add the smaller value to the output buffer, and advance the iterator whose value was added.

Some edge conditions need to be handled to avoid buffer overflow while reading and to handle iterators of different sizes.


// Position both iterators on their first value.
v1.next();
v2.next();

// While both iterators have data, pick the smaller head and advance that iterator.
while (v1.hasNext() && v2.hasNext()) {
    value1 = v1.value();
    value2 = v2.value();
    if (isLessThan(value1, value2)) {
        buffer.add(value1);
        v1.next();
    } else {
        buffer.add(value2);
        v2.next();
    }
}

// At most one iterator still has data; copy its remaining values to the buffer.
if (v1.hasNext()) {
    append(buffer, v1);
} else if (v2.hasNext()) {
    append(buffer, v2);
}

K Way merge

Assume K is 4. One way to merge is to split the whole list into pairs, merge each pair, and finally merge the outputs of the 2-way merges. This works, but it cannot take advantage of batching multiple iterators together.

The recommended way is to use a heap of K iterators. This is more efficient, as we can process all inputs in a single pass and reduce IO overhead as well.

PriorityQueue<LineIterator> heap=...
LineIterator itr;

while ((itr = heap.poll()) != null) {
write(writer, itr.value());
itr.next();
if (itr.hasNext()) {
heap.add(itr);
}
}

BitMap Sorting

A bitmap is a powerful data structure for searching, and it has some interesting properties for sorting as well.

Consider a scenario where the file contains n positive integers and each value is less than K.

K can be really huge depending on the max value. To put this in context, just by using 256 MB of memory (2^31 bits) around 2 billion int values can be sorted.

The idea is based on an allocated array where every element is a word (i.e. 32 or 64 bits). If we use 32-bit words then 32 values can be stored in every slot, and the total capacity of the data structure is 32 * len(array).

Setting a bit needs 2 pieces of information: the slot in the array and the bit position within that slot.
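As a worked example (assuming 32-bit words, matching the code below):

// value = 75, word size = 32 bits
// slot     = 75 / 32        = 75 >> 5 = 2
// position = 1 << (75 % 32) = 1 << 11
// i.e. the value 75 is recorded by setting bit 11 of values[2]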




Bit fiddling makes it possible to pack multiple values into a single word; if you want to read more on bit fiddling then refer to bit-fiddling.

In this example the word is 4 bytes, so the word size is 32 bits.

Checking for a value is straightforward: it involves doing a bitwise & against the slot value.

Complete working code 

public class BitMapSort {

    public static final int NO_OF_BYTE = 4;
    private final int WORD_SIZE = 8 * NO_OF_BYTE;    // 32 bits per slot
    private final int SLOT_SHIFT = NO_OF_BYTE + 1;   // log2(WORD_SIZE) = 5 for 32-bit words
    private final int[] values;
    private final int maxValue;

    public BitMapSort(int maxValue) {
        this.maxValue = maxValue;
        this.values = new int[1 + maxValue / WORD_SIZE]; // one slot per 32 values
        logUsageInfo(maxValue);
    }

    // Mark the value as present by setting its bit.
    public void set(int v) {
        this.values[slot(v)] |= position(v);
    }

    // Check whether the value is present by testing its bit.
    public boolean check(int v) {
        int value = this.values[slot(v)] & position(v);
        return value != 0;
    }

    // Bit mask for the position of v inside its slot (v % 32).
    public int position(int v) {
        return 1 << (v & (WORD_SIZE - 1));
    }

    // Index of the slot that holds v (v / 32).
    public int slot(int v) {
        return v >> SLOT_SHIFT;
    }

    // Minimal version: report how much memory the bitmap uses.
    private void logUsageInfo(int maxValue) {
        System.out.println("Max value " + maxValue + " -> " + (values.length * NO_OF_BYTE) + " bytes");
    }
}


The whole pipeline will look something like the sketch below.
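A minimal sketch of the bitmap sort pipeline, assuming one positive int per line in the input file and distinct values; file names and the max value are illustrative assumptions. It uses the consume() iteration helper shown later in the trade-off section:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Paths;

public class BitMapSortPipeline {

    public static void main(String[] args) throws IOException {
        BitMapSort bitmap = new BitMapSort(1_000_000_000); // max expected value

        try (BufferedReader reader = Files.newBufferedReader(Paths.get("values.txt"));
             PrintWriter writer = new PrintWriter(Files.newBufferedWriter(Paths.get("sorted.txt")))) {
            String line;
            while ((line = reader.readLine()) != null) {
                bitmap.set(Integer.parseInt(line.trim())); // mark value as present
            }
            bitmap.consume(writer::println); // emits values in ascending order
        }
    }
}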


Trade Off


Nothing is perfect, and this approach also has some constraints that are good to be aware of.

- This is a dense value data structure, so to store a value as large as 100 million we have to allocate at least 100 million bits (about 12 MB). If values are sparse, use an alternate data structure.

- Thread safety has to be handled at slot level because 32 values are packed in a single slot.

- Values should be distinct. If duplicate values are present and the number of duplicates is small, an additional data structure like a map can be used to keep frequency counts. This needs to be handled intelligently, for example by having a threshold on duplicate values; once the threshold is crossed it is better to stop accepting values, to avoid everything going into the map.

- Iteration. Since this is a compressed representation of dense values, iteration over the available values has to be handled in a streaming way to avoid allocating a huge in-memory collection. One approach is an API that consumes a single value at a time and lets the client decide what to do with those values; such an iteration could look something like below.

// Stream every value whose bit is set, in ascending order, to the consumer.
public void consume(IntConsumer consumer) {
    IntStream
            .rangeClosed(1, maxValue)   // probe every candidate value up to maxValue
            .filter(this::check)        // keep only values that were set
            .forEach(consumer::accept);
}

- Range iteration. This data structure is very good for range queries; see the sketch after this list.

- Compact set. This is also a good data structure for set-related operations.
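For the range iteration point above, a minimal sketch of a range query added inside the BitMapSort class next to consume(); the method name and inclusive bounds are illustrative assumptions, not part of the original class:

// Stream every set value in [from, to] to the consumer, in ascending order.
public void consumeRange(int from, int to, IntConsumer consumer) {
    IntStream
            .rangeClosed(from, to)      // probe only the requested range
            .filter(this::check)        // keep values whose bit is set
            .forEach(consumer::accept);
}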

Conclusion

These are simple yet very powerful algorithms, and if they fit the bill they can be the difference between solving the problem and not solving it at all.
