In the compiler world, code inlining is the most important optimization because it enables other optimizations like dead code elimination, prefetching, out-of-order execution, etc.
Similarly, sorting enables many optimizations like binary search, streaming aggregation, range search operations, prefix compression, run-length encoding, delta encoding, trend analysis, posting lists, data partitioning, etc.
Sorting is memory and compute intensive work, and many times we don't have enough compute or memory to do it.
In this post I will share 2 sorting algorithms that can be used when a system has memory or CPU constraints.
Disk based sorting
We have a file containing 1 TB of data and we want to sort it. The data is too big to sort with a standard in-memory sorting algorithm.
One way to handle sorting of such data is to split it into chunks, sort each chunk in memory, persist the sorted chunk to disk, and finally merge the sorted chunks using a k-way merge algorithm.
At a high level the sort pipeline will look something like below.
The nice thing about this algorithm is that it is embarrassingly parallel.
It is also a good example of divide and conquer, and that technique can be applied to both stages.
The pipeline has 2 stages that can be executed in parallel to take advantage of multiple cores.
Let's assume the input file contains 10 million records; the split stage decomposes it into many smaller sorted chunks, something like the sketch below.
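This is a minimal sketch assuming line-oriented text input sorted lexicographically; the method names, the chunk size parameter and the temp-file handling are illustrative, not from the original pipeline. Since chunks are independent, each one can be sorted on a separate core.

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Split stage sketch: read the input in fixed-size chunks, sort each chunk in
// memory and persist it as a sorted run on disk.
static List<Path> splitIntoSortedChunks(Path input, int chunkLines) throws IOException {
    List<Path> chunks = new ArrayList<>();
    try (BufferedReader reader = Files.newBufferedReader(input)) {
        List<String> buffer = new ArrayList<>(chunkLines);
        for (String line; (line = reader.readLine()) != null; ) {
            buffer.add(line);
            if (buffer.size() == chunkLines) {
                chunks.add(writeSortedChunk(buffer));
                buffer.clear();
            }
        }
        if (!buffer.isEmpty()) {
            chunks.add(writeSortedChunk(buffer));     // last partial chunk
        }
    }
    return chunks;
}

static Path writeSortedChunk(List<String> lines) throws IOException {
    Collections.sort(lines);                          // in-memory sort of one chunk
    Path chunk = Files.createTempFile("sorted-chunk", ".txt");
    Files.write(chunk, lines);                        // persist the sorted run
    return chunk;
}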
The merge stage does the reverse: it takes multiple sorted files and produces a single one.
Split and merge have different compute/disk requirements, and it is possible to parallelize both stages or just one of them based on the constraint.
The overall sort pipeline will look like below.
This algorithm is used by many databases to manage result sets that can't fit into memory.
The important logic in this algorithm is the k-way merge of sorted results. If K is 2 then it is straightforward.
2 Way merge
// Position both iterators on their first value.
v1.next();
v2.next();
// Repeatedly take the smaller of the two current values.
while (v1.hasNext() && v2.hasNext()) {
    value1 = v1.value();
    value2 = v2.value();
    if (isLessThan(value1, value2)) {
        buffer.add(value1);
        v1.next();
    } else {
        buffer.add(value2);
        v2.next();
    }
}
// At most one iterator can still have values left; drain it.
if (v1.hasNext()) {
    append(buffer, v1);
} else if (v2.hasNext()) {
    append(buffer, v2);
}
K Way merge
// Min-heap of chunk iterators, ordered by the current value of each iterator.
PriorityQueue<LineIterator> heap = ...
LineIterator itr;
while ((itr = heap.poll()) != null) {
    write(writer, itr.value());   // emit the smallest current value
    itr.next();                   // advance this chunk's iterator
    if (itr.hasNext()) {
        heap.add(itr);            // put it back on the heap if it still has values
    }
}
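The heap construction elided above (heap = ...) could look something like the sketch below. It assumes the same LineIterator contract as the 2-way merge (next() advances, hasNext() says whether a current line is available, value() returns it as a String) and a hypothetical list with one iterator per sorted chunk from the split stage.

import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Build a min-heap of chunk iterators ordered by their current line, seeded
// with one iterator per sorted chunk produced by the split stage.
static PriorityQueue<LineIterator> buildHeap(List<LineIterator> chunkIterators) {
    PriorityQueue<LineIterator> heap =
            new PriorityQueue<>(Comparator.comparing(LineIterator::value));
    for (LineIterator itr : chunkIterators) {
        itr.next();               // position the iterator on its first line
        if (itr.hasNext()) {
            heap.add(itr);        // only non-empty chunks join the merge
        }
    }
    return heap;
}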
BitMap Sorting
Bitmap is a powerful data structure for searching, and it has some interesting properties for sorting as well.
Consider a scenario where the file contains n positive integers and each value is less than K.
K can be really huge depending on the max value; to put this in context, with just 256 MB of memory (around 2 billion bits) billions of int values can be sorted.
The idea is to allocate an array of words (i.e. 32 or 64 bits each) and use one bit per value. If we use 32-bit words then 32 values can be stored in every slot, so the total capacity of this data structure is 32 * len(array).
Setting a bit needs 2 pieces of information: the slot in the array and the position within that slot. For example, with 32-bit words the value 75 lands in slot 75 / 32 = 2 at bit position 75 % 32 = 11.
Bit fiddling makes it possible to pack multiple values into a single word; if you want to read more on bit fiddling then refer to bit-fiddling.
In this example the word is 4 bytes, so the word size is 32 bits.
Checking for a value is straightforward: it involves doing a bitwise & on the slot value.
Complete working code
public class BitMapSort {

    public static final int NO_OF_BYTE = 4;             // 4-byte (32-bit) words
    private final int WORD_SIZE = 8 * NO_OF_BYTE;        // 32 bits per slot
    private final int SLOT_SHIFT = NO_OF_BYTE + 1;       // shift by 5 == divide by 32
    private final int[] values;
    private final int maxValue;

    public BitMapSort(int maxValue) {
        this.maxValue = maxValue;
        this.values = new int[1 + maxValue / WORD_SIZE]; // one bit per possible value
        logUsageInfo(maxValue);
    }

    public void set(int v) {
        this.values[slot(v)] |= position(v);             // turn on the bit for v
    }

    public boolean check(int v) {
        int value = this.values[slot(v)] & position(v);  // isolate the bit for v
        return value != 0;
    }

    public int position(int v) {
        return 1 << (v & (WORD_SIZE - 1));               // bit index within the slot (v % 32)
    }

    public int slot(int v) {
        return v >> SLOT_SHIFT;                          // slot index in the array (v / 32)
    }

    private void logUsageInfo(int maxValue) {
        // Minimal placeholder for the original logging helper.
        System.out.println("Allocated " + values.length + " words for max value " + maxValue);
    }
}
The whole pipeline will look something like below.
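One way to express that pipeline in code is sketched below; it assumes the BitMapSort class above and an input array of distinct positive ints no larger than maxValue (the bitmapSort method itself is illustrative).

// Bitmap sort pipeline sketch: mark a bit per input value, then walk the bit
// positions in ascending order to emit the values back in sorted order.
static int[] bitmapSort(int[] input, int maxValue) {
    BitMapSort bitmap = new BitMapSort(maxValue);
    for (int v : input) {
        bitmap.set(v);                        // mark each value present
    }
    int[] sorted = new int[input.length];
    int idx = 0;
    for (int v = 1; v <= maxValue; v++) {     // walk candidate values in order
        if (bitmap.check(v)) {
            sorted[idx++] = v;                // emit values whose bit is set
        }
    }
    return sorted;
}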
Trade Off
- This is a dense value data structure, so if we have to store a value as large as 100 million then we have to allocate at least 100 million bits (about 12 MB). If values are sparse then find an alternate data structure.
- Thread safety has to be handled at the slot level because 32 values are packed in a single slot; a sketch of one way to do this is shown below.
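This sketch assumes the plain int[] is replaced with a java.util.concurrent.atomic.AtomicIntegerArray; it is not part of the original code.

import java.util.concurrent.atomic.AtomicIntegerArray;

// Created in the constructor in place of the int[]:
// this.atomicValues = new AtomicIntegerArray(1 + maxValue / WORD_SIZE);
private final AtomicIntegerArray atomicValues;

public void set(int v) {
    int slot = slot(v);
    int mask = position(v);
    int current;
    do {
        current = atomicValues.get(slot);
        if ((current & mask) != 0) {
            return;                                              // bit already set
        }
    } while (!atomicValues.compareAndSet(slot, current, current | mask));
}

// The CAS loop retries until the bit for v is set atomically, so concurrent
// writers hitting the same slot do not overwrite each other's bits.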
- Values should be distinct. If duplicate values are present and the number of duplicates is expected to be small, an additional data structure like a map can be used to keep frequency counts. This needs to be handled a little intelligently, for example by having some threshold on duplicate values; if that threshold is crossed then it is better to stop accepting values, to avoid everything ending up in the map.
- Iteration. Since this is a compressed representation of dense values, iteration over the available values has to be handled in a streaming fashion to avoid allocating a huge in-memory collection. One approach is an API that consumes a single value at a time and lets the client decide what to do with those values; such an iteration could look something like below.
// Streams every value present in the bitmap to the consumer, in ascending
// order, without materializing a huge in-memory collection.
// (Uses java.util.function.IntConsumer and java.util.stream.IntStream.)
public void consume(IntConsumer consumer) {
    IntStream
            .range(1, maxValue)           // candidate positive values below maxValue
            .filter(this::check)          // keep only values whose bit is set
            .forEach(consumer::accept);
}
- Range iteration. This data structure is very good for range queries; a sketch of one is shown after this list.
- Compact set. This is also a good data structure for set-related operations.
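To illustrate the range iteration point, a small range query over the bitmap could look something like below, following the same streaming style as consume above (the method name consumeRange is illustrative).

import java.util.function.IntConsumer;
import java.util.stream.IntStream;

// Streams only the values inside [from, to] whose bit is set, without
// materializing the whole value set in memory.
public void consumeRange(int from, int to, IntConsumer consumer) {
    IntStream
            .rangeClosed(from, to)        // candidate values in the requested range
            .filter(this::check)          // keep only values present in the bitmap
            .forEach(consumer::accept);
}

For example, bitMapSort.consumeRange(1_000, 2_000, System.out::println) prints every stored value between 1,000 and 2,000 in ascending order.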