
Sunday, 13 December 2020

Sorting under constraint algorithms

In the compiler world, code inlining is the most important optimization because it enables other optimizations like dead code elimination, prefetching, out-of-order execution, etc.

Similarly, sorting enables many optimizations like binary search, streaming aggregation, range search operations, prefix compression, run length encoding, delta encoding, trend analysis, posting lists, data partitioning, etc.



Sorting is memory and compute intensive work, and many times we don't have enough memory or compute to do it.

In this post I will share two sorting approaches that can be used when a system has memory or CPU constraints.

Disk based sorting

We have a file containing 1 TB of data and we want to sort it. The data is too big to fit in memory, so a standard in-memory sorting algorithm cannot be used.

One way to handle sorting of such data is to split it into chunks, sort each chunk in memory, persist the sorted chunk to disk, and finally merge the sorted chunks using a k-way merge algorithm.


At a high level the sort pipeline will look something like below.



A nice thing about this algorithm is that it is embarrassingly parallel.

This algorithm is also a good example of divide and conquer, and the technique can be applied to both stages.

The pipeline has two stages that can each be executed in parallel to take advantage of multiple cores.

Let's assume the input file contains 10 million records; the Split stage decomposes it into smaller sorted chunks, as sketched below.
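Below is a minimal sketch of the split stage, assuming line-based records sorted lexicographically; the SplitStage class, the chunkSize parameter and the writeSortedChunk helper are my own illustrative names, not from the original post.

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class SplitStage {

    // Reads the big input file chunk by chunk, sorts each chunk in memory
    // and spills it to a temporary file; returns the list of chunk files.
    public static List<Path> split(Path input, int chunkSize) throws IOException {
        List<Path> chunks = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(input)) {
            List<String> buffer = new ArrayList<>(chunkSize);
            String line;
            while ((line = reader.readLine()) != null) {
                buffer.add(line);
                if (buffer.size() == chunkSize) {
                    chunks.add(writeSortedChunk(buffer, chunks.size()));
                    buffer.clear();
                }
            }
            if (!buffer.isEmpty()) {
                chunks.add(writeSortedChunk(buffer, chunks.size()));
            }
        }
        return chunks;
    }

    private static Path writeSortedChunk(List<String> buffer, int chunkNo) throws IOException {
        Collections.sort(buffer);                                   // in-memory sort of one chunk
        Path chunk = Files.createTempFile("chunk-" + chunkNo + "-", ".txt");
        Files.write(chunk, buffer);                                 // spill the sorted chunk to disk
        return chunk;
    }
}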


In the Merge stage we do the reverse operation: take multiple sorted files and create a single sorted one.




Split and Merge have different compute/disk requirements, and it is possible to parallelize both stages or just one of them based on the constraint.

The overall sort pipeline will look like below.



This algorithm is used by many databases to manage result sets that can't fit into memory.

The important logic in this algorithm is the k-way merge of sorted results. If K is 2 then it is straightforward.

2 Way merge

The merge process picks the head of both iterators, adds the smaller value to the output, and advances the iterator whose value was added.

Some edge conditions need to be handled to avoid buffer overflows while reading and to deal with iterators of different sizes.


// v1 and v2 are cursor-style iterators: next() advances, value() reads the current element.
v1.next();
v2.next();

while (v1.hasNext() && v2.hasNext()) {
    value1 = v1.value();
    value2 = v2.value();
    if (isLessThan(value1, value2)) {
        buffer.add(value1);   // take the smaller value
        v1.next();            // and advance only that iterator
    } else {
        buffer.add(value2);
        v2.next();
    }
}

// At most one iterator still has elements left; append the remainder as-is.
if (v1.hasNext()) {
    append(buffer, v1);
} else if (v2.hasNext()) {
    append(buffer, v2);
}

K Way merge

Assume K is 4: one way to merge is to split the list of iterators into pairs, keep merging each pair, and finally merge the outputs of the 2-way merges. This works, but it can't take advantage of processing multiple iterators together.

The recommended way is to use a heap of K iterators, ordered by their current value. This is more efficient because all inputs are processed in a single pass, which also reduces IO overhead.

PriorityQueue<LineIterator> heap = ...; // ordered by each iterator's current value
LineIterator itr;

while ((itr = heap.poll()) != null) {
    write(writer, itr.value());  // smallest current value across all iterators
    itr.next();
    if (itr.hasNext()) {
        heap.add(itr);           // re-insert so the heap re-orders on the new head value
    }
}
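
One way the heap could be seeded, assuming the same cursor-style LineIterator API as above (next() advances, hasNext() says whether a current value exists, value() returns a Comparable such as String); the comparator and the seeding loop below are illustrative:

// Order iterators by their current head value so poll() always returns the smallest.
PriorityQueue<LineIterator> heap =
        new PriorityQueue<>(Comparator.comparing(LineIterator::value));

for (LineIterator it : sortedChunkIterators) {
    it.next();              // position the cursor on the first element
    if (it.hasNext()) {
        heap.add(it);
    }
}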

BitMap Sorting

A bitmap is a powerful data structure for searching, and it has some interesting properties for sorting as well.

Consider a scenario where the file contains n positive integers and each value is less than K.

K can be really huge depending on the max value; to put this in context, with just 256 MB of memory billions of int values can be sorted.

The idea is to allocate an array of machine words (32 or 64 bit). If we use 32-bit words then 32 values can be tracked in every slot, one bit per value, so the total capacity of this data structure is 32 * len(array).

Setting a bit needs two pieces of information: the slot in the array and the bit position within that slot. For example, with 32-bit words the value 70 maps to slot 70 / 32 = 2 and bit position 70 % 32 = 6.




Bit fiddling makes it possible to pack multiple values into a single word; if you want to read more about it then refer to bit-fiddling.

In this example the word is 4 bytes, i.e. 32 bits.

Checking for a value is straightforward: do a bitwise & of the bit position against the slot value.

Complete working code 

public static final int NO_OF_BYTE = 4;
private final int WORD_SIZE = 8 * NO_OF_BYTE;          // 32 bits per slot
private final int SLOT_SHIFT = NO_OF_BYTE + 1;         // 5, because 2^5 == 32 == WORD_SIZE
private final int[] values;
private final int maxValue;

public BitMapSort(int maxValue) {
    this.maxValue = maxValue;
    this.values = new int[1 + maxValue / WORD_SIZE];   // one bit per possible value
    logUsageInfo(maxValue);                            // defined elsewhere in the full class
}

public void set(int v) {
    this.values[slot(v)] |= position(v);               // turn on the bit for v
}

public boolean check(int v) {
    int value = this.values[slot(v)] & position(v);    // mask out the bit for v
    return value != 0;
}

public int position(int v) {
    return 1 << (v & (WORD_SIZE - 1));                 // v % 32, as a bit mask
}

public int slot(int v) {
    return v >> SLOT_SHIFT;                            // v / 32
}


The whole pipeline will look something like below.
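As a rough usage sketch of the BitMapSort class above (the input array, the max value and the println sink are just for illustration): mark every value in one pass, then scan the value range in order and the set bits come out sorted.

int[] input = {70, 3, 250, 3_000, 42};       // unsorted, distinct positive ints

BitMapSort bitmap = new BitMapSort(10_000);  // max expected value
for (int v : input) {
    bitmap.set(v);                           // phase 1: mark every value
}

// Phase 2: scan the value range in order; the set bits come out sorted.
for (int v = 0; v <= 10_000; v++) {
    if (bitmap.check(v)) {
        System.out.println(v);
    }
}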


Trade Off


Nothing is perfect, and this approach also has some constraints that are good to be aware of.

- This is a dense value data structure, so to store a value as large as 100 million we have to allocate at least 100 million bits (about 12 MB). If values are sparse then look for an alternate data structure.

- Thread safety has to be handled at the slot level because 32 values are packed into a single slot.

- Values should be distinct. If duplicates are present but rare, an additional data structure like a map can be used to keep frequency counts. This needs to be handled intelligently, for example by having a threshold on the number of duplicates and stopping acceptance once it is crossed, to avoid having everything end up in the map.

- Iteration. Since this is a compressed representation of dense values, iteration over the available values has to be handled in a streaming fashion to avoid allocating a huge in-memory collection. One approach is an API that hands the client one value at a time and lets the client decide what to do with it; such an iteration could look something like below.

public void consume(IntConsumer consumer) {
    IntStream
            .rangeClosed(0, maxValue)     // scan the full value range in order
            .filter(this::check)          // keep only values whose bit is set
            .forEach(consumer::accept);
}

- Range iteration. This data structure is very good for range queries.

- Compact set. This is also a good data structure for set-related operations.

Conclusion

These are simple yet very powerful algorithms, and if one of them fits the bill it can be the difference between solving the problem and not solving it at all.

Tuesday, 20 October 2020

Histogram is not Instagram

A histogram is a very popular technique to represent the distribution of numerical data. This representation can be used for various things like measuring request latency, data cardinality, grouping data into bins, and the frequency density of values.







In this post I will share some of the common use cases for histograms.

Measuring latency

Assume that we want to measure the response time of an API call. The first thing that comes to mind is a stopwatch! The code snippet will look something like this.

StopWatch stopWatch = new StopWatch();
stopWatch.start();

//Some time consuming task

stopWatch.stop();

System.out.println(stopWatch.toString());

This is a good starting point but does not take us any further :-) For serious measurement we need more data points like the best response time, the worst, and the response time at some percentile, e.g. how long 90% of requests are taking. Such a summary lets us understand service latency in more detail and enables better decisions. If data is available at different percentiles then it can be plotted like below to understand it better.


For a consumer-facing service, a slow response means a bad user experience and loss of revenue. Industries like e-commerce and video streaming track response time proactively to retain customers.


Data cardinality

Tracking the cardinality of a dimension is very important in data-intensive applications. It can be used for identifying skew in the data or for finding the optimal plan for executing a read or write request. One such example is the index statistics maintained by databases: a cost-based optimizer uses index stats to decide which index will result in faster execution.

For a time series database, tracking the distribution of data by time of day is useful for finding peak/busy times, and that information can drive custom partitioning logic to manage load. The naive way of doing this would be a frequency map, but it soon runs into various performance issues and also limits the options for range queries.

A histogram comes in very handy for such a scenario. The chart below shows distribution by time.



Frequency density

This is a popular use case in the data science community. For example, an e-commerce company will be interested in knowing the spend bins of customers, or a restaurant will be interested in knowing the frequency of visits.

These metrics can help a business learn about wealthy customers and come up with a better strategy to retain or acquire customers. The example below from Investopedia contains people density by age, very useful information for a portfolio manager.



Histograms have many more interesting use cases. One cool property of a histogram is that it is commutative: item order makes no difference to the final output, which makes it possible to build histograms in parallel and merge them later.

Another important property of a histogram is that it is cumulative; this allows cumulative calculations by merging multiple bins/buckets.


Code Snippet 

Now let's look at a histogram library that can be used for various use cases.

HdrHistogram by Gil Tene is a battle-tested library built for measuring application latency. I found this library to be useful for other use cases as well. It is very memory efficient, designed to work in low-latency environments, and also has very compact lossless serialization.

Latency Measurement

// Assumes a static import of java.util.stream.IntStream.range; doSomething() is the call being measured.
Histogram histogram = new Histogram(2);   // 2 significant digits of precision
range(0, 1000_000)
        .forEach(v -> {
            long timeTaken = doSomething(v);
            histogram.recordValue(timeTaken);
        });
histogram.outputPercentileDistribution(System.out, 1.0);

The Histogram class is part of the HdrHistogram library and accepts various parameters. In this example we are specifying the precision to use (number of significant value digits). This value can range from 0 to 5, and precision has a direct impact on the size of the histogram.

Another important thing about this library is that the histogram will auto-resize based on input values, and this property plays a big role in avoiding large upfront allocation.
Running the above code will produce output like below.


Let's try to understand the output. It contains 4 columns: Value, Percentile, TotalCount, 1/(1-Percentile).

Value - the recorded value at the given percentile.
Percentile - the cumulative percentile rank (0 to 1).
TotalCount - the cumulative count of records at or below this value.
1/(1-Percentile) - another way to read the tail, e.g. 2.00 means half of the records lie above this value.

Let's pick some bucket values to understand how to use it.


50% of requests took 5 seconds - this metric is quoted very often, but on its own it is of little use because it behaves like an average/mean. If you want to do serious measurement then stay away from averages.

97.5% of requests took up to 9.7 seconds - this is where it becomes interesting: a quarter of requests are taking roughly double the median, which is a really bad user experience.

99.86% of requests took 10+ seconds - this shows the real worst case; around 200K+ requests fall in this group, taking double the time compared to the average.

Having latency broken down by percentile provides great insight into the real customer experience. If you are an e-commerce shop then every microsecond counts and can translate into a big gain or loss.
Trading applications are also sensitive to latency spikes, where latency makes a big difference in how fast buy/sell trades can be matched.

Another thing that is very common in the microservices world is that rendering a full page requires hundreds of API calls, which adds the risk that at least one of them gets hit by worst-case latency, so the overall user experience becomes bad :-(

Next time anyone talks about latency, ask for the 99.9999 percentile to know the real impact.
 

Data cardinality

Databases have optimizers, both rule based and cost based. A cost-based optimizer generates plans based on the cost of operations like joins, filters, etc. and selects the best plan quickly.
One of the important factors in deciding which plan to use is the data distribution/stats. Each index maintains data cardinality stats so that the optimizer can quickly figure out which index to use.

Cardinality is also useful for finding data skew and for coming up with a plan to handle it. Distributed computing frameworks like Apache Spark suffer from skew, and it is up to the engineer to add some intelligence to the code to avoid Spark job failures.

Trust me, real data is always skewed.

Let's take a scenario where an e-commerce giant wants to keep statistics like purchase volume at the merchant level, so that it can come up with better platform scaling options.

A histogram internally maintains frequency counts, so it can be used to calculate total volume at the merchant level in an efficient way.
For this scenario let's assign a unique integer id to each merchant and record purchases in the histogram by passing the merchant id.

Some code on how this can be done. 

Histogram histogram = new Histogram(2);
range(0, 1000_000 * 100)                  // 100 million transactions
        .forEach(v -> {
            int merchant = merchant();    // merchant id of this transaction
            histogram.recordValue(merchant);
        });
histogram.outputPercentileDistribution(System.out, 1.0);

In the above example, merchant ids from 100 million transactions are used to build the histogram. To show some real skew I have given 50% weight to one particular merchant (merchant id 101).

Running the above code will produce something like below.

 Value     Percentile TotalCount 1/(1-Percentile)

        1.00 0.000000000000       1043           1.00
      101.00 0.100000000000   50050299           1.11
      101.00 0.200000000000   50050299           1.25
      101.00 0.300000000000   50050299           1.43
      101.00 0.400000000000   50050299           1.67
      101.00 0.500000000000   50050299           2.00
    10047.00 0.550000000000   55023464           2.22
    20095.00 0.600000000000   60047899           2.50
    30079.00 0.650000000000   65038139           2.86
    40191.00 0.700000000000   70092127           3.33
    50175.00 0.750000000000   75086627           4.00
    55039.00 0.775000000000   77520056           4.44
 
50% of transactions (i.e. 50 million) belong to merchant 101. This information can be used to optimize read or write queries, or to come up with good product offerings.

One thing to remember is that histograms can be merged, so it is possible to compute histograms on multiple nodes and later merge them to get the full view; in fact it is possible to build a histogram incrementally.
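
For example, two histograms built on different nodes (nodeA/nodeB are illustrative names) can be folded together with HdrHistogram's add():

Histogram nodeA = new Histogram(2);   // built on node A
Histogram nodeB = new Histogram(2);   // built on node B
// ... each node records its own values ...

Histogram merged = new Histogram(2);
merged.add(nodeA);                    // add() folds the counts of another histogram in
merged.add(nodeB);

System.out.println(merged.getTotalCount());
System.out.println(merged.getValueAtPercentile(99.0));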

Next time data skew is troubling you, try a histogram.

Frequency Density 

This use case takes the histogram to the next level. Let's extend the e-commerce example: now we want to rank customers based on the total dollar value they spend.

A histogram comes in handy: we just feed in the per-customer total spend and get buckets that can be used for ranking customers.

Histogram histogram = new Histogram(2);
range(0, 1000_000 * 100)                  // 100 million purchases
        .forEach(v -> {
            long spend = spendValue();    // spend amount of this purchase
            histogram.recordValue(spend);
        });
histogram.outputPercentileDistribution(System.out, 1.0);


Again we take 100 million purchases, this time recording the spend value; this creates buckets based on spend, and customers can later be ranked using these buckets.

The bucket output will look something like below.

 Value     Percentile TotalCount 1/(1-Percentile)

        0.00 0.000000000000       4963           1.00
     2007.00 0.100000000000   10036946           1.11
     4015.00 0.200000000000   20077808           1.25
     6015.00 0.300000000000   30075647           1.43
     8031.00 0.400000000000   40157837           1.67
    10047.00 0.500000000000   50238571           2.00
    11007.00 0.550000000000   55038969           2.22
    12031.00 0.600000000000   60159541           2.50
    13055.00 0.650000000000   65279759           2.86
    14015.00 0.700000000000   70077894           3.33
 
This output can be interpreted as: 10 million customers spend in the range of 0 to 2K, another 10 million spend 2K to 4K, and so forth. The higher buckets contain the premium customers.

Histograms have many more applications; they can be used in machine learning to build features because they allow vector-based arithmetic like add/subtract.

All the code used in this post is available @ measure github

Wednesday, 6 April 2016

Preparation of Coding Interview


I want to share my coding interview experience, so I have decided to write a blog series about the questions that are asked; this is the first post in that series.

Coding interviews are very common these days, and if you find a company that does not have one as part of its selection process then I think you should stay away from it.

Let's start with why a coding interview is required. The reason is simple, especially in times when every company is adopting agile: they don't want to wait for months after hiring to learn about a developer's coding skills.

That is why it is the first gate you have to pass as a developer.

Jeff Atwood shares an interesting experience in why-cant-programmers-program about the type of question asked to filter out candidates. FizzBuzz is a very good question to start with.

I would advise you to try FizzBuzz to make sure you are not surprised, and to think about testing as well because we are in the world of TDD. A sketch of one possible solution follows below.
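
A minimal FizzBuzz sketch in Java, just to show the shape of a solution; keeping the rule in a small pure function (the fizzBuzz name is mine) makes it easy to unit test first.

public class FizzBuzz {

    // Pure function: easy to cover with unit tests before wiring up any printing.
    public static String fizzBuzz(int n) {
        if (n % 15 == 0) return "FizzBuzz";
        if (n % 3 == 0)  return "Fizz";
        if (n % 5 == 0)  return "Buzz";
        return Integer.toString(n);
    }

    public static void main(String[] args) {
        for (int i = 1; i <= 100; i++) {
            System.out.println(fizzBuzz(i));
        }
    }
}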


What types of question are asked

Different types of questions are asked in a coding interview, and it depends on the type of company you apply to.

 - Algorithms & Data structures

This is very common for Silicon Valley or product development companies, and some serious preparation is required for these interviews if your daily work involves only playing with frameworks (i.e. Hibernate/Spring etc.).

These interviews become more difficult because many developers (like me) focus so much on learning emerging frameworks to build a bleeding-edge resume that we tend to forget basic algorithms and start tagging ourselves as "XYZ framework developers", and companies also contribute to the cause by putting such requirements in job postings.

- Simple problems using TDD

This type of question is also very common; it is meant to make sure that you don't leave testing to the QA team and that you write code that is defect free, easy to understand, and maintainable.

- Refactoring problems

You don't work on greenfield projects every day; most of your time is spent reading other people's code and adding features to it, so the employer wants to know "how quickly can you understand code written by somebody else and add a new feature to it".

TDD becomes even more important in such questions because it can be used to build a safety net around the existing code before you start changing it.

The above questions can be asked as pair-programming exercises or take-home problems, so you have to build a solution within the time given to you.

Some of the things the employer wants to check are:

 - Simplicity/Readability of code.
 - Maintainability of code. 
 - Design principle usage.

The bottom line is that you have to prepare for coding interviews to survive in a fast-changing market.

Stay tuned for the questions.


Thursday, 23 July 2015

Efficiency with Algorithms

Recently I watched the excellent talk Efficiency with Algorithms, Performance with Data Structures; it has really good content on performance.

In this blog I will share some ideas from the above talk and a few things that I have learned.

Pre processing 
This is a very common trick: a trade-off between the processing required when the actual request comes vs the time taken to pre-compute something. Some good examples follow.

IndexOf
This is a very common string operation that almost every application needs. Java uses a brute-force algorithm for it, which can become a bottleneck very soon, so it is better to use the Knuth-Morris-Pratt algorithm, which pre-computes a table to get better performance.

Search
Sequential search vs binary search is a trade-off between search time and pre-processing (i.e. sorting) time; sorting starts to pay off once the number of compares required by sequential search goes over O(log n).

Binary search is at the heart of many algorithms; it reduces expensive search operations, as in the small sketch below.
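
A tiny sketch of the sort-once, search-many trade-off using the JDK's Arrays utilities (assuming java.util.Arrays is imported; the data and lookups are made up):

int[] data = {42, 7, 19, 3, 88, 56};

Arrays.sort(data);                        // pay the pre-processing cost once: O(n log n)

// Every subsequent lookup is O(log n) instead of a linear scan.
System.out.println(Arrays.binarySearch(data, 19) >= 0);   // true
System.out.println(Arrays.binarySearch(data, 20) >= 0);   // false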

Many string permutation search problems are solved using it.

HashMap in Java has a nice optimization for when many keys with the same hash code are added: to ameliorate the impact, when keys are Comparable, the class may use comparison order among keys to help break ties.

Indexes
Lookup is an expensive operation and binary search can't answer every type of query, so you should build specialized indexes for fast search. Some of the options are key/value, prefix/suffix, bitmap indexes, inverted indexes, etc.

Pre Allocate
This is a classic one: if you know how big your data structure will be then it is better to pre-allocate it rather than let it expand multiple times and incur the cost of allocation & copy.

Array-based data structures double their capacity every time re-allocation is required; this amortizes the allocation cost by doing bigger and less frequent allocations. In the Java world ArrayList, HashMap, etc. are good examples of that, as shown in the sketch below.
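
A small sketch of pre-sizing, assuming java.util imports and that the expected element count (1_000_000 here, an arbitrary figure) is known up front:

int expected = 1_000_000;

// Without a size hint the list starts tiny and re-allocates + copies repeatedly as it grows.
List<Integer> resized = new ArrayList<>();

// With a size hint the backing array is allocated once.
List<Integer> preSized = new ArrayList<>(expected);

// HashMap takes a capacity hint too; divide by the load factor so it never has to rehash.
Map<Integer, String> map = new HashMap<>((int) (expected / 0.75f) + 1);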

Many design patterns like Object Pool/Flyweight are based on this idea.

Reduce Expensive/Duplicate operation
Each algorithm has some expensive operations, and to make it efficient those expensive operations should be reduced. Some common examples are:

HashCode
Hash code computation is an expensive operation, so it should not be computed multiple times for a given key. One way is to compute it once and pass it to other functions so recomputation is not required, or to pre-compute/cache it; for example the String class in Java caches its hash code to save time.
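
A minimal sketch of the caching idea, using my own CachedKey class rather than the JDK source: compute the hash lazily once and reuse it, similar in spirit to what java.lang.String does.

public final class CachedKey {
    private final byte[] data;
    private int hash;                 // 0 means "not computed yet", like String.hash

    public CachedKey(byte[] data) {
        this.data = data.clone();
    }

    @Override
    public int hashCode() {
        int h = hash;
        if (h == 0) {                 // compute at most once (benign race, same as String)
            h = java.util.Arrays.hashCode(data);
            hash = h;
        }
        return h;
    }

    // equals() omitted for brevity; a real key class must override it consistently.
}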

Modulus /Divide 
These are among the more expensive arithmetic operations; bitwise operations can often do the same thing at lower cost.
If a data structure is circular, e.g. a circular array, and you want to put a value in the next free slot, then a modulus operation is required. If the capacity of the array is a power of 2, the bitwise operation index & (capacity - 1) gives the same result as the modulus and yields a significant performance gain at the cost of a little extra memory. This technique is used by HashMap in Java, and is sketched below.
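
A quick sketch of the equivalence, assuming a power-of-two capacity (the numbers are arbitrary):

int capacity = 16;                        // must be a power of 2
int index = 37;

int withModulus = index % capacity;       // 5
int withBitMask = index & (capacity - 1); // 5, same result but cheaper

System.out.println(withModulus == withBitMask);   // true for non-negative indexes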

Similarly, the right shift (>>) operation can be used for division by powers of 2 to get better performance, but nowadays compilers are smart enough to do this for you, so there is no need to write it by hand.

Reduce copy overhead
Array-based data structures amortize the resizing cost by doubling capacity, which is good, but the overhead of copying comes along with it. Another approach is a chain of arrays: you still allocate a big new chunk, but you don't have to copy the old values, you just add the new block of memory to the chain of allocated blocks.
GS Collections has CompositeFastList, which is built using this approach.

Batching
Expensive operations like writes to a file/network/database should be batched to get the best performance.

Co-locate data
Keeping all required data together gives very good performance because of how the processor cache works, but this is one thing that is broken in many applications.
Mike Acton talks about this in detail in his Data-Oriented Design and C++ talk.

Column-based storage is very good for analytic/aggregation use cases, because most of the time the data of a single column across all rows is needed to answer an analytic/aggregation request.

Linear-probing hash tables are also a good example of data co-location.

Bit packing is another option to keep required data very close, but it should be used with extra caution because it can make code complex.

Unnecessary Work
Some algorithms suffer from this problem; the most common are recursive algorithms like factorial or Fibonacci, which do duplicate work that can be eliminated with the memoization technique, as sketched below.
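
A minimal memoization sketch for Fibonacci; the array cache is just one of several possible approaches.

public class Fib {

    // Naive recursion recomputes the same sub-problems exponentially many times.
    static long fibNaive(int n) {
        return n < 2 ? n : fibNaive(n - 1) + fibNaive(n - 2);
    }

    // Memoized version: each sub-problem is computed once and cached.
    static long fibMemo(int n, long[] cache) {
        if (n < 2) return n;
        if (cache[n] != 0) return cache[n];
        return cache[n] = fibMemo(n - 1, cache) + fibMemo(n - 2, cache);
    }

    public static void main(String[] args) {
        long[] cache = new long[51];
        System.out.println(fibMemo(50, cache));   // fast; fibNaive(50) would take ages
    }
}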


Simplicity & readability of code should not be lost in the pursuit of efficiency, because the next person has to maintain it :-)