Tuesday, 20 October 2020

Histogram is not Instagram

A histogram is a very popular technique for representing the distribution of numerical data. This representation can be used for various things like measuring request latency, tracking data cardinality, grouping data into bins, and computing the frequency density of values.







In this post I will share some of the common use cases of histograms.

Measuring latency

Assume that we want to measure the response time of an API call. The first thing that comes to mind is a stopwatch! The code snippet will look something like this.

// StopWatch here could be e.g. org.apache.commons.lang3.time.StopWatch
// or Spring's org.springframework.util.StopWatch - both have this shape.
StopWatch stopWatch = new StopWatch();
stopWatch.start();

// Some time-consuming task

stopWatch.stop();

System.out.println(stopWatch.toString());

This is a good starting point, but it does not take us any further :-) For serious measurement we need more data points, like the best response time, the worst, and the response time at various percentiles, e.g. how long 90% of requests are taking. Such a summary allows us to understand service latency in more detail and enables better decisions. If the data is available at different percentiles, it can be plotted like below to understand it better.


For a consumer-facing service, a slow response means a bad user experience and loss of revenue. Industries like e-commerce and video streaming track response time proactively to retain customers.


Data cardinality

Tracking the cardinality of a dimension is very important in data-intensive applications. It can be used for identifying skew in data or for finding the optimal plan for executing a read or write request. One such example is the index statistics maintained by databases. A database's cost-based optimizer uses index stats to decide which index will result in faster execution.

For a time-series database, tracking the distribution of data by time of day is useful for finding peak/busy times, and that information can drive custom partitioning logic to manage load. The naive way of doing this is a frequency map, but it soon runs into various performance issues and also limits the options for range queries.
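A minimal sketch of that naive approach, just to make the contrast concrete (the hour keys and variable names below are purely illustrative):

import java.util.TreeMap;

// Naive frequency map: one exact counter per key (here, hour of day).
// Memory grows with the number of distinct keys, and a range query
// has to walk every entry that falls inside the range.
TreeMap<Integer, Long> countsByHour = new TreeMap<>();
countsByHour.merge(13, 1L, Long::sum); // record one event at hour 13

// e.g. total events between 09:00 and 17:00 - a scan over the sub-map.
long daytimeEvents = countsByHour.subMap(9, true, 17, true)
        .values().stream().mapToLong(Long::longValue).sum();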

A histogram comes in very handy for such a scenario. The chart below shows the distribution by time.



Frequency density

This is a popular use case in the data science community. For example, an e-commerce company will be interested in knowing the spend bins of its customers, or a restaurant will be interested in knowing the frequency of visits.

These metrics can help a business know about wealthy customers and come up with a better strategy to retain them or acquire new customers. The example below from Investopedia contains people density by age, which is very useful information for a portfolio manager.



Histograms have many more interesting use cases. One of the cool properties of a histogram is that it is commutative: the order of items makes no difference to the final output, which is why it is possible to build histograms in parallel and merge them later.

Another important property of a histogram is that it is cumulative: this property allows cumulative calculations by merging multiple bins/buckets.
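A small sketch of both properties using plain arrays (the bin counts below are made up for illustration):

// Two histograms over the same bins, built independently (e.g. on two nodes).
long[] histA = {5, 10, 3, 1};
long[] histB = {2, 7, 4, 0};

// Commutative: merging is element-wise addition, so the order of items
// (and of merges) makes no difference to the final result.
long[] merged = new long[histA.length];
for (int i = 0; i < merged.length; i++) {
    merged[i] = histA[i] + histB[i];
}

// Cumulative: a running total over bins gives "count of values up to this bin",
// which is exactly what percentile lookups are built on.
long[] cumulative = new long[merged.length];
long runningTotal = 0;
for (int i = 0; i < merged.length; i++) {
    runningTotal += merged[i];
    cumulative[i] = runningTotal;
}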


Code Snippet 

Now let's look at one of the histogram libraries that can be used for these use cases.

HdrHistogram by Gil Tene is a battle-tested library built for measuring application latency. I found this library to be useful for other use cases as well. It is very memory efficient, designed to work in low-latency environments, and it also has a very compact, lossless serialization format.
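For example, a histogram can be round-tripped through its compressed form in a few lines (a rough sketch; buffer sizing and error handling kept minimal):

import java.nio.ByteBuffer;
import java.util.zip.DataFormatException;
import org.HdrHistogram.Histogram;

Histogram histogram = new Histogram(2);
histogram.recordValue(42);

// Encode into a compact, lossless, compressed representation...
ByteBuffer buffer = ByteBuffer.allocate(histogram.getNeededByteBufferCapacity());
histogram.encodeIntoCompressedByteBuffer(buffer);
buffer.flip();

// ...and decode it back, e.g. on another machine or at a later time.
try {
    Histogram copy = Histogram.decodeFromCompressedByteBuffer(buffer, 0);
    System.out.println(copy.getTotalCount()); // 1
} catch (DataFormatException e) {
    throw new RuntimeException(e);
}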

Latency Measurement

import org.HdrHistogram.Histogram;
import static java.util.stream.IntStream.range;

// Record the time taken by 1 million operations, 2 significant digits of precision.
Histogram histogram = new Histogram(2);
range(0, 1000_000)
        .forEach(v -> {
            long timeTaken = doSomething(v); // doSomething is the operation being measured
            histogram.recordValue(timeTaken);
        });
histogram.outputPercentileDistribution(System.out, 1.0);
The Histogram class is part of the HdrHistogram lib and accepts various parameters. In this example we are specifying the precision to use (the number of significant value digits). This value can range from 0 to 5, and the precision has a direct impact on the size of the histogram.
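Roughly, the two common ways to construct it look like this (the upper bound below is just an example value):

// Auto-resizing histogram with 2 significant decimal digits of precision.
Histogram autoResizing = new Histogram(2);

// Fixed-range histogram: values from 1 up to 1 hour in microseconds,
// 3 significant digits; its storage is allocated up front.
Histogram fixedRange = new Histogram(3_600_000_000L, 3);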

Another important thing about this library is that the histogram will auto-resize based on the input values, and this property plays a big role in avoiding upfront allocation.
Running the measurement snippet above will produce output like below.

Value Percentile TotalCount 1/(1-Percentile)
1.00 0.000000000000 104 1.00
999.00 0.100000000000 100120 1.11
1999.00 0.200000000000 200638 1.25
3007.00 0.300000000000 301219 1.43
3999.00 0.400000000000 400228 1.67
5023.00 0.500000000000 502647 2.00
5503.00 0.550000000000 550255 2.22
6015.00 0.600000000000 601367 2.50
6527.00 0.650000000000 652557 2.86
7007.00 0.700000000000 700720 3.33
7519.00 0.750000000000 751827 4.00
7775.00 0.775000000000 777497 4.44
9023.00 0.900000000000 902416 10.00
9151.00 0.912500000000 915223 11.43
9279.00 0.925000000000 928040 13.33
9407.00 0.937500000000 940782 16.00
9471.00 0.943750000000 947175 17.78
9535.00 0.950000000000 953492 20.00
9727.00 0.971875000000 972748 35.56
9791.00 0.975000000000 979157 40.00
9791.00 0.978125000000 979157 45.71
9919.00 0.990625000000 992103 106.67
9983.00 0.992187500000 998439 128.00
9983.00 0.992968750000 998439 142.22
9983.00 0.993750000000 998439 160.00
9983.00 0.994531250000 998439 182.86
9983.00 0.995312500000 998439 213.33
9983.00 0.996093750000 998439 256.00
9983.00 0.996484375000 998439 284.44
9983.00 0.996875000000 998439 320.00
9983.00 0.997265625000 998439 365.71
9983.00 0.997656250000 998439 426.67
9983.00 0.998046875000 998439 512.00
9983.00 0.998242187500 998439 568.89
9983.00 0.998437500000 998439 640.00
10047.00 0.998632812500 1000000 731.43
10047.00 1.000000000000 1000000
#[Mean = 4999.32, StdDeviation = 2888.20]
#[Max = 10047.00, Total count = 1000000]
#[Buckets = 7, SubBuckets = 256]

Let's try to understand the output. It contains 4 columns (Value, Percentile, Total Count, 1/(1-Percentile)).

Value - input value at the given percentile.
Percentile - cumulative fraction of records at or below this value.
Total Count - cumulative count of records at or below this value.
1/(1-Percentile) - inverse of the remaining tail; e.g. 10 means only 1 in 10 requests is slower than this value.

Let's pick some bucket values to understand how to use them.

Value Percentile TotalCount 1/(1-Percentile)
5023.00 0.500000000000 502647 2.00
9023.00 0.900000000000 902416 10.00
9791.00 0.975000000000 979157 40.00
9919.00 0.990625000000 992103 106.67
9983.00 0.998046875000 998439 512.00
10047.00 0.998632812500 1000000 731.43

50% of requests took 5 seconds - this metric is very commonly quoted, but on its own it is of little use because it behaves like the average/mean. If you want to do some serious measurement, stay away from averages.

97.5% of requests took 9.7 seconds or less - this is where it becomes interesting: the slowest quarter of requests takes roughly 1.5x to 2x the median! That is a really bad user experience.

The requests beyond the 99.86th percentile took 10+ seconds - this is the real worst case: around 1.5K requests fall in this group, and they take double the time of the average request.

Having latency broken down by percentile provides great insight into the real customer experience. If you are an e-commerce shop, every microsecond counts and can translate into a big gain or loss.
Trading applications are also sensitive to latency spikes, where latency makes a big difference in how fast buy/sell trades can be matched.

Another thing that is very common in the microservices world is that rendering a full page requires hundreds of API calls, which adds the risk that at least one of those calls gets hit by worst-case latency, so the overall user experience becomes bad :-(

Next time anyone talks about latency, ask for the 99.9999th percentile to understand the real impact.
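With HdrHistogram, specific percentiles can also be queried directly instead of reading the textual dump (a sketch using the histogram built above):

// Pull out individual percentiles programmatically.
long p50 = histogram.getValueAtPercentile(50.0);
long p999 = histogram.getValueAtPercentile(99.9);
long worst = histogram.getMaxValue();
System.out.printf("p50=%d, p99.9=%d, max=%d%n", p50, p999, worst);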
 

Data cardinality

Databases have optimizers such as rule-based and cost-based ones. A cost-based optimizer generates plans based on the cost of operations like joins, filters, etc., and quickly selects the best plan.
One of the important factors in deciding which plan to use is the data distribution/stats. Each index maintains data cardinality stats so that the optimizer can quickly figure out which index to use.

Cardinality is also useful for finding data skew and coming up with a plan to handle it. Distributed computing frameworks like Apache Spark suffer from skew, and it is up to the engineer to add some intelligence to the code to avoid Spark job failures.

Trust me, real data is always skewed.

Let's take a scenario where an e-commerce giant wants to keep statistics like purchase volume at the merchant level, so that it can come up with better platform scaling options.

A histogram internally maintains frequency counts, so it can be used to calculate total volume at the merchant level in an efficient way.
For this scenario, let's assign a unique integer id to each merchant and record each purchase in the histogram by passing the merchant id.

Some code showing how this can be done.

Histogram histogram = new Histogram(2);
range(0, 1000_000 * 100)
        .forEach(v -> {
            int merchant = merchant();
            histogram.recordValue(merchant);
        });
histogram.outputPercentileDistribution(System.out, 1.0);
In the above example, merchant ids from 100 million transactions are used to build the histogram. To show some real skew, I have given 50% weight to one particular merchant (i.e. merchant id 101).

Running the above code will produce something like below.

 Value     Percentile TotalCount 1/(1-Percentile)

        1.00 0.000000000000       1043           1.00
      101.00 0.100000000000   50050299           1.11
      101.00 0.200000000000   50050299           1.25
      101.00 0.300000000000   50050299           1.43
      101.00 0.400000000000   50050299           1.67
      101.00 0.500000000000   50050299           2.00
    10047.00 0.550000000000   55023464           2.22
    20095.00 0.600000000000   60047899           2.50
    30079.00 0.650000000000   65038139           2.86
    40191.00 0.700000000000   70092127           3.33
    50175.00 0.750000000000   75086627           4.00
    55039.00 0.775000000000   77520056           4.44
 
50% of transactions (i.e. 50 million) belong to merchant 101. This information can be used for optimizing read or write queries, or for coming up with some good product offerings.
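The per-merchant volume can also be read back directly from the histogram (a sketch using the merchant histogram built above; the range query is exact within the histogram's precision):

// Number of recorded purchases whose merchant id equals 101.
long purchasesForMerchant101 = histogram.getCountBetweenValues(101, 101);
System.out.println(purchasesForMerchant101); // ~50 million in this skewed sample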

One thing to remember is that histograms can be merged, so it is possible to compute histograms on multiple nodes and later merge them to get the full view; in fact, it is possible to build a histogram incrementally.
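A sketch of that merge with HdrHistogram (node names and values are illustrative):

// Each node builds its own histogram from the transactions it sees...
Histogram node1 = new Histogram(2);
Histogram node2 = new Histogram(2);
node1.recordValue(101);
node2.recordValue(7_500);

// ...and a central process merges them; the result is the same as if
// every value had been recorded into a single histogram from the start.
Histogram merged = new Histogram(2);
merged.add(node1);
merged.add(node2);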

Next time data skew is troubling you, try a histogram.

Frequency Density 

This use case takes histograms to the next level. Let's extend the e-commerce example: now we want to rank customers based on the total dollar value they spend.

A histogram comes in handy: we just have to feed in the per-customer total spend and get back buckets that can be used for ranking customers.

Histogram histogram = new Histogram(2);
range(0, 1000_000 * 100)
        .forEach(v -> {
            long spend = spendValue();
            histogram.recordValue(spend);
        });
histogram.outputPercentileDistribution(System.out, 1.0);


Again we take 100 million purchases, this time recording the spend value. This creates buckets based on spend, and customers can later be ranked using these buckets.

The bucket output will look something like below.

 Value     Percentile TotalCount 1/(1-Percentile)

        0.00 0.000000000000       4963           1.00
     2007.00 0.100000000000   10036946           1.11
     4015.00 0.200000000000   20077808           1.25
     6015.00 0.300000000000   30075647           1.43
     8031.00 0.400000000000   40157837           1.67
    10047.00 0.500000000000   50238571           2.00
    11007.00 0.550000000000   55038969           2.22
    12031.00 0.600000000000   60159541           2.50
    13055.00 0.650000000000   65279759           2.86
    14015.00 0.700000000000   70077894           3.33
 
This output can be interpreted as: 10 million customers spend in the range of 0 to 2K, another 10 million spend 2K to 4K, and so forth. The higher buckets will contain the premium customers.
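The same bucket view can be pulled out programmatically, for example in fixed 2K-wide spend bands (a sketch using the spend histogram built above):

import org.HdrHistogram.HistogramIterationValue;

// Walk the histogram in fixed 2000-unit spend bands.
for (HistogramIterationValue band : histogram.linearBucketValues(2000)) {
    System.out.printf("spend <= %d : %d customers in this band%n",
            band.getValueIteratedTo(),
            band.getCountAddedInThisIterationStep());
}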

Histograms have many more applications; they can be used in machine learning to build features because they allow various vector-based arithmetic like add/subtract.

All the code used in this post is available @ the measure github repo.