Tuesday, 20 October 2020

Histogram is not Instagram

Histogram is very popular technique to represent distribution of numerical data. This representation can be used for various thing like measuring latency of request, cardinality of data , grouping data in bins , frequency density of values.







In this post i will share some of the common use case with histogram.

Measuring latency

Assume that we want to measure response time of API call. First thing that comes to mind is stopwatch! Code snippet will look something like this. 

StopWatch stopWatch = new StopWatch();
stopWatch.start();

//Some time consuming task

stopWatch.stop();

System.out.println(stopWatch.toString());

This is good starting point but does not take any further :-) For serious measurement we need more data points like best response time , worst  , response time at some percentile like how much 90 % of request are taking. Such type of summary will allow us to understand service latency in more details also enable to take better decision. If data is available at different percentile then it can be plotted like below to understand it better.


For some consumer facing service, slow response means bad user experience and loss of revenue. Some of the industry like ecommerce, video streaming companies tracking response time proactively to retain customers.


Data cardinality

Tracking cardinality of some dimension is very important in data intensive application. It can be used for identifying skew in data or for finding optimal plan for execution of read or write request. Once of such examples is Index statistics that is maintained by databases. Database cost based optimizer will use index stats to decided which index will result in faster execution. 

For time series database tracking distribution of data based on day of time will be useful to find peek/busy time and it can used that information for having custom partition logic to manage load. Na├»ve way of doing this will be using frequency map but it will soon run into various performance issue and also limit options to do range query. 

Histogram will come very handy for such scenario. Below charts shows distribution by time.



Frequency density

This is popular use case in data science community. For example e-commerce company will be interested in know spend bins of customer or restaurant will be interested in knowing about frequency of visits. 

These metrics can help business in know about wealthy customers and come up with better strategy to retain or get new customers. Below example from Investopedia  contains people density by age, very useful information for portfolio manager.



Histogram has many more interesting use case. One of the cool property about histogram is that it is Commutative. Items order makes no different in final output due to which it is possible to build histograms in parallel and merge it later.

Another important property of histogram is cumulative, this property allows to do cumulative calculation by merging multiple bins/buckets.


Code Snippet 

Now lets look at one of the histogram library that can be used for various use case.

HdrHistogram by Gil Tene is battel tested library build for measuring latency of application. I found this library to be useful for other use case also. This library is very memory efficient and designed to work in low latency environment and also has very compact loss less serialization.

Latency Measurement

Histogram histogram = new Histogram(2);
range(0, 1000_000)
.forEach(v -> {
long timeTaken = doSomething(v);
histogram.recordValue(timeTaken);
});
histogram.outputPercentileDistribution(System.out, 1.0);
Histogram class is part of HdrHistogram lib, it accepts various parameter. In this example we are specifying precision to use. This value can range in 0 to 5, precision has direct impact on size of histogram.

Another important thing about this library is that histogram will auto resize based on input value and this property plays big role in avoiding upfront allocation.
Running above code will produce output like below 


Lets try to understand output. Output contains 4 columns ( Value , Percentile, Total Count, 1/1-Percentile)

Value - Input value for given percentile 
Percentile - % of records in this group/bucket.
Total Count - Count of records in bucket.

Lets pick some bucket values to understand how to use it.


50% request took 5 Seconds - This metrics is very common when measuring but of no use because it is like average/mean. If you want to do some serious measurement then stay away from average.

97.5% request took 9.7 seconds - This is where it becomes interesting, 25% of request is taking double! It is really bad user experience. 

99.86% request took 10+ seconds - This provides real worst case, around 200K+ request is part of this group and it is takes double time as compared to average. 

Having latency broken down by percentile provide great insight into what is real customer experience. If you are e-commerce shop then every micro second counts and can translate into big gain or loss. 
Trading application also sensitive to latency spike and makes big difference on how fast buy/sell trades can be matched.

Another thing that is very common in micro services world is that to render full page 100s of API call is required and it adds risk that atleast one of the API call to get hit by worst case latency, so overall user experience become bad:-(

Next time anyone talks about latency then ask for 99.9999 percentile to know about real impact.
 

Data cardinality

Database has optimizers like rule based and cost based. Cost based optimizer generate plan based on cost of operations like Join, filter etc. and select best plan quickly. 
One of the important factor in deciding which plan to use is based on data distribution/stats. Each index maintains data cardinality stats so that optimizer can quickly figure out which index to use. 

Cardinality is also useful to find data skew and to come up with plan to handle data skew. Distributed computing framework like Apache Spark suffer from skew and it depends on engineer to add some intelligence in code to avoid Spark job failure.

Trust me real data is always Skewed. 

Lets take a scenario that an E-commerce Giant want to keep some statistics like purchase volume at merchant level, so that it can come up better platform scaling options.

Histogram internally maintains frequency count, so it can be used to calculate total volume at merchant level in efficient way.
For this scenario lets assign unique integer id to each merchant and record purchase in histogram, for recoding purchase we can pass merchant id in histogram.

Some code on how this can be done. 

Histogram histogram = new Histogram(2);
range(0, 1000_000 * 100)
.forEach(v -> {
int merchant = merchant();
histogram.recordValue(merchant);
});
histogram.outputPercentileDistribution(System.out, 1.0);
In above example merchant id from 100 Million transaction are used to build histogram. To show some real skew i have added 50% weight to one particular merchant (i.e merchant id 101).

Running above code will produce something like below

 Value     Percentile TotalCount 1/(1-Percentile)

        1.00 0.000000000000       1043           1.00
      101.00 0.100000000000   50050299           1.11
      101.00 0.200000000000   50050299           1.25
      101.00 0.300000000000   50050299           1.43
      101.00 0.400000000000   50050299           1.67
      101.00 0.500000000000   50050299           2.00
    10047.00 0.550000000000   55023464           2.22
    20095.00 0.600000000000   60047899           2.50
    30079.00 0.650000000000   65038139           2.86
    40191.00 0.700000000000   70092127           3.33
    50175.00 0.750000000000   75086627           4.00
    55039.00 0.775000000000   77520056           4.44
 
50% of transactions (i.e 50 Million) are part of 101 merchant. This information can be used for optimization of read query or write query or coming up with some good product offerings.

One thing to remember that histograms can be merged, so it is possible to calculate histogram using multiple nodes and later merge it to get full view, infact it is possible to build histogram incrementally. 

Next time if data skew is troubling you then try Histogram.

Frequency Density 

This use case takes histogram to next level. Lets extend E-commerce example and now we want to rank customers based on total dollar value they spend.

Histogram comes handy, we have to just feed customer wise total spend and get the buckets that can used for ranking customer. 

Histogram histogram = new Histogram(2);
range(0, 1000_000 * 100)
.forEach(v -> {
long spend = spendValue();
histogram.recordValue(spend);
});
histogram.outputPercentileDistribution(System.out, 1.0);


Again taking 100 million purchase and this time adding spend value, this will create buckets based on spend and later customers can be ranked using these buckets.

Buckets output will look something like below.

 Value     Percentile TotalCount 1/(1-Percentile)

        0.00 0.000000000000       4963           1.00
     2007.00 0.100000000000   10036946           1.11
     4015.00 0.200000000000   20077808           1.25
     6015.00 0.300000000000   30075647           1.43
     8031.00 0.400000000000   40157837           1.67
    10047.00 0.500000000000   50238571           2.00
    11007.00 0.550000000000   55038969           2.22
    12031.00 0.600000000000   60159541           2.50
    13055.00 0.650000000000   65279759           2.86
    14015.00 0.700000000000   70077894           3.33
 
This output can be interpreted like 10 million customer spends in range of 0 to 2K, another 10 million spends 2K to 4K and so forth. Higher bucket will contain premium customers.    

Histogram has many more application, it can be used in machine learning to build features because they allow various vector based arithmetic like add/substract. 

All the code used in this post is available @ measure github

No comments:

Post a Comment