Today, crunching terabytes of data is common, and it brings up interesting challenges.
Memory is getting cheaper, which enables applications to keep all their data in memory and process it there. But some data sets are so large that they do not fit on a single machine, and even when a data set does fit in memory, the latency of processing it may not be acceptable.
Let's take an example to understand this.
Suppose you are building a web analytics product that collects events such as clicks and page loads; each event object carries attributes like the IP address, visitor id, and page requested.
The volume generated by this type of tracking is huge, and asking any interesting question of such a data set is resource intensive.
Let's look at some of these resource-intensive questions:
- How many distinct IP addresses are present?
- How many distinct visitor ids are present?
- What are the top 10 pages?
- Who are the top X visitors?
- How many times were the top X pages requested?
- Was any page requested by a blacklisted IP or user?
A simple way to answer a distinct-count query is to build a set and count the items in it.
Building the set takes memory proportional to the number of distinct elements, and a set keeps a load factor below 1 so that read/write operations stay constant time. So memory-wise it is very expensive.
The key insight is to ask how accurate the distinct count really needs to be. If some error (say 5 to 10%) is acceptable, then some magic can be done.
Since this problem is only about the count, the algorithm does not have to keep the actual values, which gives a huge saving; it just needs some way to track whether the current value has been seen before.
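For reference, the exact-set approach looks like this. Every distinct value is stored in full, so memory grows linearly with the number of distinct elements (the class name here is my own, just for illustration):

```java
import java.util.HashSet;
import java.util.Set;

// Exact distinct count: always correct, but memory grows with the number
// of distinct values because every value is stored in full.
public class ExactDistinct {
    public static void main(String[] args) {
        String[] ips = {"10.0.0.1", "10.0.0.2", "10.0.0.1", "10.0.0.3"};
        Set<String> seen = new HashSet<>();
        for (String ip : ips) {
            seen.add(ip); // duplicate values are ignored by the set
        }
        System.out.println(seen.size()); // prints 3
    }
}
```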
At a high level the algorithm is:
- Compute the hash of the value.
- Find the corresponding index in a bit vector.
- If the bit is not set, set it and increment the distinct counter.
It is a very simple algorithm and does the job; it is called Linear Counting.
This works well, but it has a limitation: you need to work out the size of the bit vector, which controls accuracy, and for best results it should be close to the number of distinct values you expect to see.
There is another class of algorithms, called LogLog/HyperLogLog, that is more powerful than Linear Counting.
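The steps above can be sketched as a small bit-vector counter. This is a hypothetical class of my own, not the stream-lib implementation; note that the classic Linear Counting estimator additionally corrects for hash collisions with n ≈ -m·ln(V), where V is the fraction of bits still zero:

```java
import java.util.BitSet;

// A minimal Linear Counting sketch (illustrative; class and method names
// are assumptions, not stream-lib's API). Accuracy depends on choosing m
// close to the expected number of distinct values.
public class LinearCounter {
    private final BitSet bits;
    private final int m; // number of bits in the vector

    public LinearCounter(int m) {
        this.m = m;
        this.bits = new BitSet(m);
    }

    public void offer(String value) {
        int index = Math.floorMod(value.hashCode(), m); // hash -> bit position
        bits.set(index);
    }

    // Linear Counting estimate: n ≈ -m * ln(V), where V is the
    // fraction of bits that are still zero.
    public long count() {
        int zeros = m - bits.cardinality();
        if (zeros == 0) return m; // vector saturated; m was chosen too small
        return Math.round(-m * Math.log((double) zeros / m));
    }
}
```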
The hash value is treated as a binary string and split into two parts:
- One part is used to select which bucket the value goes to.
- The other part is used to compute a value for that bucket, by counting the number of leading zeros.
With this technique, entries land in different buckets; for example, values whose hash starts with 01 go in bucket X, and values starting with 001 go in bucket Y.
To compute the final estimate, the values of all the buckets are combined by an averaging formula with a correction factor.
By using this technique, billions of elements can be counted using very little memory.
The HyperLogLog paper contains the full details of the algorithm.
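The hash-splitting step can be sketched as follows (parameter names here are assumptions for illustration, not from the paper): with p index bits, the top p bits of a 64-bit hash pick one of 2^p buckets, and the rank is the position of the first 1-bit in the remaining bits:

```java
// Illustrative hash split for LogLog-family sketches: the top P bits pick
// the bucket, the remaining bits yield a "rank" (leading-zero run + 1).
public class HllSplit {
    static final int P = 14;      // index bits: 2^14 = 16384 buckets
    static final int M = 1 << P;  // number of buckets

    static int bucket(long hash) {
        return (int) (hash >>> (64 - P)); // top P bits
    }

    static int rank(long hash) {
        long rest = hash << P; // drop the index bits
        // position of the first 1-bit; capped for the all-zero case
        return Math.min(Long.numberOfLeadingZeros(rest) + 1, 64 - P + 1);
    }
}
```

Each bucket keeps the maximum rank it has seen, and the final estimate is a bias-corrected harmonic mean of 2^rank across all buckets.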
Linear Counting or HyperLogLog can be used to answer questions like:
- How many distinct IP addresses are present?
- How many distinct visitor ids are present?
- Distinct visitors by region, time, etc.
- Trending words, as in Google Trends.
- Near-duplicate document matching to remove duplicates, as used by search engines.
These data structures can also be used for quick anomaly/fraud detection.
They can also help with nested-loop joins; Spark uses HyperLogLog in MLlib for MatrixFactorizationModel and in a few other places to count keys in an RDD.
One of the nice properties of these data structures is that they support a merge operation.
For example, one HyperLogLog can be maintained for unique visitors for each day, and the daily sketches can be merged to get the value for any time range.
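The merge works because each bucket stores only the maximum rank it has seen, so taking the element-wise max of two register arrays yields the sketch of the union of the two streams. A minimal sketch of the idea (registers shown as plain int arrays; this is not the stream-lib API):

```java
// Merging two HyperLogLog sketches reduces to an element-wise max over
// their register arrays: the union keeps the longest zero-run per bucket.
public class HllMerge {
    static int[] merge(int[] day1, int[] day2) {
        int[] out = new int[day1.length];
        for (int i = 0; i < day1.length; i++) {
            out[i] = Math.max(day1[i], day2[i]);
        }
        return out;
    }
}
```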
Let's look at some memory-usage numbers, computed using stream-lib.
In this benchmark I used all the words from the Harry Potter series of books.
| Type | Memory (KB) | Count | Error % |
|------|-------------|-------|---------|
| Actual count | – | 11983 | – |
| Set | 431 | 11983 | 0 |
| LinearCounting | 31 | 12167 | 1.5 |
| HyperLogLog | 14 | 12253 | 2.2 |
With a tiny fraction of the memory and about 2% error, these data structures do a very cool thing!
Another interesting data structure for membership queries is the Bloom filter; I wrote about it in the mr-cool-bloom-filter blog post.
The code used for benchmarking is available at DistinctCountingApp.java.