Hash-code-based data structures are the basis of many algorithms; they are used for fast lookups, membership queries, group-by operations, and so on.
Hash codes are also used to build some interesting classes of data structures. In this blog I will share one such data structure that has many useful applications in the real world.
Take an example where you have millions or billions of items and you want to keep track of the distinct ones.
This is a very simple problem and can be solved with a HashMap, but that requires a huge amount of memory, depending on the number of distinct items, because it has to store the actual elements as well.
To make this problem a little more interesting we can put a memory constraint on the solution, so effectively we need a space-efficient way to test the membership of an item.
Nothing comes for free in this world, so with less memory we have to trade off accuracy, and in some cases that is fine, e.g. unique visitors on a website, distinct pages visited, etc.
The first intuition is to solve this problem by allocating N buckets and marking a bucket based on the hash code; there is no need to store the actual value, just mark the slot index.
A quick question that comes to mind is what happens in case of a collision. In that case our result will be wrong, but only by some percentage, and there are a couple of ways to improve the result.
- Allocate more buckets but use a single hash function.
- Use multiple hash functions, where each hash function has its own dedicated set of buckets.
The multiple hash function option works very well; with it, the result can be around 96% correct, depending on how the buckets are sized!
So if we use multiple hash functions, the data structure will look something like below.
This data structure is called a Bloom filter. It maintains a bit vector of X bits and applies several hash functions to the input value to mark bits.
To check whether a value already exists, just apply the same hash functions and check whether the corresponding bits are all set.
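The add and check steps described above can be sketched as follows. This is a minimal illustration, not the implementation referenced later in this post: the class name `PartitionedBloomFilter`, its parameters, and the use of seeded SHA-256 as a stand-in for a family of hash functions are all assumptions made for the example.

```python
import hashlib


class PartitionedBloomFilter:
    """Bloom filter in which every hash function owns a separate bit vector."""

    def __init__(self, bits_per_vector, num_hashes):
        self.bits_per_vector = bits_per_vector
        self.num_hashes = num_hashes
        # One bit vector per hash function, packed 8 bits per byte.
        self.vectors = [
            bytearray((bits_per_vector + 7) // 8) for _ in range(num_hashes)
        ]

    def _index(self, item, seed):
        # Assumption: salting SHA-256 with a seed stands in for a family of
        # independent hash functions.
        digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.bits_per_vector

    def add(self, item):
        # Mark one bit in each vector; the item itself is never stored.
        for seed, vector in enumerate(self.vectors):
            i = self._index(item, seed)
            vector[i // 8] |= 1 << (i % 8)

    def might_contain(self, item):
        # False means "definitely absent"; True means "probably present".
        for seed, vector in enumerate(self.vectors):
            i = self._index(item, seed)
            if not vector[i // 8] & (1 << (i % 8)):
                return False
        return True


visitors = PartitionedBloomFilter(bits_per_vector=10_000, num_hashes=3)
visitors.add("user-42")
print(visitors.might_contain("user-42"))  # True: an added item is always found
```

A cryptographic hash is slower than necessary here; it is used only because it ships with the standard library and spreads values evenly.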
In terms of memory, a bit vector of length X is required, and to get more accurate results multiple hash-based bit vectors can be used. Some quick math gives a fair idea of the memory requirement.
A Bloom filter is very compact in terms of memory, and its size can be controlled depending on the requirement.
Memory can be further reduced by using a compressed bit vector.
It is important to know the trade-offs made to achieve this efficiency.
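To make the memory math concrete, here is a small sketch using the standard Bloom filter sizing formulas, m = -n ln(p) / (ln 2)^2 bits and k = (m / n) ln 2 hash functions, for n expected items and a target false positive rate p. The function name `bloom_size` and the example numbers are assumptions for illustration, and the formulas assume one shared bit vector.

```python
import math


def bloom_size(expected_items, false_positive_rate):
    """Return (total bits m, number of hash functions k) for a target error rate."""
    m = -expected_items * math.log(false_positive_rate) / (math.log(2) ** 2)
    k = (m / expected_items) * math.log(2)
    return math.ceil(m), max(1, round(k))


# One million distinct items with a 1% false positive rate.
bits, hashes = bloom_size(1_000_000, 0.01)
print(f"{bits / 8 / 1024 / 1024:.2f} MiB, {hashes} hash functions")
# prints "1.14 MiB, 7 hash functions"
```

That works out to roughly 9.6 bits per item regardless of how large the items themselves are, which is where the saving over storing the actual elements comes from.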
- It is a probabilistic data structure, which means results are not 100% correct. The interesting thing about this data structure is that it can give a FALSE POSITIVE but never a FALSE NEGATIVE, which makes it a good fit for many practical applications.
- Multiple hash functions are used, so read/write performance depends on the cost of the hash functions.
- DB Query filter
This is one of the most common use cases. It is useful for avoiding queries to the DB for keys that do not exist.
- Joins
Yes, Bloom filters provide an alternative way to do joins, especially in distributed systems. Build a Bloom filter on one of the datasets (i.e. the smaller one) and send it to the other nodes for joining; since it is very compact in memory, transporting it to a remote system does not add a big overhead.
Spark, one of the disruptive big data processing systems, uses it for joins.
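The join idea can be sketched in a single process as below. This is only an illustration of the technique, not how Spark implements it: the function names, the filter size, and the seeded SHA-256 hashing are assumptions, and the "shipping" of the filter to another node is represented by simply passing an integer bitmask around.

```python
import hashlib

NUM_BITS = 1 << 16  # illustrative filter size
NUM_HASHES = 3      # illustrative number of hash functions


def bit_positions(key):
    """Yield the bit positions that represent `key` in the filter."""
    for seed in range(NUM_HASHES):
        digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
        yield int.from_bytes(digest[:8], "big") % NUM_BITS


def build_filter(keys):
    """Pack the join keys of the smaller dataset into one integer bitmask."""
    bits = 0
    for key in keys:
        for pos in bit_positions(key):
            bits |= 1 << pos
    return bits


def might_contain(bits, key):
    return all((bits >> pos) & 1 for pos in bit_positions(key))


def bloom_join(small, large):
    """Join two lists of (key, value) pairs, pre-filtering `large` with the filter."""
    key_filter = build_filter(key for key, _ in small)  # the compact part to ship
    lookup = {}
    for key, value in small:
        lookup.setdefault(key, []).append(value)
    joined = []
    for key, value in large:
        if not might_contain(key_filter, key):
            continue  # definitely no match, so the row is dropped early
        # The exact lookup removes any false positives the filter let through.
        for small_value in lookup.get(key, []):
            joined.append((key, small_value, value))
    return joined


users = [("u1", "Alice"), ("u2", "Bob")]
events = [("u1", "click"), ("u3", "view"), ("u2", "buy")]
print(bloom_join(users, events))
# [('u1', 'Alice', 'click'), ('u2', 'Bob', 'buy')]
```

In a distributed setting the saving comes from the early `continue`: rows that cannot match are discarded where they live instead of being sent over the network.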
- Alternative to B-Tree
A B-Tree can also be used for membership queries, but if the B-Tree becomes so large that it can't fit in memory, all the benefit is lost.
Cassandra, another big data system, uses it for read requests by keeping a Bloom-filter-type data structure on top of data blocks to avoid IO operations.
Parquet, another open source project in the big data space, uses it for fast filters.
- Partition Search
This is just another variation of the above use case. Apache Druid (druid.io), which is again in the big data space for fast analytics, uses it by partitioning data by time on ingestion; in each partition, a column has a Bloom-filter-type index for fast filters.
- Safe Website filter
This is an interesting one: Google Chrome has used a Bloom filter to find out whether a site is safe or not.
A simple implementation of a Bloom filter is available on GitHub.
In this implementation each hash function has an independent bit vector, but a single common vector can be used for all the hash functions.
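For comparison, the single common vector variant mentioned above might look like the sketch below. It is not the GitHub implementation; the class name `SharedBloomFilter`, the sizes, and the seeded SHA-256 hashing are assumptions for illustration. The usage at the end also checks the "false positives but never false negatives" property on some made-up keys.

```python
import hashlib


class SharedBloomFilter:
    """Bloom filter in which all hash functions mark bits in one shared vector."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Assumption: seeded SHA-256 stands in for independent hash functions.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


bloom = SharedBloomFilter(num_bits=10_000, num_hashes=7)
seen = [f"page-{i}" for i in range(1_000)]
for page in seen:
    bloom.add(page)

# Every added item is reported as present: no false negatives.
print(all(bloom.might_contain(page) for page in seen))  # True

# Items that were never added are occasionally reported as present.
unseen = [f"other-{i}" for i in range(10_000)]
false_positive_rate = sum(bloom.might_contain(x) for x in unseen) / len(unseen)
print(false_positive_rate)
```

With about 10 bits per item and 7 hash functions, the measured rate should land in the region of 1%, though the exact figure depends on the keys and hash functions used.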