Are you ready: August 2015

Thursday, 27 August 2015

Indexes in Embedded database

Embedded database are very important part of application that has micro/mill second latency requirement,.
Most of the embedded database has very good read performance and has specialized types indexes to support fast read.

Reads is all about having index on field and since we are in the era where "ONE SIZE FITS ALL" theory does not work, so different types of indexes are required for different types of read.

Lets gets cracking on few types of index that is required for fast read.

Hash Index
This is very basic type of index that can be implemented using HashMap, so to build such type index just choose the field and use that as key of HashMap and you are done, very good when lookup is going to based on Key.

Unique Index
It is just another variant of hash index but Set is used to maintain distinct values and this can be used to answer queries that needs distinct values or want to check if value exists or not.

Range Index
Now it starts to become little interesting because criteria is based on range of values for e.g
- greaterthan XX and lessthan YY
- greaterthan XX
- lessthan XX

Tree based index comes to rescue , all such type of queries can be answered by B-Tree data structure.

- Starts With or Contains
This is similar to LIKE type of query in database for e.g name like 'A%' or name like '%A%'
Trie data structure can be used to answer such query.

I have tried to build simple library to support above types of query but i have left concurrency aspect to keep solution simple.

Code is available @ github

Sample application has 3 types of index

Hash - For exact value based lookup on some attribute, code available @ HashDataIndex

Tree - This is useful for range queries for eg query on date field, code available @ TreeDataIndex

Contains - This is useful for doing LIKE based search, suffix Tree data structure is used for this, code available @ ContainsDataIndex

Conclusion
- Memory usage is very important factor for such type of API and java collections are not best at it , so for production type of usage memory efficient collection should be used.

- Storing raw data as java object will not give good performance for big JVM, so having compact/memory efficient representation is required as java heap grows.

-Last one that RAM has limit so it is impossible to keep raw data or index data in memory and that makes it interesting problem to solve, so some kind of partition strategy is required for data & index and only required partition of data is kept online in memory, there are many open source library that allows to store data using memory mapped files.