Tuesday, 8 December 2020

Disk storage algorithm

 This is follow up post on rethinking-key-value-store article to explore storage part of system.

Many data systems support plugin based storage layers and it opens a whole set of options to use one from a shelf or build one that suits your needs.

In this post i will share how a new storage system can be built and later it is used for building time series application on top of it.

Before we go deep in disk based algorithm, let's look at why non-volatile storage is required.

In today times when machines with Terabytes RAM are available why do we have to bother to store stuff on disk ? 

Couple of good reasons why still having good non-volatile storage manager makes sense today.

  • It is cheap

Disk is very cheap as compared RAM, so it does not make sense to store data in expensive store especially when data is not being used frequently. Lots of cloud provider bill can saved! 

  • It is unlimited

Although a machine with big RAM is available but it is still limited , it will not continue get bigger at the same rate as in the past and if we want applications to have the illusion that it has unlimited memory then flushing to disk is a must.  

  • Allow fast restarts
Think what will happen if application crash ? Application has to rebuild the whole state and it could take very long time before application is available again to serve request. Saving computed data to disk and restoring from it will be way faster.

  • Information exchange 

How do two application running on different machine can communicate ? For inter application communication in-memory data has to written in wire format so that it can be sent over network.

and many more reasons..

Application has volatile & non-volatile area and storage manager sits in middle of that (.ie RAM and Disk) and provide efficient access to data.

RAM and DISK are very different types of hardware and access patterns are also very different.

On one hand RAM can be accessed randomly and it is fast for both read/write.

Disks are accessed sequentially using blocks and very slow write and slow read, SSD has improved the access time but sequential access is the recommended to get best performance.

Storage managers have to use efficient data structure on disk to get best performance, another thing is that disk has nothing like malloc to manage allocation. Everything is bare metal and the application developer has to manage allocation, garbage collector, locks etc.

Disk read/write access is based on a block which is usually 4 KB, but memory read/write is based on a cacheline which is 64 Bytes, just this difference in read/write size requires new ways of organizing data to get the best out of the device.  

All the above problems make writing disk based algorithms very challenging.

Lets look at some options of storing data on disk.

Generally file on disk looks some thing like below, each block is of fix sized and it depends on hardware, most of the vendors use 4 KB blocks. IO device provide atomic read/write guarantee at block level. 

Page Layout

Lets unpack disk block to explore options to store data.

Fixed Size

Fixed size data is very common and intuitive way to store data in block provided underlying data is like that and mostly applicable for number variants data type like ( byte, short, int, long , float & double). It is possible to make it work for varchar but padding is required to achieve this. If underlying data can be mapped to fixed size value then this is best option.

Fixed size has good random access property, just by doing simple multiplication specific record can be found for eg to find 3rd record we will use record * sizeof(record) (i.e 3 * 4) to find the offset of data and read it. 
Most of the application has variable record size due to which more flexible storage layout is required. 

Size Prefix

In this approach every record is prefixed with 4 Byte size and followed with data.
This has overhead of extra 4 Bytes and sometime this can be more than actual data and other thing that is not good is that it is sequential access, if last record is required then it requires to scan full page.
One more downsize is what happens when records are updated ? this will cause overflow or fragmentation. 


This is good for queue based system where write is always at the end and read is also large sequential scan. 

Slotted Page

This approach is hybrid one and takes advantage of both fixed and size prefix page.

Slotted page is transformation of Size Prefix page to co-locating related data, for eg all data is together and all size is together.

Single page contains 2 Region
  • Header Region
This section contains some metadata about the page that include version, page id , hash, number of records , compression flag etc. 
  • Data Region
Data section is subdivided in data segment & index segment. Index segment is also called as Slot Array and it can be 4 byte or 2 byte fixed size value, it contains pointer to data segment.
Data Segment is written from left and Slot Array is written from right side. Page is considered full once no space is available for either data segment or Slot.

This approach gives random access to records by using Slot array, every record can be addressed by (PageId, Record Id). Full file content can be seen as a 2 dimensional array.

Slotted page is a very popular layout for many databases. This also allows to build sparse or dense indexes based on page & slot.

Disk Data Structure 

Now we will explore how smallest unit of storage (i.e. page) can be taken to build some data structures on top of it and finally some application using disk based data structure.

Remember disk has no malloc API, so we have to build something like pagealloc that will enable dynamic allocation of blocks/page.

Page Allocator interface is an low level API and API looks something like below.

public interface PageAllocator {
WritePage newPage();

long commit(WritePage page);

ReadPage readByPageId(int pageId);

int noOfPages();

int pageSize();

byte version();

List<PageInfo> pages();

ReadPage readByPageOffset(long offSet);


Application - Time Series Database

Using Page allocator abstraction we will build Time series database that will use Sorted String table as underlying store.

SSTable stores immutable rows that are ordered by some key in files. SSTable is basis for Log structured merge tree that is alternative to B+Tree.

Log structured merge tree powers many popular data stores like BigtableHBaseLevelDB RocksDBWiredTiger, CassandraInfluxDB and many more.


SSTable is made of couple of ordered memory maps & ordered rows on disk. Storage manage sits right in middle to manage sorted structures on disk & memory.


All the write requests are handled by writing to In-Memory ordered map and once those maps are full then get converted to read only In-Memory maps and periodically flushed to disk for durability.  

Writing to such a system is very fast because it is done using in-memory data structure. 


Readers is where this gets more interesting because now read has to hit multiple data structures to find records. First it scans mutable map, then immutable maps and finally on the disk.
Rather than doing a single seek it has to do multiple seeks but since all the data structure be on memory or disk is sorted, so requests can be processed in LOG N time.

Over a period of time a number of files can grow, so a compaction process is required that will merge multiple sorted files and create a small number of files. This compaction process is what makes SSTable as Log structured merge tree.

Some code !
To have some thing working i used ConcurrentSkipListMap from JDK to create In-Memory ordered map and use PageAllocator to flush ordered map to disk.


public interface SortedStringTable<V> {

void append(String key, V value);

void iterate(String from, String to, Function<V, Boolean> consumer);

// API for saving SST table for persistence storage
Collection<PageRecord<V>> buffers();

void remove(int pageId);

void flush();
Working SSTable code can be found @ sst github.

First data structure is ready for our time series database :-)

Time Series 

Time series application will be built on top of SSTable.

Timeseries interface is simple, it looks something like below.

public interface TimeSeriesStore {

<T> EventInfo insert(T row);

void gt(LocalDateTime fromTime, Function<EventInfo, Boolean> consumer);

void lt(LocalDateTime toTime, Function<EventInfo, Boolean> consumer);

void between(LocalDateTime startTime, LocalDateTime endTime, Function<EventInfo, Boolean> consumer);

void flush();

Time series application code can be found @ timeseries repo.

To experiment with some some real time series data, i picked up sample data from Jan Yellow Taxi Trip and loaded in the app. yellow_tripdata_2020-01 has 6+ Million records.

Sample time series application using this data can be found @ NYTaxiRides.java

All the code has good unit test coverage, so feel free to hack and learn.


Disk based algorithm are very cool and understanding it gives good idea about how modern data systems work. You might not build data system from scratch but knowing these algorithm will definitely help in deciding which data system to pick based on trade off.

No comments:

Post a Comment