This is follow up post on rethinking-key-value-store article to explore storage part of system.
Many data systems support plugin based storage layers and it opens a whole set of options to use one from a shelf or build one that suits your needs.
In this post i will share how a new storage system can be built and later it is used for building time series application on top of it.
Before we go deep in disk based algorithm, let's look at why non-volatile storage is required.
In today times when machines with Terabytes RAM are available why do we have to bother to store stuff on disk ?
Couple of good reasons why still having good non-volatile storage manager makes sense today.
- It is cheap
Disk is very cheap as compared RAM, so it does not make sense to store data in expensive store especially when data is not being used frequently. Lots of cloud provider bill can saved! - It is unlimited
Although a machine with big RAM is available but it is still limited , it will not continue get bigger at the same rate as in the past and if we want applications to have the illusion that it has unlimited memory then flushing to disk is a must.
- Allow fast restarts
Think what will happen if application crash ? Application has to rebuild the whole state and it could take very long time before application is available again to serve request. Saving computed data to disk and restoring from it will be way faster.
- Information exchange
How do two application running on different machine can communicate ? For inter application communication in-memory data has to written in wire format so that it can be sent over network.
and many more reasons..
Application has volatile & non-volatile area and storage manager sits in middle of that (.ie RAM and Disk) and provide efficient access to data.
RAM and DISK are very different types of hardware and access patterns are also very different.
On one hand RAM can be accessed randomly and it is fast for both read/write.
Disks are accessed sequentially using blocks and very slow write and slow read, SSD has improved the access time but sequential access is the recommended to get best performance.
Storage managers have to use efficient data structure on disk to get best performance, another thing is that disk has nothing like malloc to manage allocation. Everything is bare metal and the application developer has to manage allocation, garbage collector, locks etc.
Disk read/write access is based on a block which is usually 4 KB, but memory read/write is based on a cacheline which is 64 Bytes, just this difference in read/write size requires new ways of organizing data to get the best out of the device.
All the above problems make writing disk based algorithms very challenging.
Lets look at some options of storing data on disk.
Generally file on disk looks some thing like below, each block is of fix sized and it depends on hardware, most of the vendors use 4 KB blocks. IO device provide atomic read/write guarantee at block level.
Page Layout
Lets unpack disk block to explore options to store data.
Fixed Size
Fixed size has good random access property, just by doing simple multiplication specific record can be found for eg to find 3rd record we will use record * sizeof(record) (i.e 3 * 4) to find the offset of data and read it.
Size Prefix
%3CmxGraphModel%3E%3Croot%3E%3CmxCell%20id%3D%220%22%2F%3E%3CmxCell%20id%3D%221%22%20parent%3D%220%22%2F%3E%3CmxCell%20id%3D%222%22%20value%3D%22%22%20style%3D%22rounded%3D1%3BwhiteSpace%3Dwrap%3Bhtml%3D1%3BfillColor%3D%23fff2cc%3BstrokeColor%3D%23d6b656%3B%22%20vertex%3D%221%22%20parent%3D%221%22%3E%3CmxGeometry%20x%3D%22890%22%20y%3D%22385%22%20width%3D%22190%22%20height%3D%22130%22%20as%3D%22geometry%22%2F%3E%3C%2FmxCell%3E%3C%2Froot%3E%3C%2FmxGraphModel%3E
This is good for queue based system where write is always at the end and read is also large sequential scan.
Slotted Page
- Header Region
- Data Region
Disk Data Structure
public interface PageAllocator {
WritePage newPage();
long commit(WritePage page);
ReadPage readByPageId(int pageId);
int noOfPages();
int pageSize();
byte version();
List<PageInfo> pages();
ReadPage readByPageOffset(long offSet);
}
Application - Time Series Database
SSTable
Writers
Readers
public interface SortedStringTable<V> {
void append(String key, V value);
void iterate(String from, String to, Function<V, Boolean> consumer);
// API for saving SST table for persistence storage
Collection<PageRecord<V>> buffers();
void remove(int pageId);
void flush();
}
Time Series
public interface TimeSeriesStore {
<T> EventInfo insert(T row);
void gt(LocalDateTime fromTime, Function<EventInfo, Boolean> consumer);
void lt(LocalDateTime toTime, Function<EventInfo, Boolean> consumer);
void between(LocalDateTime startTime, LocalDateTime endTime, Function<EventInfo, Boolean> consumer);
void flush();
}
No comments:
Post a Comment