Encoding/Decoding
CSV
It is one of the most popular textual formats. It has no support for types and makes no distinction between different kinds of numbers. One major restriction is that it only supports scalar types, so storing nested or complex objects requires custom encoding. Column and row values are separated by delimiters, and special handling is required when the delimiter is part of a column value.
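The special handling for delimiters inside a column value can be sketched as follows. This is a minimal illustration that assumes the common quoting convention (wrap the field in double quotes and double any embedded quote); the class and method names are made up for this example.

```java
import java.util.List;
import java.util.stream.Collectors;

class CsvEscapeSketch {

    // Quote a field only when it contains the delimiter, a quote, or a line break.
    static String escape(String field, char delimiter) {
        boolean needsQuoting = field.indexOf(delimiter) >= 0
                || field.indexOf('"') >= 0
                || field.indexOf('\n') >= 0
                || field.indexOf('\r') >= 0;
        if (!needsQuoting) {
            return field;
        }
        // Embedded quotes are doubled, then the whole field is wrapped in quotes.
        return '"' + field.replace("\"", "\"\"") + '"';
    }

    // Joins already-escaped fields into one CSV row.
    static String toLine(List<String> fields, char delimiter) {
        return fields.stream()
                .map(f -> escape(f, delimiter))
                .collect(Collectors.joining(String.valueOf(delimiter)));
    }

    public static void main(String[] args) {
        // prints: 1,"Smith, John","say ""hi"""
        System.out.println(toLine(List.of("1", "Smith, John", "say \"hi\""), ','));
    }
}
```

A reader has to undo the same quoting, which is part of why CSV parsing is more work than a plain split on the delimiter.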
The reader application has to parse the text and convert it into the proper types at read time, which produces garbage and is CPU intensive.
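As a rough sketch of that read-time cost, the snippet below parses one line into a typed object. The Trade record and its columns are hypothetical, and the plain split does not handle quoted fields.

```java
class CsvParseSketch {

    // Hypothetical record; the field names are assumptions for illustration.
    record Trade(long id, String symbol, double price, int quantity) {}

    static Trade parse(String line) {
        // split allocates a String[] plus one String per column: short-lived garbage.
        String[] columns = line.split(",");
        // Every numeric column is converted from text on each read.
        return new Trade(
                Long.parseLong(columns[0]),
                columns[1],
                Double.parseDouble(columns[2]),
                Integer.parseInt(columns[3]));
    }

    public static void main(String[] args) {
        System.out.println(parse("42,GOOGL,1420.5,100"));
    }
}
```

A binary format that stores the long, double, and int in their native layout skips both the intermediate strings and the text-to-number conversion.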
The best thing is that it can be edited in any text editor, and practically every programming language can read and write CSV.
JSON
Both of these text formats are very popular in spite of all their inefficiency. If you need a frictionless data format interface across teams, go for a text-based one.
Chronicle/Avro/SBE
These are very popular binary formats and are very efficient for distributed or trading systems.
SBE is very popular in the financial domain, where it is used as a replacement for the FIX protocol. I wrote about it in the post inside-simple-binary-encoding-sbe.
Avro is also very popular and builds on many lessons learned from Protocol Buffers and Thrift. It is a very good choice for row-based and nested storage, and it supports multiple languages. Avro applies some cool encoding tricks to reduce message size; you can read about them in the post integer-encoding-magic.
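One of those tricks is how integers are written: Avro's binary encoding stores int and long values with variable-length zig-zag coding. The sketch below is hand-rolled to show the idea and does not use the Avro library; the class and method names are made up for this example.

```java
import java.io.ByteArrayOutputStream;

class ZigZagVarintSketch {

    // Zig-zag maps values near zero, positive or negative, to small unsigned values:
    // 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
    static long zigZag(long value) {
        return (value << 1) ^ (value >> 63);
    }

    static long unZigZag(long encoded) {
        return (encoded >>> 1) ^ -(encoded & 1);
    }

    // Writes 7 bits per byte, low-order group first; a set high bit means more bytes follow.
    static byte[] encode(long value) {
        long n = zigZag(value);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((n & ~0x7FL) != 0) {
            out.write((int) ((n & 0x7F) | 0x80));
            n >>>= 7;
        }
        out.write((int) n);
        return out.toByteArray();
    }

    static long decode(byte[] bytes) {
        long n = 0;
        int shift = 0;
        for (byte b : bytes) {
            n |= (long) (b & 0x7F) << shift;
            shift += 7;
            if ((b & 0x80) == 0) {
                break; // last byte of this value
            }
        }
        return unZigZag(n);
    }

    public static void main(String[] args) {
        long[] samples = {1, -1, 64, 1_000_000, Long.MIN_VALUE};
        for (long sample : samples) {
            byte[] encoded = encode(sample);
            System.out.println(sample + " -> " + encoded.length + " byte(s), decoded " + decode(encoded));
        }
    }
}
```

Values close to zero fit in a single byte, while a fixed-width long always takes eight, which is where the size saving comes from when most numbers in a message are small.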
Chronicle-Wire is picking up, and I came across it only recently. It offers a nice abstraction over text and binary messages behind a single unified interface, and the library lets you choose a different encoding based on the use case.
Let's look at some numbers now. This is a very basic comparison that looks only at message size. Run your own benchmark before making any selection.
We will try to save the above 2 records in different formats and compare the sizes.
Chronicle is the most efficient in this example. I used the RawWire format, which is the most compact option available in the library because it stores only data and no schema metadata.
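Why a data-only layout is smaller can be illustrated with a hand-rolled sketch. It does not use Chronicle-Wire, the Trade record and its fields are hypothetical, and the byte counts it prints are not the measurements from this post.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

class RawVersusJsonSketch {

    // Hypothetical record fields, chosen only for illustration.
    record Trade(long id, String symbol, double price, int quantity) {}

    // Data only: writer and reader agree on the field order, nothing names the fields.
    static byte[] toRaw(Trade t) {
        byte[] symbol = t.symbol().getBytes(StandardCharsets.UTF_8);
        ByteBuffer buffer = ByteBuffer.allocate(
                Long.BYTES + 1 + symbol.length + Double.BYTES + Integer.BYTES);
        buffer.putLong(t.id());
        buffer.put((byte) symbol.length); // one length byte, assumes symbols of at most 255 bytes
        buffer.put(symbol);
        buffer.putDouble(t.price());
        buffer.putInt(t.quantity());
        return buffer.array();
    }

    // Self-describing: every message repeats the field names and punctuation.
    static byte[] toJson(Trade t) {
        String json = "{\"id\":" + t.id()
                + ",\"symbol\":\"" + t.symbol() + "\""
                + ",\"price\":" + t.price()
                + ",\"quantity\":" + t.quantity() + "}";
        return json.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        Trade trade = new Trade(42, "GOOGL", 1420.5, 100);
        System.out.println("raw bytes:  " + toRaw(trade).length);
        System.out.println("json bytes: " + toJson(trade).length);
    }
}
```

The trade-off is that the raw bytes are meaningless without out-of-band knowledge of the layout, so schema changes need more care than with a self-describing format.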
Next come Avro and SBE, which are very close in size, but SBE is more efficient in terms of encoding/decoding operations.
CSV is not that bad: it took 57 bytes for a single row, but don't select CSV based on size. As expected, JSON takes more bytes to represent the same message, around 2X more than Chronicle.
Let's look at some real applications of these encodings. They can be used for building logs, queues, block storage, RPC messages, etc.
To explore more, I created a simple storage library that is backed by a file and allows specifying different encoding formats.
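import java.io.Closeable;

public interface RecordContainer<T> extends Closeable {
    // Appends a message at the end of the buffer.
    boolean append(T message);

    // Reads records from the start of the buffer.
    void read(RecordConsumer<T> reader);

    // Reads records starting at the given message offset.
    void read(long offSet, RecordConsumer<T> reader);

    default void close() {
    }

    int size();
    String formatName();
}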
This implementation allows appending records at the end of the buffer and reading the buffer from the start or randomly from a given message offset. It can be seen as an append-only unbounded message queue, and it has some similarity with Kafka topic storage.
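The append and offset-based read behaviour can be sketched with a length-prefixed log. This is an in-memory illustration with made-up names, not the library's implementation: it uses a fixed-size heap ByteBuffer and java.util.function.Consumer instead of the RecordContainer and RecordConsumer types.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.function.Consumer;

class AppendOnlyLogSketch {
    private final ByteBuffer buffer;

    AppendOnlyLogSketch(int capacity) {
        this.buffer = ByteBuffer.allocate(capacity);
    }

    // Appends [4-byte length][payload] and returns the offset where the record starts,
    // or -1 when the buffer has no room left.
    long append(byte[] payload) {
        if (buffer.remaining() < Integer.BYTES + payload.length) {
            return -1;
        }
        int offset = buffer.position();
        buffer.putInt(payload.length);
        buffer.put(payload);
        return offset;
    }

    // Replays every record from the beginning.
    void read(Consumer<byte[]> consumer) {
        read(0, consumer);
    }

    // Replays records starting at an offset previously returned by append.
    void read(long offset, Consumer<byte[]> consumer) {
        ByteBuffer view = buffer.duplicate(); // independent position, shared content
        int end = buffer.position();          // end of written data
        view.position((int) offset);
        while (view.position() < end) {
            byte[] payload = new byte[view.getInt()];
            view.get(payload);
            consumer.accept(payload);
        }
    }

    public static void main(String[] args) {
        AppendOnlyLogSketch log = new AppendOnlyLogSketch(1024);
        log.append("first".getBytes(StandardCharsets.UTF_8));
        long second = log.append("second".getBytes(StandardCharsets.UTF_8));
        log.append("third".getBytes(StandardCharsets.UTF_8));
        // Start reading from the second record, skipping the first.
        log.read(second, p -> System.out.println(new String(p, StandardCharsets.UTF_8)));
    }
}
```

The offset returned by append plays a role similar to a message offset in a topic: a consumer can remember it and resume reading from that point later.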
RandomAccessFile from Java allows mapping file content into memory as a buffer (through its FileChannel), and after that the file content can be managed much like an array.
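A minimal sketch of that mapping, using RandomAccessFile.getChannel() and FileChannel.map() from the JDK. The file name prefix, region size, and position are arbitrary choices for illustration.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;

class MappedFileSketch {

    // Maps a temporary file, writes a long at a fixed position, and reads it back.
    static long roundTrip(long value) {
        try {
            Path file = Files.createTempFile("records", ".dat");
            file.toFile().deleteOnExit();
            try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "rw");
                 FileChannel channel = raf.getChannel()) {
                int size = 1024;
                raf.setLength(size); // make sure the mapped region lies inside the file
                MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_WRITE, 0, size);
                buffer.putLong(16, value);  // absolute write, similar to array[16] = value
                return buffer.getLong(16);  // absolute read from the same position
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(42L));
    }
}
```

Because a single MappedByteBuffer is indexed with an int, files larger than 2 GB need several mapped regions.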
All the code used in this post is available @ encoding github