
Friday, 10 July 2020

Data encoding and storage

Data encoding and storage formats are an evolving field; they have seen many changes, starting from naive text based encoding to advanced, compact, nested binary formats.


Selecting the correct encoding/storage format has a big impact on application performance and on how easily the application can evolve. Data encoding has a big impact on whether an application is backward/forward compatible.
Selecting the right encoding format can be one of the important factors for the agility of a data driven application.

Application developers tend to make the default choice of a text based encoding (XML, CSV or JSON) because it is human readable and language agnostic.
Text formats are not very efficient: they cost time and space, and they also struggle to evolve. If you care about efficiency then a binary format is the way to go.

In this post I will compare text vs binary encoding and build a simple persistent storage that supports flexible encoding.

We will compare popular text/binary encodings like CSV, JSON, Avro, Chronicle and SBE.



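A minimal sketch of a Trade value object for this comparison; the field names here are assumptions and the actual Trade object in the repo may differ.

public class Trade {
    private String tradeId;      // unique identifier of the trade
    private String symbol;       // instrument being traded
    private double price;        // execution price
    private long quantity;       // number of units
    private long tradeTime;      // epoch millis of execution
    // getters/setters omitted for brevity
}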
I will use the above Trade object as the example for this comparison.

CSV

It is one of the most popular textual formats. It has no support for types and makes no distinction between different kinds of numbers. One of the major restrictions is that it only supports scalar types; if we have to store nested or complex objects then custom encoding is required. Column and row values are separated by a delimiter, and special handling is required when the delimiter is part of a column value.

The reader application has to parse the text and convert it into the proper type at read time; this produces garbage and is also CPU intensive.

The best thing is that it can be edited in any text editor, and every programming language can read and write CSV.


JSON

This is what drives the Web today. The majority of user facing micro services use JSON for their REST APIs.
JSON addresses some of the issues with CSV by making a distinction between strings and numbers, and it also supports nested types like maps, arrays, lists etc. It is possible to have a schema for JSON messages, but it is not common practice because it takes away the flexible schema. JSON is the new XML these days.
One major drawback is size: a JSON message is bigger because it has to carry the key/attribute names as part of the message. I have heard that in some document based databases attribute names take up more than 50% of the space, so be careful when you select attribute names in a JSON document.

Both of these text formats are very popular in spite of all the inefficiency. If you need a frictionless data format across teams then go for a text based one.

Chronicle/Avro/SBE

These are very popular binary formats and are very efficient for distributed or trading systems.

SBE is very popular in the financial domain and is used as a replacement for the FIX protocol. I wrote about it in the post inside-simple-binary-encoding-sbe.

Avro is also very popular; it was built by taking many of the lessons from Protocol Buffers and Thrift. For row based and nested storage it is a very good choice, and it supports multiple languages. Avro applies some cool encoding tricks to reduce the size of a message; you can read about them in the post integer-encoding-magic.

Chronicle-Wire is picking up and I came across it only recently. It has a nice abstraction over text and binary messages with a single unified interface, which allows you to choose a different encoding based on the use case.


Let's look at some numbers now. This is a very basic comparison based only on the size of the message; run your own benchmark before making any selection.





We will try to save the above 2 records in different formats and compare the sizes.


Chronicle is the most efficient in this example. I have used the RawWire format, which is the most compact option available in the library because it only stores data; no schema metadata is stored.

Next are Avro and SBE, very close in terms of size, but SBE is more efficient in terms of encoding/decoding operations.

CSV is not that bad, it took 57 bytes for a single row, but don't select CSV based on size. As expected JSON takes more bytes to represent the same message; it is taking around 2X more than Chronicle.

Let's look at some real applications of these encodings. They can be used for building logs, queues, block storage, RPC messages etc.

To explore more I created a simple storage library that is backed by a file and allows plugging in different encoding formats.

public interface RecordContainer<T> extends Closeable {

    // appends a record at the end of the container
    boolean append(T message);

    // reads all records from the beginning
    void read(RecordConsumer<T> reader);

    // reads records starting from the given message offset
    void read(long offSet, RecordConsumer<T> reader);

    default void close() {
    }

    int size();

    String formatName();
}

This implementation allows appending records at the end of the buffer and reading the buffer from the start or randomly from a given message offset. It can be seen as an append only unbounded message queue; it has some similarity with Kafka topic storage.

RandomAccessFile from Java allows mapping the file content as a byte buffer, and after that the file content can be managed like any array.
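A minimal sketch of that mapping step, assuming a fixed maximum file size; the class and variable names are illustrative, not taken from the library.

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MappedBuffer {
    public static MappedByteBuffer map(String fileName, int capacity) throws Exception {
        try (RandomAccessFile file = new RandomAccessFile(fileName, "rw");
             FileChannel channel = file.getChannel()) {
            // map the first 'capacity' bytes; the mapping stays valid after the channel is closed
            return channel.map(FileChannel.MapMode.READ_WRITE, 0, capacity);
        }
    }
}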

All the code used in this post is available @ encoding github


 

Sunday, 28 June 2020

Ship your function

Nowadays function as a service (FaaS) is trending in the serverless area. It enables a new opportunity: sending a function on the fly to a server, where it starts executing immediately.

Code as data as code.

This helps in building applications that adapt to changing user needs very quickly.
Function as a service is a popular offering from cloud providers like Amazon, Microsoft, Google etc.

FaaS has a lot of similarity with the Actor model, which is about sending messages to actors that perform a local action. If code can also be treated like data, then code can be sent to a remote process and that process can execute the function locally.

I remember Joe Armstrong talking about how, back when he was building Erlang, he used to send a function to a server to turn it into an HTTP server or an SMTP server etc. He was doing this in 1986!

Let's look at how we can save an executable function and execute it later.
I will use Java as the example, but this can be done in any language that allows dynamic linking; JavaScript is definitely the winner for dynamic linking.

Quick revision
  Let's have a quick look at functions/behavior in Java.


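A minimal sketch of the kind of basic transformations being referred to; the exact functions are assumptions.

import java.util.function.Function;
import java.util.function.Predicate;

public class FunctionRevision {
    public static void main(String[] args) {
        Function<String, Integer> length = value -> value.length();      // map a string to its length
        Predicate<String> notEmpty = value -> !value.isEmpty();          // filter out empty strings
        Function<String, String> upper = value -> value.toUpperCase();   // simple transformation

        System.out.println(length.apply("faas"));        // 4
        System.out.println(notEmpty.test(""));           // false
        System.out.println(upper.apply("ship it"));      // SHIP IT
    }
}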
There is nothing much to explain in the above code, it is a very basic transformation.

Save function
Let's try to save one of these functions and see what happens.


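A sketch of the failing attempt, assuming plain Java object serialization is used to write the lambda to a byte stream; class and variable names here are illustrative.

import java.io.ByteArrayOutputStream;
import java.io.ObjectOutputStream;
import java.util.function.Function;

public class SaveFunction {
    public static void main(String[] args) throws Exception {
        Function<String, String> upper = value -> value.toUpperCase();

        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(upper);   // throws NotSerializableException at runtime
        }
    }
}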
The above code looks perfect, but it fails at runtime with the below error:

java.io.NotSerializableException: faas.FunctionTest$$Lambda$266/1859039536
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1184)
    at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
    at faas.FunctionTest.save_function(FunctionTest.java:39)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

Lambda functions are not serializable by default.
Java has a nice trick of using a cast expression to add an additional bound; more details are available at Cast Expressions.

In a nutshell it looks something like below.

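A minimal sketch of the cast expression trick, reusing the same toUpperCase lambda as above; the serialize helper is a hypothetical convenience method.

import java.io.Serializable;
import java.util.function.Function;

public class SaveFunctionWithCast {
    public static void main(String[] args) throws Exception {
        // the additional Serializable bound makes the generated lambda class serializable
        Function<String, String> upper =
                (Function<String, String> & Serializable) value -> value.toUpperCase();

        byte[] bytes = serialize(upper);   // now this works
        System.out.println(bytes.length + " bytes written");
    }

    static byte[] serialize(Object value) throws Exception {
        java.io.ByteArrayOutputStream buffer = new java.io.ByteArrayOutputStream();
        try (java.io.ObjectOutputStream out = new java.io.ObjectOutputStream(buffer)) {
            out.writeObject(value);
        }
        return buffer.toByteArray();
    }
}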

This technique allows converting any functional interface to bytes and reusing it later. It is used in the JDK in various places, for example TreeMap/TreeSet: these data structures have a comparator as a function and also support serialization.
   
With the basic thing working, let's try to build something more useful.

We have to hide the & Serializable magic to make the code more readable. This can be achieved with a functional interface that extends the base interface and just adds Serializable; it looks something like below.

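A sketch of such an interface; the name SerializableFunction is an assumption, the post's interface may be named differently.

import java.io.Serializable;
import java.util.function.Function;

// a Function that is always serializable, so callers never need the cast trick
public interface SerializableFunction<T, R> extends Function<T, R>, Serializable {
}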

Once we take care of the boilerplate it becomes very easy to write functions that are serialization ready.


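A hedged example of using the SerializableFunction interface from the sketch above to save a function and load it back; the helper names are hypothetical.

import java.io.*;

public class ShipFunction {
    public static void main(String[] args) throws Exception {
        SerializableFunction<String, String> upper = value -> value.toUpperCase();

        byte[] bytes = toBytes(upper);                        // "ship" the function as bytes
        @SuppressWarnings("unchecked")
        SerializableFunction<String, String> restored =
                (SerializableFunction<String, String>) fromBytes(bytes);

        System.out.println(restored.apply("hello"));          // HELLO
    }

    static byte[] toBytes(Object value) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(value);
        }
        return buffer.toByteArray();
    }

    static Object fromBytes(byte[] bytes) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return in.readObject();
        }
    }
}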

With the above building block we can save a full transformation (map/filter/reduce/collect etc) and ship it to a server for processing. This also allows building computations that can be recomputed if required.

Spark is a distributed processing engine that uses this type of pattern: it persists transformation functions and uses them for doing computation on multiple nodes.

So next time you want to build a distributed processing framework, look into this pattern, or if you want to take it to the extreme, send a patched function to a live server in production to fix the issue.

Code used in this post is available @ faas

Tuesday, 23 June 2020

Bit fiddling every programmer should know

Bit fiddling looks like magic; it allows doing so many things in a very efficient way.
In this post I will share some real world examples where bit operations can be used to gain good performance.

Bitwise operation bootcamp
Bit operators include:
 - AND (&)
 - OR (|)
 - NOT (~)
 - XOR (^)
 - Shifts (<<, >>)

Wikipedia has a good high level overview of Bitwise_operation. While preparing for this post I wrote learning tests, available in the learningtest github project. Learning tests are a good way to explore anything before you start a deep dive; I plan to write a detailed post on learning tests later.

In these examples I will be using the below bit tricks as building blocks for solving more complex problems.
  • countBits - count the number of 1 bits in a binary value
  • bitParity - compute the parity (check) bit of a binary value
  • set/clear/toggle - manipulate a single bit
  • pow2 - find the next power of 2 and use it as a mask

Code for these functions is available @ Bits.java on github and the unit test is available @ BitsTest.java.
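A minimal sketch of what such helpers might look like; this is not the exact Bits.java from the repo.

public final class BitTricks {

    // count the number of 1 bits in the binary representation
    public static int countBits(int value) {
        return Integer.bitCount(value);
    }

    // parity bit: 1 if the number of set bits is odd, 0 if even
    public static int bitParity(int value) {
        return Integer.bitCount(value) & 1;
    }

    public static int set(int value, int index)    { return value | (1 << index); }
    public static int clear(int value, int index)  { return value & ~(1 << index); }
    public static int toggle(int value, int index) { return value ^ (1 << index); }

    // next power of 2 greater than or equal to value (assumes 1 <= value <= 2^30)
    public static int nextPow2(int value) {
        return 1 << (32 - Integer.numberOfLeadingZeros(value - 1));
    }
}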

Let's look at some real world problems now.

Customer daily active tracking
 E-commerce companies track important metrics like which days a customer was active or did some business. These metrics become very important for building models that improve customer engagement. This type of metric is also useful for fraud or risk related use cases.
Investment banks also use such metrics on stocks/currencies for building trading models etc.

Using simple bit manipulation tricks, 30 days of data can be packed into only 4 bytes, so storing a whole year of info requires only 48 bytes.

Code snippet

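A hedged sketch of such a tracker, packing one month of activity into a single int; the names are illustrative.

public class ActivityTracker {

    // 12 ints = 48 bytes for a full year, one bit per day
    private final int[] months = new int[12];

    public void markActive(int month, int dayOfMonth) {
        months[month] |= 1 << (dayOfMonth - 1);       // set the bit for that day
    }

    public boolean wasActive(int month, int dayOfMonth) {
        return (months[month] & (1 << (dayOfMonth - 1))) != 0;
    }

    public int activeDaysInMonth(int month) {
        return Integer.bitCount(months[month]);       // count set bits
    }
}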

Apart from compact storage this pattern has good data locality, because the whole thing can be read by the processor with a single load operation.

Transmission errors
This is another area where bit manipulation shines. Say you are building block management software for distributed storage, or a file transfer service; one of the things required for such a service is to make sure the transfer was done properly and no data was lost during transmission. This can be done using the bit parity (odd or even) technique, which involves keeping the number of '1' bits odd or even.

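A minimal sketch of parity based verification, assuming a scheme where sender and receiver both compute the parity bit of the payload.

public class ParityCheck {

    // parity bit of the payload: 1 if the number of set bits is odd
    public static int parity(byte[] payload) {
        int ones = 0;
        for (byte b : payload) {
            ones += Integer.bitCount(b & 0xFF);
        }
        return ones & 1;
    }

    // receiver recomputes the parity and compares it with the transmitted bit
    public static boolean isValid(byte[] payload, int transmittedParity) {
        return parity(payload) == transmittedParity;
    }
}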

Another way to do this type of verification is Hamming_distance. Below is a code snippet for the hamming distance of integer values.


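A sketch of hamming distance for ints, which is just the number of bit positions where two values differ.

public class Hamming {
    // XOR leaves a 1 at every position where the two values differ,
    // so counting those bits gives the hamming distance
    public static int distance(int a, int b) {
        return Integer.bitCount(a ^ b);
    }
}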

This is a very useful way to keep data integrity with no extra overhead.
Locks
Let's get into concurrency now. Locks are generally not good for performance, but sometimes we have to use them. Many lock implementations are very heavyweight and also hard to share between programs. In this example we will build a memory efficient lock: 32 locks can be managed using a single integer.

Code snippet

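A hedged sketch of such a lock, using one bit per lock inside an AtomicInteger; names are illustrative.

import java.util.concurrent.atomic.AtomicInteger;

public class BitLocks {

    // each bit of this integer is an independent lock (32 locks in 4 bytes)
    private final AtomicInteger locks = new AtomicInteger();

    public boolean tryLock(int index) {
        int mask = 1 << index;
        while (true) {
            int current = locks.get();
            if ((current & mask) != 0) {
                return false;                          // already locked
            }
            if (locks.compareAndSet(current, current | mask)) {
                return true;                           // acquired by setting the bit
            }
        }
    }

    public void unlock(int index) {
        int mask = 1 << index;
        int current;
        do {
            current = locks.get();
        } while (!locks.compareAndSet(current, current & ~mask));  // clear the bit
    }
}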
This example uses the single bit setting trick along with AtomicInteger to make the code thread safe.
It is a very lightweight lock. Since this example is related to concurrency it will have some issues due to false sharing; it is possible to address that using some of the techniques mentioned in the scalable-counters-for-multi-core post.

Fault tolerant disk
Let's get into some serious stuff. Assume we have 2 disks and we want to keep a copy of the data so that we can restore it in case one of the disks fails. The naive way of doing this is to keep a backup copy of every disk, so if you have 1 TB then an additional 1 TB is required. Cloud providers like Amazon will be very happy if you use such an approach.
Just by using the XOR (^) operator we can keep the backup for a pair of disks on a single disk, so we get a 50% gain.
A 50% saving on the storage expense.

Code snippet testing restore logic.

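A minimal sketch of the XOR parity idea (not the RaidDisk.java from the repo): the parity disk stores disk1 XOR disk2, and either data disk can be rebuilt from the other plus the parity.

public class XorParity {

    public static byte[] parity(byte[] disk1, byte[] disk2) {
        byte[] parity = new byte[disk1.length];
        for (int i = 0; i < disk1.length; i++) {
            parity[i] = (byte) (disk1[i] ^ disk2[i]);   // backup for both disks
        }
        return parity;
    }

    // restore a failed disk from the surviving disk and the parity disk
    public static byte[] restore(byte[] survivingDisk, byte[] parity) {
        byte[] restored = new byte[survivingDisk.length];
        for (int i = 0; i < survivingDisk.length; i++) {
            restored[i] = (byte) (survivingDisk[i] ^ parity[i]);
        }
        return restored;
    }
}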
Disk code is available @ RaidDisk.java

Ring buffer
A ring buffer is a very popular data structure for async processing and for buffering events before writing to a slow device. A ring buffer is a bounded buffer, which helps in having zero allocation in the critical execution path; a very good fit for low latency programming.
One of the common operations is finding the slot in the buffer for a write/read, which is usually done with the mod (%) operator. Mod or divide is not good for performance because it stalls execution: the CPU has only 1 or 2 ports for processing division but many ports for bit wise operations.

In this example we will use a bit wise operator to find the mod; this is only possible if the buffer size is a power of 2. I think this is one trick everyone should know.

x & (n - 1)

If n is a power of 2 then x & (n - 1) computes x mod n in a single instruction. This trick is so popular that it is used in many places; the JDK HashMap also uses it to find the slot in the map.

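A hedged sketch of a ring buffer using that trick for slot lookup; the capacity must be a power of 2.

public class RingBuffer<T> {

    private final Object[] slots;
    private final int mask;          // capacity - 1, works as the mod mask
    private long writeSequence;
    private long readSequence;

    public RingBuffer(int capacity) {
        if (Integer.bitCount(capacity) != 1) {
            throw new IllegalArgumentException("capacity must be a power of 2");
        }
        this.slots = new Object[capacity];
        this.mask = capacity - 1;
    }

    public boolean offer(T value) {
        if (writeSequence - readSequence == slots.length) {
            return false;                                  // buffer full
        }
        slots[(int) (writeSequence++ & mask)] = value;     // slot index without %
        return true;
    }

    @SuppressWarnings("unchecked")
    public T poll() {
        if (readSequence == writeSequence) {
            return null;                                   // buffer empty
        }
        return (T) slots[(int) (readSequence++ & mask)];
    }
}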


Conclusion
I have shared, at a very high level, what is possible with simple bit manipulation techniques.
Bit manipulation enables many innovative ways of solving problems. It is always good to have extra tools in the programmer's kit, and many of these things are timeless and applicable to every programming language.

All the code used in this post is available @ bits repo.

Saturday, 23 May 2020

Data modeling is everything

Everyone is aware of relational data modeling; it has served the industry for a long time, but as data pressure increased, relational data modeling based on Edgar_F._Codd's rules has not scaled well.



Those rules were based on the hardware limits of the 1970s, and RDBMS vendors took all of that and built databases that were a good fit for the hardware limits of the 70s.

We are in 2020 and times have changed; hardware is much cheaper and better. Look at the storage price over time.



Many data systems have taken advantage of cheap storage to build highly available & reliable systems. Some RDBMSs are still playing a catch up game; I would say NoSQL has taken the lead by leveraging this.

Data modeling is very different when storage is not an issue or bottleneck. Today the limit is the CPU, because CPUs are not getting faster; you can have more of them, but not faster ones. Let's look at some data modeling techniques that can be used today, even with your favorite RDBMS, to get blazing fast performance.


One thing before diving into modeling: the real world is relational and data will always have relations, so any modeling technique that can't handle 1-to-1, 1-to-many or many-to-many relationships is useless. Each data model has trade offs; it is designed with a purpose and could be optimizing writes or reads for a specific access pattern.

A few things that make an RDBMS slow are unbounded table scans, joins and aggregation. If we build a data model that does not need these slow operations then we can build blazing fast systems!
Most RDBMSs are a key value store with a B-Tree index on top of it; if every query to the DB can be turned into a key lookup or a small index scan then we get the best out of the database.

With that, let's dive into some ideas to avoid joining datasets.

Use non scalar attribute types
Especially for relationships like customer to preferences, customer to addresses, FedEx delivery to destinations, customer to payment options etc, we tend to create a table with multiple rows and then join on a foreign key at read time.

If such a request comes to your system millions of times a day, then it is not really a good option to do the join a million times. What should we do instead?
Welcome to the non scalar types of your data storage system: use maps, lists, vectors or blobs.

I know this may sound crazy, but just by using the above types you have avoided the join and saved a big CPU cost a million times over. You should also know the trade off of this pattern: it is no longer possible to use plain SQL to see the value of a non scalar column, but that is just a tooling gap and can be addressed.

Code snippet for such a data model

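A hedged sketch of what such a model could look like in application code; the entity and field names are assumptions. The addresses and preferences live as non scalar columns on the customer row rather than in separate tables.

import java.util.List;
import java.util.Map;

public class Customer {
    private String customerId;

    // stored as a list/JSON/blob column on the customer row,
    // so reading a customer never needs a join to an address table
    private List<Address> addresses;

    // preference key -> value, again a single non scalar column
    private Map<String, String> preferences;
}

class Address {
    String line1;
    String city;
    String zip;
}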


I have avoided many joins using this pattern, and the user experience has improved so much because of it. The database will load this column for free, and you can assemble it in the application layer if the database does not support it natively.
One more thing to be aware of: the column value should not be unbounded. Put some limit on it and chunk it when the limit is crossed, to keep your favorite database happy.

Duplicate immutable attributes
This pattern needs more courage to use. If you clearly know that something in the system is immutable, like a product name, description, brand, seller etc, then anytime the product is referred to, for example an order item referring to the product that was bought, you can copy all the immutable attributes of the product onto the order item.
  
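A hedged sketch of an order item carrying copies of the immutable product attributes; the field names are illustrative.

public class OrderItem {
    private String orderId;
    private String productId;       // the reference is still kept
    private int quantity;

    // immutable product attributes copied at write time,
    // so showing order details never joins back to the product table
    private String productName;
    private String productDesc;
    private String brand;
    private String seller;
}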

Showing order details is a very frequent access pattern on an e-commerce site, and this type of solution helps with that access pattern.
The trade off is that if the attributes are not immutable then you have the overhead of updating them whenever the referenced entity changes, but if that does not change frequently then you benefit in a big way.

- Single table or multi entity table
This one will definitely get the most resistance because it seems like an insult to the RDBMS, but we have done this in the past. Remember those parent/child queries where both parent and child are stored in a single table and we join on the parent id? Employees and managers were modeled like this for many years.

Let's take another example of a parent & child relationship.
Class and student is also a classic example: a class has many students and vice versa.

Anytime we want to show details of all the students for a specific class, we first query to get the class id and then all the students for that class, or we join on the class id.

How can we avoid the join or the sequential load dependency?
When we write the join query we are creating a de-normalized view at runtime and discarding it after the request is served. What if this view is created at write time? Then we have pre-joined the data and we avoid the join at runtime.

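A hedged sketch of such a single table model, matching the Registration examples used later in this post; the constructor shape is an assumption.

public class Registration {
    private final String pk1;        // e.g. "class:c1" or "student:s1"
    private final String pk2;        // e.g. "class:c1" or "student:s1"
    private final String className;
    private final String classDesc;

    public Registration(String pk1, String pk2, String className, String classDesc) {
        this.pk1 = pk1;
        this.pk2 = pk2;
        this.className = className;
        this.classDesc = classDesc;
    }
}

// pre-joined rows written at registration time:
//   new Registration("class:c1", "class:c1",   "science", "Science for Primary kids")  // the class itself
//   new Registration("class:c1", "student:s1", "science", "Science for Primary kids")  // student s1 in class c1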

This model uses generic names (pk1, pk2) for identifying the entity in a single table. It looks a little different, but it is very powerful, because with pre-joined data we can answer questions like:

 - Single class details (pk1="class:c1", pk2="class:c1")
 - Single student for a given class (pk1="class:c1", pk2="student:s1")
 - Both the class and all its students (pk1="class:c1")
 - All the students of a given class (pk1="class:c1", pk2!="class:c1")

Let's add one more scenario: getting all the classes a student is attending. This just needs flipping pk1 and pk2, making it something like

Registration("student:s1", "class:C1", className = "science", classDesc = "Science for Primary kids")

Registration("student:s1", "class:C2", className = "Maths", classDesc = "Basic Algebra")

I would say that this is the most different one and needs out of the box thinking to give it a try, but with it you can solve many use cases.

 - Projected column index
This is a common one where projection and aggregation are done upfront; it can be treated as a de-normalized view built at write time. If the aggregation is linear, like sum, avg, max or min, then the aggregation can be done incrementally. This avoids running a Group By query for every request. Think how much CPU can be saved by this simple pattern.
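A minimal sketch of incremental aggregation maintained at write time; the names are assumptions. The running sum and count are updated on every write, so the average is a constant time read.

public class OrderStats {
    private long totalAmount;   // running sum, updated on every new order
    private long orderCount;

    // called at write time, so no Group By is needed at read time
    public void onNewOrder(long amount) {
        totalAmount += amount;
        orderCount++;
    }

    public double averageOrderValue() {
        return orderCount == 0 ? 0 : (double) totalAmount / orderCount;
    }
}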

- Extreme projection index
This is an extension of the above pattern, taken to the next level by calculating the projection at every level; it is very effective for hierarchy based queries. It can completely avoid grouping and aggregation at read time.

   
Conclusion
None of the patterns shared here are new; they all take advantage of cheap storage to build the de-normalized view at write time instead of at read time.
If all the joins and most of the aggregation can be done at write time, then reads are so fast that it might feel like you are not hitting any database at all.

In this post we did not touch on data partitioning strategy, but that also plays an important role in building systems that users love to use.

Time to get a little closer to your data and spend time understanding the access patterns and hardware limits before designing a system!

Saturday, 25 April 2020

Immutability is everywhere, even in hard disks.



Storage is cheap, and it is used as leverage for building many high performance systems. If data is immutable then it is safe to share, and multiple copies of the data can be maintained for various access patterns.


Many old design ideas like append-only logs, copy-on-write, log structured merge trees, materialized views, replication etc are getting popular due to affordable storage.

In software we have seen many examples where immutability is the key design decision: Spark is based on immutable RDDs, many key value stores like LevelDB/RocksDB/HBase are based on immutable storage tables, column databases like Cassandra also take advantage of it, and HDFS is fully based on immutable file chunks/blocks.

It is interesting to see that our hardware friends are also using immutability.

A Solid State Drive (SSD) is divided into physical blocks, and each block supports a finite number of writes; each write operation causes some wear and tear to the block. Chip designers use wear leveling to evenly distribute the write load across blocks. The disk controller tracks the number of writes each block has gone through using non-volatile memory.

Wear leveling is based on the copy-on-write pattern, which is a flavor of immutability.
The disk maintains a logical address space that maps to physical blocks; the block that stores the logical address to physical block mappings supports more write operations compared to a normal data block.

Each write operation, whether new or an update, is written to a new place in a circular fashion to even out the writes. This also helps in giving write guarantees when a power failure happens.

An SSD can be seen as a small distributed file system made of a name node (logical addresses) and data nodes (physical blocks).

One question that might come to your mind: what happens to blocks that never change? Are they not used to the maximum level of writes?

Our hardware friends are very intelligent! They have come up with 2 algorithms.

Dynamic wear leveling
Blocks undergoing re-writing are written to new blocks. This algorithm is not optimal, as read only data blocks never get the same volume of writes, which causes the disk to become unusable even though it could take more writes.

Static wear leveling
This approach tries to balance write amplification by also selecting blocks containing static data. This is a very important algorithm when software is built around immutable files.

Immutable designs are now affordable and are key to building successful distributed systems.

Friday, 17 April 2020

Long Live ETL

Extract transform load is the process of pulling data from one data system and loading it into another data system. The data systems involved are called the source system and the target system.

The shape of the data from the source system does not match the target system, so some conversion is required to make it compatible; that process is called transformation. Transformation is made of map/filter/reduce operations.
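A tiny sketch of such a transformation expressed with Java streams; the record shapes are hypothetical.

import java.util.List;
import java.util.stream.Collectors;

public class Transform {
    record SourceRecord(String userId, String country, long amountInCents) {}
    record TargetRecord(String userId, double amount) {}

    public static List<TargetRecord> transform(List<SourceRecord> source) {
        return source.stream()
                .filter(r -> "US".equals(r.country()))                              // filter
                .map(r -> new TargetRecord(r.userId(), r.amountInCents() / 100.0))  // map
                .collect(Collectors.toList());                                      // load-ready shape
    }
}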


To handle the incompatibility between data systems some metadata is required. What type of metadata is useful?
It is very common that source data will be transformed into many different shapes to handle various business use cases, so it makes sense to use descriptive metadata for the source system and prescriptive metadata for the target system.

Metadata plays an important role in making the system both backward and forward compatible.
 
Many times having metadata alone is not enough, because some source/target system data is too large or too small to fit.



This is the situation where transformation becomes interesting. It means some values have to be dropped, set to NULL or set to a default value; making good decisions about this is very important for the backward/forward compatibility of the transformation. I would say much business success also depends on how this problem is solved! Many integration nightmares can be avoided if this is done properly.

So far we have been discussing a single source system, but for many use cases data from other systems is required to do the transformation, for example converting a user id to a name, deriving a new column value, lookup encoding and many more.

Adding multiple source systems adds complexity to the transformation, to handle missing data, stale data and much more.

As data systems are evolving, it is not only about relational stores today; we also see key-value stores, document stores, graph DBs, column stores, caches, logs etc.

New data systems are also distributed, which adds another dimension to the complexity of transformation.

Our old relational databases can also be described as being built on the ETL pattern, using the change log as the source for everything the database does.

One of the myths about ETL is that it is a batch process, but that is changing over time with stream processors (i.e. Spark Streaming, Flink etc) and pub/sub systems (Kafka, Pulsar etc). This enables doing the transformation immediately after an event is pushed to the source system.

Don't get too carried away by the streaming buzzword; no matter which stream processor or pub/sub system you use, you still have to handle the challenges stated above, or lean on one of the new platforms to take care of them.

Invest in the transformation/business logic because it is key to building a successful system that can be maintained and scaled.
Keep it stateless and metadata driven, handle duplicates/retries etc, and more importantly write tests to take good care of it in fast changing times.

Next time you get the below question about your ETL process:
Do you process in real time or batch?

Your answer should be:
It is event based processing.


Long live E T L

Friday, 10 April 2020

Testing using mocks

Mock objects are very useful if used the right way. I shared some of my experience of using mock objects in the need-driven-software-development-using post.


In this post I share 2 things:
- Contract based testing using mocks.
- Patterns to organize mock code.


Contract based testing
Let's take a scenario where you are building a money remittance service. The key components in this type of service are a Currency Converter, a Bank Service & an FX Service.

A 50,000 feet design of the fictitious forex service looks something like below.





We have to write an FX Service that needs the Currency Converter & the Bank Transfer service. This is a perfect scenario for contract based testing.

Code snippet for FXService

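A hedged sketch of what the FXService might look like; the interface and method names are assumptions, not the post's exact code.

public class FXService {

    private final CurrencyConverter currencyConverter;
    private final BankService bankService;

    public FXService(CurrencyConverter currencyConverter, BankService bankService) {
        this.currencyConverter = currencyConverter;
        this.bankService = bankService;
    }

    public double transfer(double amount, String fromCurrency, String toCurrency, String account) {
        double rate = currencyConverter.rate(fromCurrency, toCurrency);   // 1 call to the converter
        double converted = amount * rate;
        bankService.deposit(account, converted);                          // 1 call to bank transfer
        return converted;
    }
}

interface CurrencyConverter {
    double rate(String fromCurrency, String toCurrency);
}

interface BankService {
    void deposit(String account, double amount);
}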


Our new FX service has to follow the below contract:
  • Interact with the currency converter & bank transfer based on the input/output contract.
  • Make 1 call to each of the services.

One way to test the FX service is to call the real services, but that means slow running tests and a dependency on those services being up whenever our test is executing. Sometimes calling the real service is not even an option because it has not been developed yet.

The smart way is to mock these collaborators (Currency Converter & Bank Transfer) and verify the interaction using a mocking framework.
Another advantage of testing with mocks is that it enables verifying that both the currency and bank transfer services are used by the FX service in the expected way.

Let's look at the mock based test.

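A hedged sketch of such a test with EasyMock, using the hypothetical FXService and collaborators sketched above.

import static org.easymock.EasyMock.*;
import static org.junit.Assert.assertEquals;
import org.junit.Test;

public class FXServiceTest {

    @Test
    public void transfers_converted_amount_to_bank() {
        CurrencyConverter converter = createMock(CurrencyConverter.class);
        BankService bank = createMock(BankService.class);

        expect(converter.rate("USD", "INR")).andReturn(75.0);   // stub the collaborator reply
        bank.deposit("acc-1", 7500.0);                          // expect exactly one deposit call
        replay(converter, bank);

        FXService fxService = new FXService(converter, bank);
        double transferred = fxService.transfer(100, "USD", "INR", "acc-1");

        assertEquals(7500.0, transferred, 0.001);
        verify(converter, bank);                                // verifies the interaction contract
    }
}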


This test is written using the EasyMock framework and mocks the replies from the collaborators.

Write the test that you want to read

One of the important properties of a good test is that it is enjoyable to read.
Mocks can make this goal harder to achieve, because the setup code for a unit test ends up with very complex assembling logic that is a mix of normal object setup and mocking expectations. I am sure you have seen a before function in a test class that is used as a dumping ground for the setup required by all the tests in the class.

Let's look at some of the mock code we used earlier and try to improve it.


Another way

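A hedged sketch of the JMock style being described, with the expectations expressed through its sugar methods; the collaborator names are reused from the hypothetical example above.

import org.jmock.Expectations;
import org.jmock.Mockery;
import org.junit.Test;

public class FXServiceJMockTest {

    private final Mockery context = new Mockery();

    @Test
    public void transfers_converted_amount_to_bank() {
        CurrencyConverter converter = context.mock(CurrencyConverter.class);
        BankService bank = context.mock(BankService.class);

        // expectations read almost like the spec of the collaboration
        context.checking(new Expectations() {{
            oneOf(converter).rate("USD", "INR"); will(returnValue(75.0));
            oneOf(bank).deposit("acc-1", 7500.0);
        }});

        new FXService(converter, bank).transfer(100, "USD", "INR", "acc-1");

        context.assertIsSatisfied();
    }
}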
Both of the above snippets do the same thing, but the later one, written with JMock, has nice sugar methods to express the same thing.
This helps in keeping the expectations clean and in context with the code being tested; the collaborator objects in the context are mocked out.

It is a simple pattern but very effective in making tests readable.

Code used in this post is available on github