Saturday 25 April 2020

Immutability is everywhere even in hard disk.

Storage is cheap and it is used as leverage for building many high performance system. If data is immutable then it is safe to share and maintain multiple copy of data for various access pattern.

Many old design ideas like append-only logs, copy of write , Log structure merge tree, materialized views, replication etc is getting popular due to affordable storage.

In software we have seen many example where immutability is key design decision like Spark is based on immutable RDDs, many key value store like leveldb/rocksdb/hbase is based on immutable storage table, column databases casandra are also taking advantage , HDFS is fully based on immutable file chunk/block.

It is interesting to see that our hardware friends are also using immutability. 

Solid State Drive(SSD) is broken in physical blocks and each block supports finite number of writes, each write operation cause some wear and tear to block. Chip designer use feature of wearing level to evenly distribute write load on each block. Disk controller tracks no of write each block has gone through using non-volatile memory.

Wearing level is based copy-on-write pattern which is flavor of immutability.
Disk maintains logical address space that maps to physical block, the block of disk that stores logical address to physical block mappings supports more write operation as compared to normal data block.

Each write operation whether it is new or update is written at new place in circular fashion to even out writes. This also helps in giving write guarantee when power failure happens.

SSD can be seen like small distributed file system that is made of name node(logical address) and data nodes(physical block).

One question that might come to your mind is that what happens to blocks that never changes ? are they not used to the max level of writes ?

Our hardware friends are very intelligent ! They has come with 2 algorithm 

Dynamic wear leveling
Block undergoing re-writing are written to new blocks. This algorithm is not optimal as read only data blocks never gets same volume of write and cause disk to become unusable even though disk can take more writes.

Static wear leveling 
This approach try to balance write amplification by selecting block containing static data. This is very important algorithm when software are built around immutable files.

Immutable design are now affordable and key to building successful distributed systems.

Friday 17 April 2020

Long Live ETL

Extract transform load is process for pulling data from one datasystem and loading into another datasystem. Datasystem involved are called source system and target system.

Shape of data from source system does not match to the target system, so some conversion is required to make it compatible and that process is called transformation. Transformation is made of map/filter/reduce operations.

To handle the incompatibility between data systems some metadata is required. What type of metadata will be useful ?
It is very common that source data will be transformed to many different shape to handle various business usecase, so it makes sense to use descriptive metadata for source system and prescriptive metadata for target system.

Metadata plays important role in making system both backward and forward compatible.
Many times just having metadata is not enough because some source/target system data is too large or too small to fit.

This is situation when transformation becomes interesting. This means some value have to dropped or set to NULL or to default value, making good decision about this is very important for backward/forward compatibility of transformation. I would say many business success also depends on how this problem is solved! Many integration nightmare can be avoided if this is done properly.

So far we were discussing about single source system but for many use case data from other systems is required to do some transformation like converting userid to name , deriving new column value , lookup encoding and many more.

Adding multiple source system adds complexity in transformation to handle missing data , stale data and many more.

As datasystems are evolving so it is not only about relation store today we see key-value store , document store , graph db , column store , cache , logs etc.

New datasystems are distributed also, so this adds another dimension to complexity of transformation.

Our old relational databases can be also described as it is built using ETL pattern by using change log as source for everything database does

One of the myth about ETL is that it is batch process but that is changing overtime with Stream processor (i.e Spark Streaming , Flink etc) and Pub Sub systems ( Kafka , Pulsur etc). This  enables to do transformation immediately after event is pushed to source system.

Don't get too much carried away by Streaming buzzword, no matter which stream processor or pub sub system you use but you still have to handle above stated challenges or leverage on some of new platform to take care of that.

Invest in transformation/business logic because it is key to building successful system that can be maintained and scaled. 
Keeping it stateless, metadata driven, handle duplicate/retry etc, more importantly write Tests to take good care of it in fast changing time.

Next time when you get below question on your ETL process 
Do you process real time or batch ? 

You answer should be 
It is event based processing.

Long live E T L

Friday 10 April 2020

Testing using mocks

Mock objects are very useful if used right way. I shared some of the experience of using Mock Objects in need-driven-software-development-using post.

What is Mocking in Testing? - Piraveena Paralogarajah - Medium

In this post i share 2 things
- Contract based testing using mocks.
- Patterns to organized mock code.

Contract based testing 
Lets take scenario where you are building Money remittance service. Key component in such type of service is Currency Converter , Bank Service & FX Service.

50000 feet design of fictitious forex service will look something like below.

We have to write FX Service that needs Currency converter & Bank Transfer service. This is perfect scenario for contact based testing

Code snippet for FXService

Our new FX service has to follow below contract
  • Interact with currency converter & Bank Transfer based on input/output contract.
  • Makes 1 call to each of service.

One way to test FX service is to call the real service but that means slow running test and dependency on service that it has to up whenever our test is executing. Sometime calling real service is not an option because it is not developed yet. 

Smart way is to mock these collaborator( Currency Converter & Bank Transfer) and verify interaction using mocking framework.
Another advantage of testing with mocks that it enables to verify that both currency & bank transfer service are used by fxservice in expected way.

Lets look at mock based test.

This test is written using EasyMock framework and is mocking reply from collaborators.   

Write the test that you want to read

One of the important property of good test is that it is enjoyable to read. 
Mocks can make this goal more difficult to achieve as setup code for unit test will have very complex assembling logic that will be mix of some normal object set and some mocking expectation. I am sure you have seen before function in test that is used as dumping ground for setup required for all the tests in class. 

Lets look at some mock code we used earlier and try to improve it

Another way

Both of the above code is doing same thing but later one which is written with jmock has nice sugar method to express same thing.
This helps in keeping expectation clean and in context with code that is being tested. Collaborator object in the context are mocked out.

Simple pattern but very effective in making test readable.

Code used in this post is available on github