
Saturday, 23 May 2020

Data modeling is everything

Everyone is aware of relational data modeling; it has served the industry for a long time. But as data pressure increased, relational data modeling based on Edgar F. Codd's rules stopped scaling well.



Those rules were shaped by the hardware limits of the 1970s, and RDBMS vendors took all of that and built databases that were a good fit for the hardware of that era.

We are in 2020 and times have changed: hardware is much cheaper and better. Look at how storage prices have dropped over time.



Many data systems have taken advantage of cheap storage to build highly available and reliable systems. Some RDBMS are still playing a catch-up game; I would say NoSQL has taken the lead by leveraging this.

Data modeling is very different when storage is not an issue or a bottleneck. Today the limit is the CPU, because CPUs are not getting faster; you can have more of them, but not faster ones. Let's look at some data modeling techniques that can be used today, even with your favorite RDBMS, to get blazing fast performance.


One thing before diving into modeling: the real world is relational and data will always have relations, so any modeling technique that can't handle 1-to-1, 1-to-many, or many-to-many relationships is useless. Each data model has trade-offs; it is designed with a purpose and may be optimized for writes or for reads for a specific access pattern.

A few things that make an RDBMS slow are unbounded table scans, joins, and aggregation. If we build a data model that does not need these slow operations, then we can build blazing fast systems!
Most RDBMS are key-value stores with a B-Tree index on top. If every query to the DB can be turned into a key lookup or a small index scan, then we get the best out of the database.

With that, let's dive into some of the ideas for avoiding joins.

Use non-scalar attribute types
Especially for relationships like customer to preference, customer to address, FedEx delivery to destinations, or customer to payment options, we tend to create a child table with multiple rows and then join on a foreign key at read time.

If such a request comes to your system millions of times a day, then it is not really a good option to do the join a million times. What should we do instead?
Welcome to the non-scalar types of your data storage system: use maps, lists, vectors, or blobs.

I know this may sound crazy, but just by using the types stated above you have avoided a join and saved big CPU cost a million times over. You should also know the trade-off of this pattern: it is no longer possible to use plain SQL to see the value of a non-scalar column, but that is just a tooling gap and can be addressed.

A code snippet for such a data model:
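(The original snippet is an image, so here is a minimal sketch of the idea. It assumes the customer's addresses are serialized into a single JSON column; Jackson and the Address fields are used purely for illustration.)

import com.fasterxml.jackson.databind.ObjectMapper;
import java.util.List;

public class NonScalarColumnDemo {

    // Child rows that would normally live in a separate table joined at read time.
    public static class Address {
        public String street;
        public String city;
        public Address() {}
        public Address(String street, String city) { this.street = street; this.city = city; }
    }

    public static void main(String[] args) throws Exception {
        List<Address> addresses = List.of(
                new Address("12 Main St", "Springfield"),
                new Address("7 Office Park", "Shelbyville"));

        ObjectMapper mapper = new ObjectMapper();

        // Store the whole collection in one non-scalar column (JSON text, blob, array, map...),
        // so reading a customer is a single key lookup with no join.
        String addressesColumn = mapper.writeValueAsString(addresses);
        System.out.println("customer.addresses = " + addressesColumn);

        // At read time the application layer re-assembles the objects.
        Address[] loaded = mapper.readValue(addressesColumn, Address[].class);
        System.out.println("loaded " + loaded.length + " addresses without a join");
    }
}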



I have avoided many joins using this pattern, and the user experience has improved a lot with it. The database loads this column for free, and you can assemble it in the application layer if the database does not support the type natively.
One more thing to be aware of: the column value should not be unbounded. Put some limit on it and chunk it when the limit is crossed, to keep your favorite database happy.

Duplicate immutable attributes
This pattern needs more courage to use. If you clearly know that something in the system is immutable, like a product name, description, brand, or seller, then any time the product is referred to (for example, an order item referring to the product that was bought) you can copy all the immutable attributes of the product onto the order item.
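A minimal sketch of the idea (the names are illustrative): the order item carries its own copy of the immutable product attributes, so reading an order never joins back to the product table.

public class OrderItem {
    private final String orderId;
    private final String productId;

    // Immutable product attributes copied at write time, so showing order details
    // never has to join back to the product table.
    private final String productName;
    private final String productDescription;
    private final String brand;
    private final String seller;

    public OrderItem(String orderId, String productId, String productName,
                     String productDescription, String brand, String seller) {
        this.orderId = orderId;
        this.productId = productId;
        this.productName = productName;
        this.productDescription = productDescription;
        this.brand = brand;
        this.seller = seller;
    }

    public String describe() {
        return productName + " (" + brand + ") sold by " + seller + " - " + productDescription;
    }
}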
  

Showing order details is a very frequent access pattern on an e-commerce site, and this kind of solution helps with that access pattern.
The trade-off is that if the attributes are not immutable, then you have the overhead of updating the copies whenever the referenced entity changes; but if it does not change frequently, then you benefit in a big way.

- Single table or multi-entity table
This one will definitely get the most resistance because it seems like an insult to the RDBMS, but we have done this in the past. Remember those parent-child queries where both parent and child are stored in a single table and we join on the parent id? Employee and manager were modeled like this for many years.

Let's take another example of a parent-child relationship.
Class and student is also a classic example: a class will have many students and vice versa.

Any time we want to show the details of all the students for a specific class, we first query to get the class id and then all the students for that class, or we join on the class id.

How can we avoid the join or the sequential load dependency?
When we write a join query we are creating a de-normalized view at runtime and discarding it after the request is served. What if this view were created at write time? Then we have pre-joined the data and we avoid the join at runtime.


This model uses generic names (pk1, pk2) to identify entities in a single table. It looks a little different but is very powerful, because with pre-joined data we can answer questions like the following (a small sketch follows the list):

 - Single class details (pk1="class:c1", pk2="class:c1")
 - Single student for a given class (pk1="class:c1", pk2="student:s1")
 - Both the class and all its students (pk1="class:c1")
 - All the students of a given class (pk1="class:c1", pk2!="class:c1")
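A minimal sketch of this layout, modeling the single table as an ordered key-value store, which is roughly what a B-Tree backed table gives you; keys and values follow the class/student example above.

import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

public class SingleTableDemo {
    public static void main(String[] args) {
        // The "table" is an ordered map keyed by (pk1, pk2), which is roughly what a
        // B-Tree index over a composite primary key gives you.
        NavigableMap<String, String> table = new TreeMap<>();

        // Pre-joined rows written at write time: one row for the class, one per student.
        table.put("class:c1|class:c1",   "className=science, classDesc=Science for Primary kids");
        table.put("class:c1|student:s1", "studentName=Alice");
        table.put("class:c1|student:s2", "studentName=Bob");

        // Single class details: an exact key lookup (pk1 = pk2 = "class:c1").
        System.out.println(table.get("class:c1|class:c1"));

        // Both the class and all its students: a small range scan on the pk1 prefix.
        // Skipping the class row in this loop gives "all the students of the class".
        for (Map.Entry<String, String> row
                : table.subMap("class:c1|", "class:c1|\uffff").entrySet()) {
            System.out.println(row.getKey() + " -> " + row.getValue());
        }
    }
}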

Let's add one more scenario: get all the classes a student is attending. This just needs pk1 and pk2 to be flipped, making it something like:

Registration("student:s1", "class:C1", className = "science", classDesc = "Science for Primary kids")

Registration("student:s1", "class:C2", className = "Maths", classDesc = "Basic Algebra")

I would say this is one of the most unusual patterns and needs out-of-the-box thinking to give it a try, but with it you can solve many use cases.

 - Projected column index
This is a common one where projection and aggregation are done upfront and can be treated as a de-normalized view maintained at write time. If the aggregation is linear, like sum, avg, max, or min, then it can be computed incrementally. This avoids running a Group By query for every request. Think how much CPU can be saved by using this simple pattern.
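A minimal sketch of the incremental version (the OrderStats name and fields are illustrative): the aggregates are updated on every write, so reads never run a Group By.

// Running aggregates maintained at write time instead of a Group By at read time.
public class OrderStats {
    private long count;
    private double sum;
    private double min = Double.POSITIVE_INFINITY;
    private double max = Double.NEGATIVE_INFINITY;

    // Called on every write; each linear aggregate is updated incrementally.
    public void onNewOrder(double amount) {
        count++;
        sum += amount;
        min = Math.min(min, amount);
        max = Math.max(max, amount);
    }

    // Reads are plain field lookups, no scan or aggregation.
    public double average() { return count == 0 ? 0 : sum / count; }
    public double total() { return sum; }
    public double min() { return min; }
    public double max() { return max; }
}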

- Extreme projection index
This is an extension of the above pattern, taken to the next level by calculating the projection at every level of a hierarchy. It is very effective for hierarchy-based queries and can completely avoid grouping and aggregation at read time.

   
Conclusion
All the patterns shared here are nothing new; they take advantage of cheap storage to avoid de-normalizing at read time.
If all the joins and most of the aggregation can be done at write time, then reads are so fast that it might feel like you are not hitting any database at all.

In this post we did not touch on data partitioning strategy, but that also plays an important role in building systems that users love to use.

Time to get a little closer to your data and spend time understanding access patterns and hardware limits before designing a system!

Saturday, 25 April 2020

Immutability is everywhere, even in the hard disk.



Storage is cheap, and it is used as leverage for building many high-performance systems. If data is immutable then it is safe to share, and multiple copies of it can be maintained for various access patterns.


Many old design ideas like append-only logs, copy-on-write, log-structured merge trees, materialized views, and replication are getting popular due to affordable storage.

In software we have seen many examples where immutability is a key design decision: Spark is based on immutable RDDs, many key-value stores like LevelDB/RocksDB/HBase are based on immutable storage tables, column databases like Cassandra also take advantage of it, and HDFS is fully based on immutable file chunks/blocks.

It is interesting to see that our hardware friends are also using immutability.

A Solid State Drive (SSD) is broken into physical blocks, and each block supports a finite number of writes; each write operation causes some wear and tear to the block. Chip designers use a technique called wear leveling to evenly distribute the write load across blocks. The disk controller tracks the number of writes each block has gone through using non-volatile memory.

Wear leveling is based on the copy-on-write pattern, which is a flavor of immutability.
The disk maintains a logical address space that maps to physical blocks; the blocks that store the logical-address-to-physical-block mappings support more write operations than normal data blocks.

Each write operation, whether it is new data or an update, is written to a new place in a circular fashion to even out writes. This also helps in giving write guarantees when a power failure happens.

An SSD can be seen as a small distributed file system made of a name node (logical addresses) and data nodes (physical blocks).
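A toy sketch of this copy-on-write mapping; real flash translation layers are far more involved, and all names here are purely illustrative.

import java.util.HashMap;
import java.util.Map;

// Toy flash translation layer: every write goes to a fresh physical block and the
// logical->physical mapping is updated, instead of overwriting the old block in place.
public class ToyFtl {
    private final Map<Integer, Integer> logicalToPhysical = new HashMap<>();
    private final int[] writesPerBlock;
    private int nextBlock = 0;

    public ToyFtl(int blockCount) { this.writesPerBlock = new int[blockCount]; }

    public void write(int logicalAddress, byte[] data) {
        int physical = nextBlock;                          // pick the next block in circular order
        nextBlock = (nextBlock + 1) % writesPerBlock.length;
        writesPerBlock[physical]++;                        // wear is tracked per physical block
        logicalToPhysical.put(logicalAddress, physical);   // remap; the old block becomes garbage
        // the data itself would be written to 'physical' here
    }

    public Integer locate(int logicalAddress) { return logicalToPhysical.get(logicalAddress); }
}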

One question that might come to your mind: what happens to blocks that never change? Are they never used to their maximum number of writes?

Our hardware friends are very intelligent! They have come up with two algorithms:

Dynamic wear leveling
Only blocks undergoing re-writes are moved to new blocks. This algorithm is not optimal, because blocks holding read-only data never get the same volume of writes, and the disk can become unusable even though it could take more writes.

Static wear leveling
This approach evens out the wear by also selecting blocks containing static data and relocating them. This algorithm is very important when software is built around immutable files.

Immutable designs are now affordable and are key to building successful distributed systems.

Friday, 17 April 2020

Long Live ETL

Extract-transform-load is the process of pulling data from one data system and loading it into another. The data systems involved are called the source system and the target system.

The shape of the data in the source system does not match the target system, so some conversion is required to make it compatible; that process is called transformation. A transformation is made of map/filter/reduce operations.
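A minimal sketch of a transformation expressed as map/filter/reduce, with made-up source and target record shapes:

import java.util.List;
import java.util.stream.Collectors;

public class TransformDemo {
    // Hypothetical source and target shapes.
    record SourceOrder(String orderId, String country, double amountInCents) {}
    record TargetOrder(String orderId, double amountInDollars) {}

    public static void main(String[] args) {
        List<SourceOrder> extracted = List.of(
                new SourceOrder("o1", "US", 1250),
                new SourceOrder("o2", "IN", 900),
                new SourceOrder("o3", "US", 4300));

        // filter + map: keep only US orders and convert cents to dollars.
        List<TargetOrder> transformed = extracted.stream()
                .filter(o -> "US".equals(o.country()))
                .map(o -> new TargetOrder(o.orderId(), o.amountInCents() / 100.0))
                .collect(Collectors.toList());

        // reduce: a simple aggregate loaded alongside the rows.
        double totalDollars = transformed.stream()
                .mapToDouble(TargetOrder::amountInDollars)
                .sum();

        System.out.println(transformed + ", total=" + totalDollars);
    }
}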


To handle the incompatibility between data systems, some metadata is required. What type of metadata will be useful?
It is very common for the source data to be transformed into many different shapes to handle various business use cases, so it makes sense to use descriptive metadata for the source system and prescriptive metadata for the target system.

Metadata plays an important role in making the system both backward and forward compatible.
 
Many times just having metadata is not enough, because some source/target system data is too large or too small to fit.



This is the situation where transformation becomes interesting. It means some values have to be dropped, set to NULL, or set to a default value; making good decisions about this is very important for the backward/forward compatibility of the transformation. I would say much business success also depends on how this problem is solved! Many integration nightmares can be avoided if this is done properly.

So far we have been discussing a single source system, but for many use cases data from other systems is required to do a transformation, like converting a user id to a name, deriving a new column value, lookup encoding, and many more.

Adding multiple source systems adds complexity to the transformation, to handle missing data, stale data, and more.

As data systems evolve, it is not only about relational stores; today we see key-value stores, document stores, graph DBs, column stores, caches, logs, etc.

New data systems are also distributed, so this adds another dimension to the complexity of transformation.

Our old relational databases can also be described as being built using the ETL pattern, with the change log as the source for everything the database does.

One myth about ETL is that it is a batch process, but that is changing over time with stream processors (e.g. Spark Streaming, Flink) and pub-sub systems (Kafka, Pulsar). This enables the transformation to be done immediately after an event is pushed to the source system.

Don't get too carried away by the streaming buzzword: no matter which stream processor or pub-sub system you use, you still have to handle the challenges stated above, or lean on a platform that takes care of them for you.

Invest in the transformation/business logic, because it is key to building a successful system that can be maintained and scaled.
Keep it stateless and metadata driven, handle duplicates and retries, and, more importantly, write tests to take good care of it in fast-changing times.

Next time you get the question below about your ETL process:
Do you process in real time or in batch?

Your answer should be:
It is event-based processing.


Long live E T L

Friday, 10 April 2020

Testing using mocks

Mock objects are very useful if used the right way. I shared some of my experience of using mock objects in the need-driven-software-development-using post.


In this post I share two things:
- Contract-based testing using mocks.
- Patterns to organize mock code.


Contract based testing
Let's take a scenario where you are building a money remittance service. The key components in such a service are a Currency Converter, a Bank Service, and an FX Service.

A 50,000-feet design of the fictitious forex service would look something like below.





We have to write the FX Service, which needs the Currency Converter and the Bank Transfer service. This is a perfect scenario for contract-based testing.

Code snippet for FXService:
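(The original snippet is an image; this is a rough sketch of what such an FXService could look like. The collaborator names CurrencyConverter and BankService, and the transfer method signature, are assumptions based on the description below.)

// A rough sketch of the fictitious FX service; collaborator interfaces are assumed.
public class FXService {
    private final CurrencyConverter currencyConverter;
    private final BankService bankService;

    public FXService(CurrencyConverter currencyConverter, BankService bankService) {
        this.currencyConverter = currencyConverter;
        this.bankService = bankService;
    }

    // Converts the amount and asks the bank to transfer it: one call to each collaborator.
    public double transfer(double amount, String fromCurrency, String toCurrency, String account) {
        double convertedAmount = currencyConverter.convert(amount, fromCurrency, toCurrency);
        bankService.transfer(convertedAmount, account);
        return convertedAmount;
    }
}

interface CurrencyConverter {
    double convert(double amount, String fromCurrency, String toCurrency);
}

interface BankService {
    void transfer(double amount, String account);
}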



Our new FX service has to follow the contract below:
  • Interact with the currency converter & bank transfer service based on an input/output contract.
  • Make one call to each of the services.

One way to test the FX service is to call the real services, but that means slow-running tests and a dependency on those services being up whenever our test executes. Sometimes calling the real service is not an option because it is not developed yet.

The smart way is to mock these collaborators (Currency Converter & Bank Transfer) and verify the interactions using a mocking framework.
Another advantage of testing with mocks is that it verifies that both the currency and bank transfer services are used by the FXService in the expected way.

Let's look at a mock-based test.
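(A sketch of such a test using EasyMock, written against the hypothetical FXService and collaborators sketched above.)

import org.easymock.EasyMock;
import org.junit.Test;
import static org.junit.Assert.assertEquals;

public class FXServiceTest {

    @Test
    public void transfersConvertedAmountUsingCollaborators() {
        // Mock the collaborators instead of calling the real services.
        CurrencyConverter currencyConverter = EasyMock.createMock(CurrencyConverter.class);
        BankService bankService = EasyMock.createMock(BankService.class);

        // Record the expected interactions: exactly one call to each collaborator.
        EasyMock.expect(currencyConverter.convert(100.0, "USD", "INR")).andReturn(7500.0);
        bankService.transfer(7500.0, "account-1");
        EasyMock.replay(currencyConverter, bankService);

        FXService fxService = new FXService(currencyConverter, bankService);
        double transferred = fxService.transfer(100.0, "USD", "INR", "account-1");

        assertEquals(7500.0, transferred, 0.0);
        // Verify the contract: both collaborators were called as expected.
        EasyMock.verify(currencyConverter, bankService);
    }
}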



This test is written using the EasyMock framework and mocks the replies from the collaborators.

Write the test that you want to read

One of the important properties of a good test is that it is enjoyable to read.
Mocks can make this goal harder to achieve, because the setup code for a unit test can turn into very complex assembly logic that mixes normal object setup with mocking expectations. I am sure you have seen a before function in a test class used as a dumping ground for the setup required by all the tests in the class.

Let's look at the expectation setup from the EasyMock test above and try to improve it.


Another way
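(A sketch of the same expectations written with jMock, again against the hypothetical FXService and collaborators from above.)

import org.jmock.Expectations;
import org.jmock.Mockery;
import org.junit.Test;
import static org.junit.Assert.assertEquals;

public class FXServiceJMockTest {

    private final Mockery context = new Mockery();

    @Test
    public void transfersConvertedAmountUsingCollaborators() {
        CurrencyConverter currencyConverter = context.mock(CurrencyConverter.class);
        BankService bankService = context.mock(BankService.class);

        // The expectations read as a small DSL, in context with the code under test.
        context.checking(new Expectations() {{
            oneOf(currencyConverter).convert(100.0, "USD", "INR");
            will(returnValue(7500.0));
            oneOf(bankService).transfer(7500.0, "account-1");
        }});

        FXService fxService = new FXService(currencyConverter, bankService);
        assertEquals(7500.0, fxService.transfer(100.0, "USD", "INR", "account-1"), 0.0);

        context.assertIsSatisfied();
    }
}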

Both versions do the same thing, but the latter one, written with jMock, has nice sugar methods to express it.
This helps keep the expectations clean and in context with the code being tested; the collaborator objects in the context are mocked out.

A simple pattern, but very effective in making tests readable.

The code used in this post is available on github.

Thursday, 19 March 2020

Incremental build with maven

This is 2020, and if you are starting any new Java-based project then Gradle should be the first option; but if for some reason you are still stuck with Maven, then you might find this post useful.


The Maven Java/Scala compiler plugins have decent support for incremental compilation, but they are not able to handle a few edge cases, like:

  • Triggering compilation when a file is deleted from the source folder.
  • Skipping unit tests when the code has not changed.

Just to handle the deleted-file scenario, most of the time we have to run "mvn clean install", which means the full code is compiled and all unit tests are executed.

Compilation of Scala code is slow, and if the project contains slow-running tests (starting a web server, a Spark context, IO, etc.) then this gets even worse. In many cases the wait time can be minutes.
I am not even accounting for the CPU cycles wasted on running tests when the code has not changed.

As an experiment I took some ideas from Gradle and wrote an add-on Maven plugin that handles the above issues by:

 1. Cleaning the target location when code has changed and triggering a full build.
 2. Skipping unit test execution when code has not changed.

Both of these features can help reduce build time significantly, because most of the time only a few modules change and the previous build output can be reused. You can get blazing fast builds by enabling this plugin.
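The core idea is simple change detection. A rough sketch of how it could be done (this is not the plugin's actual code) is to digest the source tree and compare it with the digest stored by the previous build:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;
import java.util.stream.Stream;

// Rough sketch: decide whether anything under the source folder changed since the
// last build by comparing a digest of all source files with the digest saved earlier.
public class ChangeDetector {

    public static boolean sourceChanged(Path sourceDir, Path digestFile)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        try (Stream<Path> files = Files.walk(sourceDir)) {
            files.filter(Files::isRegularFile).sorted().forEach(file -> {
                try {
                    digest.update(file.toString().getBytes());
                    digest.update(Files.readAllBytes(file));
                } catch (IOException e) {
                    throw new RuntimeException(e);
                }
            });
        }
        String current = Base64.getEncoder().encodeToString(digest.digest());
        String previous = Files.exists(digestFile) ? Files.readString(digestFile) : "";
        Files.writeString(digestFile, current);
        return !current.equals(previous);
    }
}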

How to use the plugin
The plugin is added at the pre-clean stage; add the entry below to pom.xml and use "mvn pre-clean install".

<plugin>
    <groupId>mavenplugin</groupId>
    <artifactId>compilerplugin</artifactId>
    <version>1.0-SNAPSHOT</version>
    <executions>
        <execution>
            <id>pre-clean</id>
            <phase>pre-clean</phase>
            <goals>
                <goal>inc</goal>
            </goals>
        </execution>
    </executions>
</plugin>


Plugin code is available @ compilerplugin github repo

Sandbox code using the plugin is available @ compilerplugintest github repo

Conclusion
Always collect metrics on your build: how long it takes to compile, time taken by tests, package size, dependencies, etc. Once you start measuring, you will notice how slow builds are; they need the same love as code.

A fast build is the first step that enables continuous delivery.

Saturday, 14 March 2020

Hands on Optional value

Optional is in the air due to the coronavirus; everything is becoming optional, like optional public gatherings, optional work from home, optional travel, etc.


I thought it was a good time to talk about the real "Optional" in software engineering, the one that deals with NULL references.

Tony Hoare confessed that he made a billion-dollar mistake by inventing null. If you have not seen his talk, then I suggest having a look at Null-References-The-Billion-Dollar-Mistake.

I will share some of the anti-patterns with null and how they can be solved using an abstraction like Optional or Maybe.

For this example we will use a simple value object that can have some null values.
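(The original snippet is an image; this is a minimal sketch of such a value object, with assumed field names.)

// Minimal sketch of the value object; email and phoneNumber may be null.
public class Person {
    private final String fullName;
    private final String email;        // may be null
    private final String phoneNumber;  // may be null

    public Person(String fullName, String email, String phoneNumber) {
        this.fullName = fullName;
        this.email = email;
        this.phoneNumber = phoneNumber;
    }

    public String getFullName() { return fullName; }
    public String getEmail() { return email; }
    public String getPhoneNumber() { return phoneNumber; }
}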


This value object can have a null value for email & phone number.

Scenario: contact the person on both email and phone number.

Not using Optional
The first attempt is based on checking for null, like below.
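(A minimal sketch, reusing the Person sketch above; sendEmail and sendSms are hypothetical helpers.)

public class ContactService {
    // Null checks everywhere; it is easy to miss one and get a NullPointerException.
    public void contact(Person person) {
        if (person.getEmail() != null) {
            sendEmail(person.getEmail());
        }
        if (person.getPhoneNumber() != null) {
            sendSms(person.getPhoneNumber());
        }
    }

    private void sendEmail(String email) { System.out.println("email -> " + email); }
    private void sendSms(String phone) { System.out.println("sms -> " + phone); }
}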

This is how it has been done for years. A similar pattern shows up with collection results.

Using Optional the wrong way
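(A sketch of the same contact logic wrapped in Optional but still branching by hand, again reusing the hypothetical Person and helper.)

import java.util.Optional;

class WrongOptionalContact {
    // Wrapping in Optional but still branching manually throws away its benefits.
    void contact(Person person) {
        Optional<String> email = Optional.ofNullable(person.getEmail());
        if (email.isPresent()) {
            sendEmail(email.get());
        } else {
            System.out.println("no email for " + person.getFullName());
        }
    }

    private void sendEmail(String email) { System.out.println("email -> " + email); }
}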

This is a little better, but all the goodness of Optional is thrown away by adding an if/else block to the code.

Always Happy optional
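(A sketch of the always-happy version, again with the hypothetical Person.)

import java.util.Optional;

class AlwaysHappyContact {
    void contact(Person person) {
        Optional<String> email = Optional.ofNullable(person.getEmail());
        // Assumes the value is always present; throws NoSuchElementException when it is not.
        System.out.println("email -> " + email.get());
    }
}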


It is good to be happy, but when you try that with Optional you are making a big assumption, or you don't need Optional at all.

Nested property optional
For this scenario we will extend the Person object and add a Home property. Not everyone owns a home, so it is a good candidate for a value that may not be available.
Let's see how reading a nested property, such as the insurance on the person's home, works in this case.
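(A sketch of what that tends to look like; Home, Insurance, and the getters are hypothetical additions for this scenario.)

class Insurance {
    private final String name;
    Insurance(String name) { this.name = name; }
    String getName() { return name; }
}

class Home {
    private final Insurance insurance; // may be null
    Home(Insurance insurance) { this.insurance = insurance; }
    Insurance getInsurance() { return insurance; }
}

class PersonWithHome {
    private final Home home; // not everyone owns a home
    PersonWithHome(Home home) { this.home = home; }
    Home getHome() { return home; }
}

class NestedNullChecks {
    // Every level may be null, so the checks start to nest.
    String insuranceName(PersonWithHome person) {
        if (person.getHome() != null) {
            if (person.getHome().getInsurance() != null) {
                return person.getHome().getInsurance().getName();
            }
        }
        return "NO_INSURANCE";
    }
}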

This is where it starts to get worse: the code ends up with tons of nested null checks.

Priority based default
For this scenario we first try to contact the person at the home address, and if it is not available then at the office address.
Such scenarios require advanced control flow with early returns, which makes the code hard to understand and maintain.
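(A sketch of the early-return version; the address fields and getters are hypothetical.)

class PersonWithAddresses {
    private final String homeAddress;   // may be null
    private final String officeAddress; // may be null

    PersonWithAddresses(String homeAddress, String officeAddress) {
        this.homeAddress = homeAddress;
        this.officeAddress = officeAddress;
    }

    String getHomeAddress() { return homeAddress; }
    String getOfficeAddress() { return officeAddress; }
}

class PriorityBasedContact {
    // Early returns and null checks just to pick the first available address.
    String addressFor(PersonWithAddresses person) {
        if (person.getHomeAddress() != null) {
            return person.getHomeAddress();
        }
        if (person.getOfficeAddress() != null) {
            return person.getOfficeAddress();
        }
        return "UNKNOWN";
    }
}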

These are some of the common patterns where Optional is not used, or is used in the wrong way.

Optional usage patterns
Let's look at some of the good ways of using Optional.

Make a property Optional based on domain knowledge
It is very easy to make a property Optional, as the sketch below shows.

Yes, it is allowed to make a getter return Optional; no one will hang you for that, so feel free to do it without fear. Once that change is done, we can write something like below.
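(A sketch of both steps, with illustrative names: the getter exposes Optional, and the caller no longer branches explicitly.)

import java.util.Optional;

public class PersonWithOptional {
    private final String email; // may be absent

    public PersonWithOptional(String email) { this.email = email; }

    // Domain knowledge says email is optional, so the getter says so too.
    public Optional<String> getEmail() {
        return Optional.ofNullable(email);
    }
}

// Usage: no explicit if/else in the application layer.
class ContactViaOptional {
    void contact(PersonWithOptional person) {
        person.getEmail().ifPresent(email -> System.out.println("email -> " + email));
    }
}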

It looks neat: the first step toward code without explicit if/else in the application layer.

Use some power of Optional
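(A sketch assuming the person record carries a hypothetical marketing opt-in flag.)

import java.util.Optional;

class OptInPerson {
    private final String email;   // may be absent
    private final boolean optIn;  // hypothetical marketing opt-in flag

    OptInPerson(String email, boolean optIn) {
        this.email = email;
        this.optIn = optIn;
    }

    Optional<String> getEmail() { return Optional.ofNullable(email); }
    boolean isOptIn() { return optIn; }
}

class OptInContact {
    void contact(OptInPerson person) {
        person.getEmail()
              .filter(email -> person.isOptIn())  // only contact people who opted in
              .ifPresent(email -> System.out.println("email -> " + email));
    }
}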

Optional is just like a stream: we get all the functional support like map, filter, etc. In the sketch above we check an opt-in flag before contacting.

Always happy optional
An always-happy Optional that calls "get" without a check will cause a runtime error at midnight on a Sunday, so it is advised to use ifPresent instead, as in the contact examples above.

Nested Optional
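(A sketch reworking the earlier Home/Insurance example so the getters return Optional.)

import java.util.Optional;

class Insurance {
    private final String name;
    Insurance(String name) { this.name = name; }
    String getName() { return name; }
}

class Home {
    private final Insurance insurance; // may be absent
    Home(Insurance insurance) { this.insurance = insurance; }
    Optional<Insurance> getInsurance() { return Optional.ofNullable(insurance); }
}

class OptionalPerson {
    private final Home home; // may be absent
    OptionalPerson(Home home) { this.home = home; }
    Optional<Home> getHome() { return Optional.ofNullable(home); }
}

class NestedOptionalDemo {
    static String insuranceName(OptionalPerson person) {
        // flatMap flattens Optional<Optional<Insurance>> into Optional<Insurance>.
        return person.getHome()
                     .flatMap(Home::getInsurance)
                     .map(Insurance::getName)
                     .orElse("NO_INSURANCE");
    }
}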

flatMap does the magic: it handles the null check for home and converts the insurance object as well.

Priority based default
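(A sketch that picks the first address that has a value; the method shape is illustrative.)

import java.util.Optional;
import java.util.stream.Stream;

class PriorityAddressDemo {
    static String addressFor(Optional<String> homeAddress, Optional<String> officeAddress) {
        // Pick the first address that has a value, falling back to a default.
        return Stream.of(homeAddress, officeAddress)
                     .filter(Optional::isPresent)
                     .map(Optional::get)
                     .findFirst()
                     .orElse("UNKNOWN");
    }

    public static void main(String[] args) {
        System.out.println(addressFor(Optional.empty(), Optional.of("42 Office Park")));
    }
}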

This example takes both the home and office addresses and picks the first one that has a value for sending the notification. This pattern avoids lots of nested branching.

Else branch
Optional has lots of ways to handle the else part of the scenario, like returning a default value (orElse), a lazily computed default value (orElseGet), or throwing an exception (orElseThrow).
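(A small sketch of the three options.)

import java.util.Optional;

class ElseBranchDemo {
    static String defaultEmail() { return "fallback@example.com"; }

    public static void main(String[] args) {
        Optional<String> email = Optional.empty();

        String withDefault = email.orElse("unknown@example.com");               // static default
        String withLazyDefault = email.orElseGet(ElseBranchDemo::defaultEmail); // computed only when absent
        System.out.println(withDefault + ", " + withLazyDefault);

        // Fails fast when the value is required.
        email.orElseThrow(() -> new IllegalStateException("email is mandatory here"));
    }
}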

What is not good about Optional
Every design choice has trade-offs, and Optional has some too. It is important to know what they are so that you can make a careful decision.
Memory overhead
Optional is a container that holds a value, so an extra object is created for every value. Special consideration is required when it holds a primitive value. If performance-sensitive code would be impacted by the extra object creation, then it is better to use null.

Memory indirection
As Optional is a container, every access to the value needs an extra jump to get to the real value. Optional is not a good choice for elements in an array or a collection.

No serialization
I think this is a good decision by the JDK team, as it discourages people from making instance variables Optional. You can wrap an instance variable in an Optional at runtime, or when it is required for processing.

All the examples used in this post are available @ optionals github repo

If you like the post then you can follow me on twitter.

Monday, 20 January 2020

Better product by documenting trust boundary

Have you ever faced an issue where you trusted a system, team, or product and that resulted in the failure of a feature or system?




In fast-changing progressive delivery, the definition of trust keeps changing. Some of the trust issues that happen are:
  • Other systems send junk data.
  • Other systems do not maintain data constraints like uniqueness, non-null, or referential integrity.
  • Third-party libraries causing unknown side effects.
  • Nitpicking users or QA.
  • Ignorant users.
  • Demanding product owners.
As an engineer you have to continuously revisit the trust boundary, otherwise someone else will find the gap and exploit the system. Trust issues happen between teams and within a single team, so it is important to establish trust boundaries for teams to function smoothly.

Knowledge gaps play a big role in the trust assumptions that teams or individuals make. The best way to prevent trust issues is to document them as a group that includes developers, QA, and business. Once the line is drawn, it becomes easy to make decisions about different parts of the system.

The best part of agreeing on a trust boundary is that it gives a common framework for discussion and avoids unproductive arguments.

So next time, before starting to build any new feature or system, list down how much you trust the other systems or teams. Create triggers or alerts for every trust boundary violation, so that you can react when something unexpected happens.

Bug reports also contain lots of clues about trust boundary violations, so before fixing a bug, revisit the trust boundary and make adjustments.