Saturday, 25 April 2020

Immutability is everywhere, even in hard disks



Storage is cheap, and that is used as leverage for building many high-performance systems. If data is immutable, it is safe to share it and to maintain multiple copies of it for various access patterns.


Many old design ideas like append-only logs, copy-on-write, log-structured merge trees, materialized views and replication are getting popular due to affordable storage.

In software we have seen many examples where immutability is a key design decision: Spark is based on immutable RDDs, many key-value stores like LevelDB, RocksDB and HBase are based on immutable storage tables, column databases like Cassandra also take advantage of it, and HDFS is built on immutable file chunks/blocks.

It is interesting to see that our hardware friends are also using immutability. 

A Solid State Drive (SSD) is divided into physical blocks, and each block supports a finite number of writes; each write operation causes some wear and tear on the block. Chip designers use a technique called wear leveling to distribute the write load evenly across blocks. The disk controller tracks the number of writes each block has gone through using non-volatile memory.

Wear leveling is based on the copy-on-write pattern, which is a flavor of immutability.
The disk maintains a logical address space that maps to physical blocks. The blocks that store the logical-to-physical mapping support more write operations than normal data blocks.

Each write operation, whether it is new data or an update, is written to a new place in circular fashion to even out writes. This also helps in giving a write guarantee when a power failure happens, because the old copy stays intact until the new one is written.
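To make the copy-on-write idea concrete, here is a minimal sketch in Java. It is a toy model and not how any real controller firmware is written; the class name CopyOnWriteDisk and all of its methods are made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of an SSD controller: updates never overwrite a block in place. */
final class CopyOnWriteDisk {
    private final String[] physical;   // content of each physical block
    private final boolean[] live;      // true while a logical address points at the block
    private final int[] writeCount;    // wear counter per physical block
    private final Map<Integer, Integer> mapping = new HashMap<>(); // logical -> physical
    private int next = 0;              // circular write pointer

    CopyOnWriteDisk(int blocks) {
        physical = new String[blocks];
        live = new boolean[blocks];
        writeCount = new int[blocks];
    }

    /** New writes and updates take the same path: write elsewhere, then remap. */
    void write(int logicalAddress, String data) {
        int target = nextFreeBlock();
        physical[target] = data;
        writeCount[target]++;
        Integer old = mapping.put(logicalAddress, target); // switch the mapping last
        live[target] = true;
        if (old != null) {
            live[old] = false;         // the old copy becomes reclaimable
        }
    }

    String read(int logicalAddress) {
        Integer block = mapping.get(logicalAddress);
        return block == null ? null : physical[block];
    }

    int physicalBlockOf(int logicalAddress) {
        return mapping.get(logicalAddress);
    }

    int writesTo(int physicalBlock) {
        return writeCount[physicalBlock];
    }

    /** Scans forward from the write pointer and skips blocks that hold live data. */
    private int nextFreeBlock() {
        for (int i = 0; i < physical.length; i++) {
            int candidate = (next + i) % physical.length;
            if (!live[candidate]) {
                next = (candidate + 1) % physical.length;
                return candidate;
            }
        }
        throw new IllegalStateException("no free physical block");
    }
}
```

Writing the same logical address twice lands on two different physical blocks, and the first block keeps its old content until the mapping has been switched.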

An SSD can be seen as a small distributed file system made of a name node (the logical address map) and data nodes (the physical blocks).

One question that might come to your mind: what happens to blocks that never change? Do they never get used up to their maximum number of writes?

Our hardware friends are very intelligent! They have come up with 2 algorithms.

Dynamic wear leveling
Blocks undergoing rewrites are written to new blocks. This algorithm is not optimal, because blocks holding read-only data never get the same volume of writes, and the disk can become unusable even though those blocks could take many more writes.

Static wear leveling 
This approach tries to balance wear across all blocks by also selecting blocks that contain static data and relocating that data, so that lightly worn blocks can take new writes. This is a very important algorithm when software is built around immutable files.
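The difference between the two algorithms can be sketched in a few lines of Java. This is a simplified model under my own assumptions: the class WearLeveler, its fields and the threshold parameter are hypothetical, and real controllers use far more elaborate policies.

```java
/** Toy model: per-block erase counters plus a flag for blocks pinned by static data. */
final class WearLeveler {
    private final int[] eraseCount;
    private final boolean[] holdsStaticData;

    WearLeveler(int[] eraseCount, boolean[] holdsStaticData) {
        this.eraseCount = eraseCount.clone();
        this.holdsStaticData = holdsStaticData.clone();
    }

    /** Dynamic leveling: pick the least worn block that is not pinned by static data. */
    int pickBlockForWrite() {
        int best = -1;
        for (int i = 0; i < eraseCount.length; i++) {
            if (!holdsStaticData[i] && (best == -1 || eraseCount[i] < eraseCount[best])) {
                best = i;
            }
        }
        return best;
    }

    /**
     * Static leveling: when the wear gap reaches the threshold, move static data from
     * the least worn pinned block onto the most worn unpinned block.
     * Returns the block that was freed, or -1 if nothing was moved.
     */
    int relocateStaticData(int threshold) {
        int coldest = -1;
        int hottest = -1;
        for (int i = 0; i < eraseCount.length; i++) {
            if (holdsStaticData[i]) {
                if (coldest == -1 || eraseCount[i] < eraseCount[coldest]) coldest = i;
            } else {
                if (hottest == -1 || eraseCount[i] > eraseCount[hottest]) hottest = i;
            }
        }
        if (coldest == -1 || hottest == -1
                || eraseCount[hottest] - eraseCount[coldest] < threshold) {
            return -1;
        }
        holdsStaticData[hottest] = true;   // the worn block now stores the static data
        holdsStaticData[coldest] = false;  // the fresh block is free for new writes
        eraseCount[hottest]++;             // the relocation itself costs one write
        return coldest;
    }
}
```

With dynamic leveling alone the barely used block is never picked while it is pinned; after one relocation it becomes the preferred target for new writes.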

Immutable designs are now affordable and are key to building successful distributed systems.

Friday, 17 April 2020

Long Live ETL

Extract, transform, load (ETL) is the process of pulling data from one data system and loading it into another. The data systems involved are called the source system and the target system.

The shape of data from the source system does not match the target system, so some conversion is required to make it compatible; that process is called transformation. A transformation is made of map/filter/reduce operations.
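As a small illustration of map/filter/reduce, here is a sketch using Java streams. The source shape (CSV rows of "userId,amountInCents,status") and the class name OrderTransform are assumptions made up for this example.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

final class OrderTransform {
    /** Source rows look like "userId,amountInCents,status" (hypothetical shape). */
    static Map<String, Long> totalPaidPerUser(List<String> sourceRows) {
        return sourceRows.stream()
                // map: parse the source shape
                .map(row -> row.split(","))
                // filter: keep only well-formed, paid orders
                .filter(cols -> cols.length == 3 && "PAID".equals(cols[2]))
                // reduce: sum the amounts per user into the target shape
                .collect(Collectors.groupingBy(
                        cols -> cols[0],
                        Collectors.summingLong(cols -> Long.parseLong(cols[1]))));
    }
}
```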


To handle the incompatibility between data systems, some metadata is required. What type of metadata will be useful?
It is very common for source data to be transformed into many different shapes to handle various business use cases, so it makes sense to use descriptive metadata for the source system and prescriptive metadata for the target system.

Metadata plays an important role in making a system both backward and forward compatible.
 
Many times just having metadata is not enough, because some values in the source or target system are too large or too small to fit.



This is the situation where transformation becomes interesting. Some values have to be dropped, set to NULL or set to a default value, and making a good decision about this is very important for the backward/forward compatibility of the transformation. I would say the success of many businesses also depends on how this problem is solved! Many integration nightmares can be avoided if this is done properly.
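One way to keep that decision explicit is to make the policy a parameter of the conversion. The sketch below assumes a source value of type long going into a 32-bit integer target column; FitPolicy and OnOverflow are hypothetical names.

```java
final class FitPolicy {
    /** What to do when a source value does not fit the target column. */
    enum OnOverflow { REJECT, SET_NULL, USE_DEFAULT }

    /** Narrows a source long into a target int column, applying the chosen policy. */
    static Integer toIntColumn(long source, OnOverflow policy, int defaultValue) {
        if (source >= Integer.MIN_VALUE && source <= Integer.MAX_VALUE) {
            return (int) source;
        }
        switch (policy) {
            case SET_NULL:
                return null;
            case USE_DEFAULT:
                return defaultValue;
            default:
                throw new ArithmeticException("value does not fit target column: " + source);
        }
    }
}
```

Because the policy is visible at the call site, a change from rejecting to defaulting shows up in code review instead of hiding inside the transformation.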

So far we have been discussing a single source system, but for many use cases data from other systems is required to do some transformation, like converting a user id to a name, deriving a new column value, looking up an encoding and many more.

Adding multiple source systems adds complexity to the transformation: it has to handle missing data, stale data and much more.
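A tiny sketch of the user id to name lookup, with an explicit decision for missing data. The lookup table is a plain in-memory map here, which is an assumption; in a real pipeline it would come from the second source system. The class name and the UNKNOWN placeholder format are made up.

```java
import java.util.Map;
import java.util.Optional;

final class UserNameLookup {
    /** Resolves a user id to a display name; missing users get a visible placeholder. */
    static String enrich(String userId, Map<String, String> userNamesById) {
        return Optional.ofNullable(userNamesById.get(userId))
                .orElse("UNKNOWN(" + userId + ")");
    }
}
```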

Data systems are evolving, so it is not only about relational stores any more: today we see key-value stores, document stores, graph databases, column stores, caches, logs etc.

New data systems are also distributed, which adds another dimension to the complexity of transformation.

Our old relational databases can also be described as being built on the ETL pattern, using the change log as the source for everything the database does.

One of the myths about ETL is that it is a batch process, but that is changing over time with stream processors (e.g. Spark Streaming, Flink) and pub/sub systems (e.g. Kafka, Pulsar). These make it possible to do the transformation immediately after an event is pushed to the source system.

Don't get too carried away by the streaming buzzword. No matter which stream processor or pub/sub system you use, you still have to handle the challenges stated above, or lean on one of the new platforms to take care of them.

Invest in transformation/business logic, because it is key to building a successful system that can be maintained and scaled.
Keep it stateless and metadata driven, handle duplicates and retries, and more importantly write tests to take good care of it in fast-changing times.
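Here is a minimal sketch of duplicate/retry handling. It assumes every event carries a unique id; the transformation itself stays a stateless function, and the set of seen ids lives outside it (in memory here, which is a simplification, since a real pipeline would need a durable store). DeduplicatingHandler is a hypothetical name.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.Consumer;

final class DeduplicatingHandler {
    private final Set<String> seenEventIds = new HashSet<>();
    private final Consumer<String> transform;

    DeduplicatingHandler(Consumer<String> transform) {
        this.transform = transform;
    }

    /** Returns true if the event was processed, false if it was a duplicate or retry. */
    boolean handle(String eventId, String payload) {
        if (!seenEventIds.add(eventId)) {
            return false;              // already processed, safe to ignore
        }
        transform.accept(payload);
        return true;
    }
}
```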

Next time you get the question below about your ETL process
Do you process real time or batch?

Your answer should be
It is event-based processing.


Long live E T L

Friday, 10 April 2020

Testing using mocks

Mock objects are very useful if used the right way. I shared some of my experience of using mock objects in the need-driven-software-development-using post.


In this post I share 2 things
- Contract-based testing using mocks.
- Patterns to organize mock code.


Contract based testing 
Let's take a scenario where you are building a money remittance service. The key components in this type of service are the Currency Converter, the Bank Service and the FX Service.

A 50,000-foot design of the fictitious forex service looks something like below.





We have to write the FX Service, which needs the Currency Converter and the Bank Transfer service. This is a perfect scenario for contract-based testing.

Code snippet for FXService



Our new FX service has to follow the contract below
  • Interact with the Currency Converter & Bank Transfer service based on the input/output contract.
  • Make 1 call to each service.

One way to test the FX service is to call the real services, but that means slow-running tests and a dependency on services that have to be up whenever our test is executing. Sometimes calling the real service is not an option because it is not developed yet.

The smart way is to mock these collaborators (Currency Converter & Bank Transfer) and verify the interaction using a mocking framework.
Another advantage of testing with mocks is that it lets us verify that both the currency and bank transfer services are used by the FX service in the expected way.
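To show the idea without depending on any framework, here is a minimal sketch that uses hand-written mocks instead of a mocking library. Every name in it (CurrencyConverter, BankService, FxService, remit, the 0.74 rate) is an assumption made up for illustration and is not the code from the repository.

```java
import java.math.BigDecimal;

interface CurrencyConverter {
    BigDecimal convert(String from, String to, BigDecimal amount);
}

interface BankService {
    String transfer(String account, String currency, BigDecimal amount);
}

/** Service under test: one call to each collaborator, as the contract demands. */
final class FxService {
    private final CurrencyConverter converter;
    private final BankService bank;

    FxService(CurrencyConverter converter, BankService bank) {
        this.converter = converter;
        this.bank = bank;
    }

    String remit(String account, String from, String to, BigDecimal amount) {
        BigDecimal converted = converter.convert(from, to, amount);
        return bank.transfer(account, to, converted);
    }
}

/** Hand-written mock: canned reply plus a call counter for verification. */
final class MockCurrencyConverter implements CurrencyConverter {
    int calls;

    @Override
    public BigDecimal convert(String from, String to, BigDecimal amount) {
        calls++;
        return amount.multiply(new BigDecimal("0.74")); // made-up rate for the test
    }
}

/** Hand-written mock that also records what it was asked to transfer. */
final class MockBankService implements BankService {
    int calls;
    BigDecimal lastAmount;

    @Override
    public String transfer(String account, String currency, BigDecimal amount) {
        calls++;
        lastAmount = amount;
        return "TX-1";
    }
}
```

A test then calls remit once and checks that each mock saw exactly one call and that the bank received the converted amount. A mocking framework generates this kind of recording object for you.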

Let's look at a mock-based test.



This test is written using the EasyMock framework and mocks the replies from the collaborators.

Write the test that you want to read

One of the important properties of a good test is that it is enjoyable to read.
Mocks can make this goal more difficult to achieve, as the setup code for a unit test will have very complex assembly logic that is a mix of normal object setup and mocking expectations. I am sure you have seen a "before" function in a test that is used as a dumping ground for the setup required by all the tests in the class.

Let's look at some of the mock code we used earlier and try to improve it.


Another way

Both of the above snippets do the same thing, but the latter one, written with jMock, has nice sugar methods to express it.
This helps in keeping expectations clean and in context with the code that is being tested. Collaborator objects in the context are mocked out.

A simple pattern, but very effective in making tests readable.

Code used in this post is available on GitHub

Thursday, 19 March 2020

Incremental build with maven

This is 2020, and if you are starting any new Java-based project then Gradle should be the first option, but if for some reason you are still stuck with Maven then you might find this post useful.


The Maven Java/Scala compiler plugins have decent support for incremental compilation, but they are not able to handle a few edge cases like

  • Trigger compilation when a file is deleted from the source folder.
  • Skip unit tests when code has not changed.

Just to handle the deleted-file scenario, most of the time we have to run "mvn clean install", and that means the full code is compiled and all unit tests are executed.

Compilation of Scala code is slow, and if the project contains slow-running tests that start a web server, a Spark context, do IO etc., then this becomes even worse. In many cases the wait time could be minutes.
And I am not even accounting for the CPU cycles wasted on running tests when the code has not changed.

As an experiment I took some ideas from Gradle and wrote an add-on Maven plugin that handles the issues stated above by

 1. Cleaning the target location when code has changed and triggering a full build.
 2. Skipping unit test execution when code has not changed.

Both of these features can help in reducing compilation time significantly, because most of the time only a few modules have changed and the previous build output can be reused. You can get blazing fast builds by enabling this plugin.
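Detecting "code has changed", including deleted files, can be done by fingerprinting the source folder and comparing it with the fingerprint saved by the previous build. The sketch below is one possible approach and not necessarily what the plugin does; SourceFingerprint is a hypothetical name.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class SourceFingerprint {
    /** SHA-256 over every relative path and file content under sourceRoot. */
    static String of(Path sourceRoot) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            List<Path> files;
            try (Stream<Path> walk = Files.walk(sourceRoot)) {
                // Sorting keeps the fingerprint independent of directory listing order.
                files = walk.filter(Files::isRegularFile).sorted().collect(Collectors.toList());
            }
            for (Path file : files) {
                // Hashing the path as well means a deleted or renamed file changes the result.
                String relative = sourceRoot.relativize(file).toString();
                digest.update(relative.getBytes(StandardCharsets.UTF_8));
                digest.update(Files.readAllBytes(file));
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    /** True when the source tree differs from the fingerprint of the previous build. */
    static boolean changed(String previousFingerprint, Path sourceRoot) {
        return !of(sourceRoot).equals(previousFingerprint);
    }
}
```

If the fingerprint is unchanged, both compilation and tests for that module can be skipped; if it differs, the target folder is cleaned and a full build runs.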

How to use plugin
This plugin is bound to the pre-clean phase. Add the entry below to pom.xml and use "mvn pre-clean install"

<plugin>
    <groupId>mavenplugin</groupId>
    <artifactId>compilerplugin</artifactId>
    <version>1.0-SNAPSHOT</version>
    <executions>
        <execution>
            <id>pre-clean</id>
            <phase>pre-clean</phase>
            <goals>
                <goal>inc</goal>
            </goals>
        </execution>
    </executions>
</plugin>


Plugin code is available @ compilerplugin github repo

Sandbox code using the plugin is available @ compilerplugintest github repo

Conclusion
Always collect metrics on the build, like how long it takes to compile, time taken by tests, package size, dependencies etc. Once you start measuring, you will notice how slow builds are, and that they need the same love as code.

A fast build is the first step that enables continuous delivery.

Saturday, 14 March 2020

Hands on Optional value

Optional is in the air due to coronavirus; everything is becoming optional, like optional public gatherings, optional work from home, optional travel etc.


I thought it is a good time to talk about the real "Optional" in software engineering, the one that deals with NULL references.

Tony Hoare confessed that he made a billion-dollar mistake by inventing null. If you have not seen his talk, I suggest having a look at Null-References-The-Billion-Dollar-Mistake.

I will share some of the anti-patterns with null and how they can be solved using an abstraction like Optional or Maybe.

For this example we will use a simple value object that can have some null values.


This value object can have a null value for email & phone number.

Scenario: Contact Person on both email and phone number

Not using optional
The first attempt will be based on null checks like below

This is how it has been done for years. One more common pattern shows up with collection results.
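A minimal sketch of the null-check style, assuming a hypothetical Person value object with nullable email and phone fields. The names and message formats are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class NullCheckContact {
    /** Hypothetical value object: email and phone may be null. */
    static final class Person {
        final String name;
        final String email;
        final String phone;

        Person(String name, String email, String phone) {
            this.name = name;
            this.email = email;
            this.phone = phone;
        }
    }

    /** Returns the messages that would be sent, one per available channel. */
    static List<String> contact(Person person) {
        List<String> sent = new ArrayList<>();
        if (person.email != null) {
            sent.add("email:" + person.email);
        }
        if (person.phone != null) {
            sent.add("sms:" + person.phone);
        }
        return sent;
    }
}
```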

Use optional in wrong way

This is a little better, but all the goodness of Optional is thrown away by adding an if/else block to the code.
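For illustration, here is a sketch of the isPresent/get style next to an equivalent expression that lets Optional do the branching. The class and method names are hypothetical.

```java
import java.util.Optional;

final class IsPresentContact {
    /** The if/else style: Optional is used like a nullable reference. */
    static String withIfElse(Optional<String> email) {
        if (email.isPresent()) {
            return "email:" + email.get();
        } else {
            return "none";
        }
    }

    /** Same behaviour expressed with map and orElse. */
    static String withMap(Optional<String> email) {
        return email.map(value -> "email:" + value).orElse("none");
    }
}
```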

Always Happy optional


It is good to be happy, but when you try that with Optional you are either making a big assumption or you don't need Optional.

Nested property optional
For this scenario we will extend the Person object and add a Home property. Not everyone owns a home, so it is a good candidate for a value that may not be available.
Let's see how the contact-person scenario works in this case

This is where it starts to get worse: the code will have tons of nested null checks.

Priority based default
For this scenario we first try to contact the person at the home address, and if it is not available then contact at the office address.
This type of scenario requires advanced control flow for early return and makes code hard to understand and maintain.

These are some of the common patterns where Optional is not used or is used in the wrong way.

Optional usage patterns
Let's look at some good ways of using Optional.

Make property optional based on domain knowledge
It is very easy to make a property optional.

Yes, it is allowed to make a getter return Optional; no one will hang you for that, so feel free to do it without fear. Once that change is done we can write something like below

It looks neat, and it is the first step to code without explicit if/else in the application layer.
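A minimal sketch of this pattern, assuming a hypothetical Person whose getters wrap the nullable fields in Optional:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

final class OptionalGetterContact {
    /** Hypothetical value object: fields stay nullable, getters expose Optional. */
    static final class Person {
        private final String email;
        private final String phone;

        Person(String email, String phone) {
            this.email = email;
            this.phone = phone;
        }

        Optional<String> getEmail() {
            return Optional.ofNullable(email);
        }

        Optional<String> getPhone() {
            return Optional.ofNullable(phone);
        }
    }

    /** No explicit null check or if/else in the calling code. */
    static List<String> contact(Person person) {
        List<String> sent = new ArrayList<>();
        person.getEmail().ifPresent(email -> sent.add("email:" + email));
        person.getPhone().ifPresent(phone -> sent.add("sms:" + phone));
        return sent;
    }
}
```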

Use some power of Optional

Optional is just like a stream: we get all the functional support such as map and filter. In the above example we are checking for OptIn before contacting.
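A sketch of the opt-in check using filter and map. It assumes a hypothetical Person with an optIn flag; the names are made up.

```java
import java.util.Optional;

final class OptInContact {
    /** Hypothetical value object with an opt-in flag. */
    static final class Person {
        private final String email;
        private final boolean optIn;

        Person(String email, boolean optIn) {
            this.email = email;
            this.optIn = optIn;
        }

        Optional<String> getEmail() {
            return Optional.ofNullable(email);
        }

        boolean isOptIn() {
            return optIn;
        }
    }

    /** Produces a message only when an email exists and the person opted in. */
    static Optional<String> emailMessage(Person person) {
        return person.getEmail()
                .filter(email -> person.isOptIn())
                .map(email -> "email:" + email);
    }
}
```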

Always happy optional
An always-happy Optional that calls "get" without a check will cause a runtime error on Sunday midnight, so it is advised to use ifPresent.

Nested Optional

flatMap does the magic: it handles the null check for home and converts to the insurance object as well.
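A sketch of how flatMap chains through nested optional properties. The Person, Home and Insurance classes here are hypothetical stand-ins for the value objects in the post.

```java
import java.util.Optional;

final class NestedOptional {
    static final class Insurance {
        final String provider;

        Insurance(String provider) {
            this.provider = provider;
        }
    }

    static final class Home {
        private final Insurance insurance; // may be null

        Home(Insurance insurance) {
            this.insurance = insurance;
        }

        Optional<Insurance> getInsurance() {
            return Optional.ofNullable(insurance);
        }
    }

    static final class Person {
        private final Home home; // may be null

        Person(Home home) {
            this.home = home;
        }

        Optional<Home> getHome() {
            return Optional.ofNullable(home);
        }
    }

    /** Empty as soon as any link in the chain is missing, with no nested null checks. */
    static Optional<String> insuranceProvider(Person person) {
        return person.getHome()
                .flatMap(Home::getInsurance)
                .map(insurance -> insurance.provider);
    }
}
```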

Priority based default

This example takes both the home & office address and picks the first one that has a value for sending the notification. This particular pattern avoids lots of nested if/else checks.
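One way to express "first one that has a value" is to stream over the candidates in priority order. This is a sketch with made-up names and may differ from the code in the repository.

```java
import java.util.Optional;
import java.util.stream.Stream;

final class PriorityAddress {
    /** Candidates are listed in priority order; the first present value wins. */
    static Optional<String> firstAvailable(Optional<String> home, Optional<String> office) {
        return Stream.of(home, office)
                .filter(Optional::isPresent)
                .map(Optional::get)
                .findFirst();
    }
}
```

On Java 9 and later, home.or(() -> office) expresses the same thing for two candidates.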

Else branch
Optional has lots of ways to handle the else part of a scenario, like returning a default value (orElse), a lazily computed default value (orElseGet) or throwing an exception (orElseThrow).
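The difference between the three is easiest to see when the default is expensive to compute. A small sketch with a hypothetical default address and a counter that shows when the default is actually computed:

```java
import java.util.Optional;

final class ElseBranch {
    static int defaultComputations = 0;

    /** Stands in for an expensive default, e.g. a lookup in another system. */
    static String expensiveDefault() {
        defaultComputations++;
        return "noreply@example.org";
    }

    /** orElse: the default is computed even when a value is present. */
    static String eager(Optional<String> email) {
        return email.orElse(expensiveDefault());
    }

    /** orElseGet: the default is computed only when the value is missing. */
    static String lazy(Optional<String> email) {
        return email.orElseGet(ElseBranch::expensiveDefault);
    }

    /** orElseThrow: a missing value is treated as an error. */
    static String strict(Optional<String> email) {
        return email.orElseThrow(() -> new IllegalStateException("email is required"));
    }
}
```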

What is not good about optional
Each design choice has some trade-offs, and Optional has some too. It is important to know what they are so that you can make a careful decision.
Memory overhead
Optional is a container that holds a value, so an extra object is created for every value. Special consideration is required when it holds a primitive value. If performance-sensitive code will be impacted by the extra object creation, then it is better to use null.

Memory indirection
As Optional is a container, every access to the value needs an extra jump to get the real value. Optional is not a good choice for elements in an array or collection.

No serialization
Optional is not Serializable. I think this is a good decision by the JDK team, as it discourages people from making instance variables Optional. You can wrap an instance variable in Optional at runtime or when required for processing.

All the examples used in this post are available @ optionals github repo

If you like the post then you can follow me on twitter.

Monday, 20 January 2020

Better product by documenting trust boundary

Have you ever faced an issue where you trusted a system, team or product and that resulted in the failure of a feature or system?




In fast-changing progressive delivery, the definition of trust keeps changing. Some of the trust issues that happen are:
  • Other system sends junk data.
  • Other system does not maintain data constraints like unique, null and referential integrity.
  • Third-party library causing unknown side effects.
  • Nitpicking user or QA.
  • Ignorant users.
  • Demanding product owner.
As an engineer you have to continuously revisit the trust boundary, otherwise someone else will find the gap and exploit the system. Trust issues happen between teams or within a single team, so it is important to establish trust boundaries for teams to function smoothly.

Knowledge gaps play a big role in the trust assumptions that teams or individuals make. The best way of preventing trust issues is to document them as a group that includes developers, QA and business. Once the line is drawn, it becomes easy to take decisions on different parts of the system.

The best part of agreeing on a trust boundary is that it gives a common framework for discussion and also avoids unproductive discussion.

So next time, before starting to build any new feature or system, list down how much you trust other systems or teams. Create triggers or alerts for every trust boundary violation, so that you can react when something unexpected happens.

Bug reports also hold lots of clues about trust boundary violations, so before fixing a bug, revisit the trust boundary and make adjustments.

Friday, 15 November 2019

Complexity Accidental vs Essential

Today it is hard to find a team or organization that is not following agile, but building software has not become easy: projects miss schedules, run over budget and still ship flawed.

Why is it so hard to build software?

If you ask this question to any engineer, 90%+ will say requirements, but is that the full truth?

Let's try to decompose software construction. Every feature has 2 important components that decide whether the feature will be successful or not.

  • Essential complication (ec)
  • Accidental complication (ac)


Let's do a functional programming refresher.

Feature = f(Essential Complication) + f(Accidental Complication)

Essential complication comes from the domain: if you are building software for the medical industry, for example, it is complex. Accidental complication is the complexity added by engineers, process & management to build the feature.

Essential complication is hard to reduce because it comes from the domain, but to some extent it is possible to reduce it by using good decomposition techniques. Decomposition is a hard skill and comes from the experience of some failed projects.

Accidental complication can be controlled, but it is not a linear function: accidental complication is not the same in every part of the system and gets more complex over time. It also gives feedback on how bad a job we have done as engineers or as a product team.
Complexity comes in various forms like communication within the team, lack of understanding, difficulty reusing a feature, extending the program to new functions, management problems etc.

Now that we know accidental complication is not linear, let's write the formula again.

Feature = f(Essential Complication) + f(Accidental Complication * Unknown)

Now it becomes clear why something takes many times longer than the estimate or guess. The product owner also plays a part in adding accidental complication by making assumptions about the importance of a feature.

What can be done?
If we need some predictability or consistency in delivery, then we have to continuously work on reducing accidental complexity. Let's look at ways to keep this under control.
  • Using higher level languages.
  • Incremental development by growing the software, not building it.
  • Good buy vs build decisions.
  • Unified programming environment.
  • Rapid prototyping to refine requirements.
  • Listen to design pressure.
  • Test driven development.
  • Stop the "Get it out of the door" mindset.
  • Reduce "surgical strike effort" in delivery.
A very insightful quote from Frederick Brooks: both customer and engineer have to learn what to ask, expect and commit, otherwise the only option is a broken system.

“An omelette, promised in two minutes, may appear to be progressing nicely. But when it has not set in two minutes, the customer has two choices—wait or eat it raw. Software customers have had the same choices. The cook has another choice; he can turn up the heat. The result is often an omelette nothing can save—burned in one part, raw in another.”
― Frederick P. Brooks Jr., The Mythical Man-Month: Essays on Software Engineering


The Mythical Man-Month by Mr Brooks is a must-read for every product owner, project manager and engineer.
