Tuesday, 11 August 2020

Speak the language of the domain

In this post I will share what a DSL is and how we can create our own.

Domain specific languages (DSLs) are specialized languages targeted at a particular domain; they have a smaller vocabulary than general purpose languages. DSLs are not a new concept and have been around for ages. Cascading Style Sheets is one of the most popular ones and we are exposed to it every day; other examples are Excel spreadsheet macros, SQL, build tools like make and Ant, and many Unix tools like awk and sed.

XML can also be categorized as a DSL, but I am not sure how many people will love that!

DSL maps problem domain to solution domain

DSLs are built using a general purpose language and are targeted at a specific domain, such as:

  • Accounting domain
  • Insurance domain
  • Trading domain
  • Testing etc
A DSL is about simplification: it removes friction from solving the problem. A good DSL enables all stakeholders of the product to participate in software development.
 

DSL Categories

Internal


An internal DSL is written in the host language itself and uses the host language compiler to produce an executable that is run by the language runtime.

Internal domain specific language

An internal DSL can also be seen as a library implemented on top of the host language.

External

An external DSL is not written in the host language. It can be written in some scripting language and is then parsed and compiled.

External domain specific language

Building an external DSL is more involved; it needs several new components such as a lexical analyzer, parser, compiler and code generator.

A DSL does not have to be textual; for some domains it makes sense to create a visual DSL.


Let's get started

Enough talking! Show me some DSL.

All the examples shown are based on Java. Java is not the best language for building DSLs, but a lot is possible even with such a non-expressive language.

Rule engine

A rule engine is a good candidate for a DSL because the non-technical stakeholders of a product understand the domain very well, and any simplification of business rule creation helps with collaboration.


This is a general purpose rule engine that allows clients to define a rule and the action to take when the rule is satisfied. In the example for this post the rule engine is used to decide the discount rate for FX transactions.
An important design goal for any DSL is that it must use vocabulary from the domain and it should be expressive. Expressiveness is subjective and different languages support it to different degrees, so the choice of host language plays an important role.
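As a rough sketch, an internal rule DSL for FX discounts could read like the code below. The names (FxTrade, RuleEngine, when, then, evaluate, fxDiscounts) and the discount rates are assumptions made for illustration; they are not the API of the rule engine in the repository.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.function.Function;
import java.util.function.Predicate;

// Hypothetical domain object: an FX trade with a currency pair and a notional amount.
final class FxTrade {
    final String pair;
    final double notional;

    FxTrade(String pair, double notional) {
        this.pair = pair;
        this.notional = notional;
    }
}

// Hypothetical rule engine: rules are checked in declaration order, first match wins.
final class RuleEngine<T, R> {
    private final List<Predicate<T>> conditions = new ArrayList<>();
    private final List<Function<T, R>> actions = new ArrayList<>();

    // "when" opens a rule; the returned step only offers "then",
    // so every rule reads as when(...).then(...).
    RuleStep when(Predicate<T> condition) {
        return new RuleStep(condition);
    }

    final class RuleStep {
        private final Predicate<T> condition;

        private RuleStep(Predicate<T> condition) {
            this.condition = condition;
        }

        RuleEngine<T, R> then(Function<T, R> action) {
            conditions.add(condition);
            actions.add(action);
            return RuleEngine.this;
        }
    }

    // Returns the result of the first matching rule, or empty when no rule matches.
    Optional<R> evaluate(T fact) {
        for (int i = 0; i < conditions.size(); i++) {
            if (conditions.get(i).test(fact)) {
                return Optional.ofNullable(actions.get(i).apply(fact));
            }
        }
        return Optional.empty();
    }

    // Example rule set with made-up discount rates.
    static RuleEngine<FxTrade, Double> fxDiscounts() {
        return new RuleEngine<FxTrade, Double>()
                .when(trade -> trade.notional >= 1_000_000).then(trade -> 0.10)
                .when(trade -> trade.pair.equals("EURUSD")).then(trade -> 0.05);
    }
}
```

The two-step when/then shape is what gives the rules their sentence-like reading; the price is one small helper class per grammar step.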


State machine

A state machine is another general purpose building block that most applications use; many mutations in a system can be modeled as a finite state machine.


This example takes the popular Circuit Breaker design pattern and implements it as a state machine. The state machine domain vocabulary contains states, events and transition rules. In the case of a circuit breaker the valid states are Open, Half Open and Closed, and the events are connect and re-connect.
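A minimal sketch of how such a state machine DSL might look is shown below. The from/on/goTo vocabulary, the class names and the extra FAIL event are assumptions for illustration, not the code from the repository.

```java
import java.util.EnumMap;
import java.util.Map;

enum BreakerState { CLOSED, OPEN, HALF_OPEN }

// CONNECT and RECONNECT come from the text; FAIL is an assumed extra event.
enum BreakerEvent { CONNECT, RECONNECT, FAIL }

final class StateMachine {
    private final Map<BreakerState, Map<BreakerEvent, BreakerState>> transitions =
            new EnumMap<>(BreakerState.class);
    private BreakerState current;

    private StateMachine(BreakerState initial) {
        this.current = initial;
    }

    static StateMachine startingAt(BreakerState initial) {
        return new StateMachine(initial);
    }

    // Starts a transition rule: from(state).on(event).goTo(target).
    Transition from(BreakerState source) {
        return new Transition(source);
    }

    final class Transition {
        private final BreakerState source;
        private BreakerEvent event;

        private Transition(BreakerState source) {
            this.source = source;
        }

        Transition on(BreakerEvent event) {
            this.event = event;
            return this;
        }

        StateMachine goTo(BreakerState target) {
            if (event == null) {
                throw new IllegalStateException("call on(...) before goTo(...)");
            }
            transitions
                    .computeIfAbsent(source, s -> new EnumMap<BreakerEvent, BreakerState>(BreakerEvent.class))
                    .put(event, target);
            return StateMachine.this;
        }
    }

    // Applies an event; events without a rule for the current state are ignored.
    BreakerState fire(BreakerEvent event) {
        Map<BreakerEvent, BreakerState> row = transitions.get(current);
        BreakerState next = (row == null) ? null : row.get(event);
        if (next != null) {
            current = next;
        }
        return current;
    }

    // The circuit breaker expressed in the DSL.
    static StateMachine circuitBreaker() {
        return startingAt(BreakerState.CLOSED)
                .from(BreakerState.CLOSED).on(BreakerEvent.FAIL).goTo(BreakerState.OPEN)
                .from(BreakerState.OPEN).on(BreakerEvent.RECONNECT).goTo(BreakerState.HALF_OPEN)
                .from(BreakerState.HALF_OPEN).on(BreakerEvent.CONNECT).goTo(BreakerState.CLOSED)
                .from(BreakerState.HALF_OPEN).on(BreakerEvent.FAIL).goTo(BreakerState.OPEN);
    }
}
```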


Testing framework

Testing frameworks use DSLs extensively to make tests expressive; some examples of DSL based testing frameworks are Gherkin (BDD), ScalaTest, Jasmine etc.


In the testing domain, specification, scenario and assertions are important concepts.

Trading System

An investment banking system has a complex domain. A DSL in this area pays huge dividends because we can engage domain experts (e.g. traders) while the system is being built.



A trading system will have orders, instruments (e.g. equity, fixed income), order type (buy/sell), price, and full or partial orders.
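To give a feel for that vocabulary, here is a small fluent-builder sketch. The Order class and the buy/sell/of/at/allOrNone/partialAllowed method names are hypothetical and only illustrate how an order could be phrased in trader terms.

```java
// Hypothetical order DSL: Order.buy(100).of("IBM").at(125.50).allOrNone()
final class Order {
    enum Side { BUY, SELL }

    final Side side;
    final int quantity;
    final String instrument;
    final double limitPrice;
    final boolean partialFillAllowed;

    private Order(Builder builder, boolean partialFillAllowed) {
        this.side = builder.side;
        this.quantity = builder.quantity;
        this.instrument = builder.instrument;
        this.limitPrice = builder.limitPrice;
        this.partialFillAllowed = partialFillAllowed;
    }

    // Entry points named after what a trader would say.
    static Builder buy(int quantity) {
        return new Builder(Side.BUY, quantity);
    }

    static Builder sell(int quantity) {
        return new Builder(Side.SELL, quantity);
    }

    static final class Builder {
        private final Side side;
        private final int quantity;
        private String instrument;
        private double limitPrice;

        private Builder(Side side, int quantity) {
            this.side = side;
            this.quantity = quantity;
        }

        Builder of(String instrument) {
            this.instrument = instrument;
            return this;
        }

        Builder at(double limitPrice) {
            this.limitPrice = limitPrice;
            return this;
        }

        // Terminal steps: the order must be filled in full, or may be filled partially.
        Order allOrNone() {
            return new Order(this, false);
        }

        Order partialAllowed() {
            return new Order(this, true);
        }
    }

    @Override
    public String toString() {
        return side + " " + quantity + " " + instrument + " @ " + limitPrice
                + (partialFillAllowed ? " (partial ok)" : " (all or none)");
    }
}
```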



How to build 

Now that we have seen some real examples of DSLs, let's discuss how to build one.

  • Domain - The domain is the core of a DSL; a language that is based on the core concepts of the domain is very important.
  • Implicit context - Remembering the context in which terms are used plays an important role in expressiveness.
  • Well designed abstraction - Users with beginner level knowledge should also be able to use the language.
  • Tools - Having tools like a DSL workbench plays an important role in adoption.

Trade off 

Nothing comes for free, and DSLs also have some trade-offs:

  • Build complexity
  • Learning curve for users of the language
  • Exceptions and error handling are not trivial
  • Performance issues
  • Integration issues
  • Backward compatibility

Conclusion

A DSL has a huge return on investment when the domain is complex.

All the code used in post is available @ github

Friday, 10 July 2020

Data encoding and storage

Data encoding and storage formats are an evolving field. They have seen many changes, from naive text based encodings to advanced, compact, nested binary formats.


Selecting the correct encoding/storage format has a big impact on application performance and on how easily the application can evolve. Data encoding also largely determines whether an application is backward/forward compatible.
Selecting the right encoding format can be one of the important factors in the agility of a data driven application.

Application developers tend to make the default choice of a text based encoding (XML, CSV or JSON) because it is human readable and language agnostic.
Text formats are not very efficient: they cost time and space and also struggle to evolve. If you care about efficiency then a binary format is the way to go.

In this post I will compare text and binary encodings and build a simple persistent storage that supports flexible encoding.

We will compare popular text/binary encodings like CSV, JSON, Avro, Chronicle and SBE.



I will use a Trade object as the example for this comparison.

CSV

It is one of the most popular textual formats. It has no support for types and makes no distinction between different types of numbers. One of the major restrictions is that it only supports scalar types; if we have to store a nested or complex object then custom encoding is required. Column and row values are separated by delimiters, and special handling is required when the delimiter is part of a column value.

The reader application has to parse the text and convert it into the proper type at read time, which produces garbage and is also CPU intensive.

The best thing is that it can be edited in any text editor, and all programming languages can read and write CSV.


JSON

This is what drives the Web today. The majority of user facing microservices use JSON for REST APIs.
It addresses some of the issues with CSV by making a distinction between strings and numbers, and it also supports nested types like maps, arrays and lists. It is possible to have a schema for a JSON message, but this is rarely done in practice because it takes away the flexible schema. It is the new XML these days.
One major drawback is size: a JSON message is larger because it has to keep the key/attribute names as part of the message. I have heard that in some document based databases attribute names take up more than 50% of the space, so be careful when you select attribute names in a JSON document.

Both of these text formats are very popular in spite of all the inefficiency. If you need a frictionless data format as an interface across teams, go for a text based one.

Chronicle/Avro/SBE

These are very popular binary formats and are very efficient for distributed or trading systems.

SBE is very popular in the financial domain and is used as a replacement for the FIX protocol. I shared more about it in the post inside-simple-binary-encoding-sbe.

Avro is also very popular, and it was built by taking lots of lessons from Protocol Buffers and Thrift. For row based and nested storage it is a very good choice. It supports multiple languages. Avro applies some cool encoding tricks to reduce the size of a message; you can read about them in the post integer-encoding-magic.
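One of those tricks, as described in the Avro specification, is writing int and long values with variable-length zig-zag coding, so values of small magnitude take a single byte. The sketch below is my own illustration of that idea for 32-bit ints (the class and method names are assumptions); it is not Avro's implementation.

```java
import java.io.ByteArrayOutputStream;

final class ZigZagVarint {
    private ZigZagVarint() {
    }

    static byte[] encode(int value) {
        // Zig-zag: maps 0, -1, 1, -2, 2, ... to 0, 1, 2, 3, 4, ...
        // so small negative numbers also become small unsigned numbers.
        int zigzag = (value << 1) ^ (value >> 31);
        ByteArrayOutputStream out = new ByteArrayOutputStream(5);
        // Varint: 7 payload bits per byte, high bit set means "more bytes follow".
        while ((zigzag & ~0x7F) != 0) {
            out.write((zigzag & 0x7F) | 0x80);
            zigzag >>>= 7;
        }
        out.write(zigzag);
        return out.toByteArray();
    }

    static int decode(byte[] bytes) {
        int zigzag = 0;
        int shift = 0;
        for (byte b : bytes) {
            zigzag |= (b & 0x7F) << shift;
            shift += 7;
            if ((b & 0x80) == 0) {
                break; // last byte of this number
            }
        }
        // Undo the zig-zag mapping.
        return (zigzag >>> 1) ^ -(zigzag & 1);
    }
}
```

With this scheme 1 and -1 each take one byte, while a fixed-width int always takes four; very large values pay for it with a fifth byte.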

Chronicle-Wire is picking up, and I came across it only recently. It has a nice abstraction over text and binary messages with a single unified interface. The library allows choosing a different encoding based on the use case.


Let's look at some numbers now. This is a very basic comparison of message size only. Run your own benchmark before making any selection.





We will save 2 sample records in the different formats and compare the size.


Chronicle is the most efficient in this example. I used the RawWire format, which is the most compact option available in the library because it only stores data; no schema metadata is stored.

Next come Avro and SBE, which are very close in terms of size, but SBE is more efficient in terms of encoding/decoding operations.

CSV is not that bad; it took 57 bytes for a single row, but don't select CSV based on size. As expected, JSON takes more bytes to represent the same message: around 2X more than Chronicle.
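To make the size gap concrete without pulling in any library, here is a rough sketch that encodes one made-up record as CSV, as JSON and as a hand-rolled fixed-width binary layout. The field names and values are assumptions (this is not the Trade class from the repository), and the binary layout is not Chronicle, Avro or SBE.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class TradeSizes {
    // Hypothetical record values, chosen only for illustration.
    static final long TRADE_ID = 1596843000123L;
    static final String SYMBOL = "GOOGL";
    static final int QUANTITY = 1_500_000;
    static final double PRICE = 1523.4575;

    private TradeSizes() {
    }

    // Values only, separated by commas.
    static byte[] asCsv() {
        String row = TRADE_ID + "," + SYMBOL + "," + QUANTITY + "," + PRICE;
        return row.getBytes(StandardCharsets.UTF_8);
    }

    // Same values, but every attribute name travels with the message.
    static byte[] asJson() {
        String json = "{\"tradeId\":" + TRADE_ID
                + ",\"symbol\":\"" + SYMBOL + "\""
                + ",\"quantity\":" + QUANTITY
                + ",\"price\":" + PRICE + "}";
        return json.getBytes(StandardCharsets.UTF_8);
    }

    // Fixed-width numbers plus a 1-byte length prefix for the symbol.
    static byte[] asBinary() {
        byte[] symbol = SYMBOL.getBytes(StandardCharsets.US_ASCII);
        ByteBuffer buffer = ByteBuffer.allocate(
                Long.BYTES + Integer.BYTES + Double.BYTES + 1 + symbol.length);
        buffer.putLong(TRADE_ID)
                .putInt(QUANTITY)
                .putDouble(PRICE)
                .put((byte) symbol.length)
                .put(symbol);
        return buffer.array();
    }
}
```

Note that a fixed-width layout is not automatically smaller: for tiny values such as a trade id of 1, the CSV text would be shorter, which is exactly the gap that variable-length encodings close.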

Let's look at some real applications of these encodings. They can be used for building logs, queues, block storage, RPC messages etc.

To explore more, I created a simple storage library that is backed by a file and allows specifying different encoding formats.

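import java.io.Closeable;

public interface RecordContainer<T> extends Closeable {

    // Appends a message at the end of the buffer.
    boolean append(T message);

    // Reads records from the start of the buffer.
    void read(RecordConsumer<T> reader);

    // Reads records starting from the given message offset.
    void read(long offSet, RecordConsumer<T> reader);

    default void close() {
    }

    int size();

    String formatName();

}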

This implementation allows appending records at the end of the buffer and accessing the buffer from the start or randomly from a given message offset. It can be seen as an append only unbounded message queue, and it has some similarity with Kafka topic storage.

RandomAccessFile in Java allows mapping the file content into memory as a buffer (through its file channel), and after that the file content can be managed like an array.
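A minimal sketch of that idea, assuming a length-prefixed record layout and a fixed capacity, is shown below. MappedLog and its methods are hypothetical names; the RecordContainer implementations in the repository may be structured quite differently.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;

final class MappedLog implements AutoCloseable {
    private final RandomAccessFile file;
    private final MappedByteBuffer buffer;
    private int writePosition;

    private MappedLog(Path path, int capacityBytes) throws IOException {
        this.file = new RandomAccessFile(path.toFile(), "rw");
        // Mapping a region beyond the current file length grows the file.
        this.buffer = file.getChannel().map(FileChannel.MapMode.READ_WRITE, 0, capacityBytes);
    }

    // Creates a log backed by a temporary file (assumption: handy for experiments only).
    static MappedLog createTemp(int capacityBytes) {
        try {
            Path path = Files.createTempFile("mapped-log", ".dat");
            path.toFile().deleteOnExit();
            return new MappedLog(path, capacityBytes);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Appends [length][payload] and returns the record offset, or -1 when the log is full.
    long append(byte[] record) {
        int needed = Integer.BYTES + record.length;
        if (writePosition + needed > buffer.capacity()) {
            return -1;
        }
        int offset = writePosition;
        ByteBuffer view = buffer.duplicate();
        view.position(offset);
        view.putInt(record.length).put(record);
        writePosition = offset + needed;
        return offset;
    }

    // Random access read of the record that starts at the given offset.
    byte[] read(long offset) {
        ByteBuffer view = buffer.duplicate();
        view.position((int) offset);
        byte[] record = new byte[view.getInt()];
        view.get(record);
        return record;
    }

    @Override
    public void close() {
        try {
            file.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The payload is just bytes, so any of the encodings discussed above can be plugged in on top of it.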

All the code used in this post is available @ encoding github


 

Sunday, 28 June 2020

Ship your function

Nowadays function as a service (FaaS) is trending in the serverless area. It enables a new opportunity: a function can be sent to a server on the fly and it starts executing immediately.

Code as data as code.

This helps in building applications that adapt to changing user needs very quickly.
Function_as_a_service is a popular offering from cloud providers like Amazon, Microsoft, Google etc.

FaaS has a lot of similarity with the Actor model, which is about sending messages to actors that then perform a local action. If code can also be treated like data, then code can be sent to a remote process, which can execute the function locally.

I remember Joe Armstrong talking about how, while he was building Erlang, he used to send a function to a server to turn it into an HTTP server, an SMTP server etc. He was doing this in 1986!

Let's look at how we can save an executable function and execute it later.
I will use Java as an example, but it can be done in any language that allows dynamic linking. JavaScript is definitely a winner at dynamic linking.

Quick revision
  Let's have a quick look at functions/behavior in Java.


There is not much to explain in the above code; it is a very basic transformation.

Save function
Let's try to save one of these functions and see what happens.


The above code looks perfect, but it fails at runtime with the error below.

java.io.NotSerializableException: faas.FunctionTest$$Lambda$266/1859039536
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1184)
    at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
    at faas.FunctionTest.save_function(FunctionTest.java:39)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

Lambda functions are not serializable by default.
Java has a nice trick that uses a cast expression to add an additional bound; more details are available at Cast Expressions.

In a nutshell, it looks something like the code below.
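As a sketch of the trick: the intersection cast (Function<Integer, Integer> & Serializable) makes the compiler generate a serializable lambda, while the same lambda without the cast fails to serialize. FunctionStore and its helper methods are hypothetical names used only for this illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.function.Function;

final class FunctionStore {
    private FunctionStore() {
    }

    // Serializes any object (here: a function) to bytes.
    static byte[] toBytes(Object function) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(function);
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            // NotSerializableException is an IOException and ends up here.
            throw new UncheckedIOException(e);
        }
    }

    // Restores an object from bytes; the caller picks the expected type.
    @SuppressWarnings("unchecked")
    static <T> T fromBytes(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return (T) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    // Plain lambda: not serializable.
    static Function<Integer, Integer> plainDoubler() {
        return x -> x * 2;
    }

    // Same lambda with an intersection cast: serializable.
    static Function<Integer, Integer> serializableDoubler() {
        return (Function<Integer, Integer> & Serializable) x -> x * 2;
    }
}
```

Restoring works here because the class that defined the lambda is on the classpath of the reader; shipping a function to another process needs the same classes available there.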


This technique allows converting any functional interface to bytes and reusing it later. It is used in the JDK in various places, for example around TreeMap/TreeSet, because these data structures take a comparator as a function and also support serialization.

With the basic thing working, let's try to build something more useful.

We have to hide the & Serializable magic to make the code more readable. This can be achieved with a functional interface that extends the base interface and just adds Serializable; it looks something like the code below.


Once we take care of the boilerplate, it becomes very easy to write functions that are serialization ready.
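A minimal sketch of that boilerplate, assuming hypothetical names (SerializableFunction, SerializablePredicate, Transformation): once the parameter types extend Serializable, plain lambdas passed to them are serializable without any cast.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

// Base interfaces plus Serializable; no new methods are added.
interface SerializableFunction<T, R> extends Function<T, R>, Serializable {
}

interface SerializablePredicate<T> extends Predicate<T>, Serializable {
}

// A tiny filter-then-map transformation that can be saved and restored as a whole.
final class Transformation<T, R> implements Serializable {
    private static final long serialVersionUID = 1L;

    private final SerializablePredicate<T> filter;
    private final SerializableFunction<T, R> mapper;

    Transformation(SerializablePredicate<T> filter, SerializableFunction<T, R> mapper) {
        this.filter = filter;
        this.mapper = mapper;
    }

    List<R> apply(List<T> input) {
        List<R> result = new ArrayList<>();
        for (T item : input) {
            if (filter.test(item)) {
                result.add(mapper.apply(item));
            }
        }
        return result;
    }

    // Round trip through Java serialization, standing in for "ship it to a server".
    @SuppressWarnings("unchecked")
    Transformation<T, R> shipAndRestore() {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(this);
            }
            try (ObjectInputStream in =
                         new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (Transformation<T, R>) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    // Example: keep even numbers and label them. No cast is needed on the lambdas.
    static Transformation<Integer, String> evenLabels() {
        return new Transformation<Integer, String>(n -> n % 2 == 0, n -> "even-" + n);
    }
}
```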



With the above building blocks we can save a full transformation (map/filter/reduce/collect etc.) and ship it to a server for processing. This also allows building computations that can be recomputed if required.

Spark is a distributed processing engine that uses this type of pattern: it persists the transformation functions and uses them to run the computation on multiple nodes.

So the next time you want to build a distributed processing framework, look into this pattern. Or, if you want to take it to the extreme, send a patched function to a live server in production to fix an issue.

Code used in this post is available @ faas

Tuesday, 23 June 2020

Bit fiddling every programmer should know

Bit fiddling looks like magic; it allows doing so many things in a very efficient way.
In this post I will share some real world examples where bit operations can be used to gain good performance.

Bitwise operation bootcamp
Bitwise operators include:
 - AND ( & )
 - OR ( | )
 - NOT ( ~ )
 - XOR ( ^ )
 - Shifts ( <<, >> )

Wikipedia has a good high level overview of Bitwise_operation. While preparing for this post I wrote learning tests, and they are available in the learningtest github project. A learning test is a good way to explore anything before you start a deep dive. I plan to write a detailed post on learning tests later.

In these examples I will use the bit tricks below as building blocks for solving more complex problems.
  • countBits - Count the number of 1 bits in a binary value
  • bitParity - Compute the check bit added to a binary code
  • set/clear/toggle - Manipulate a single bit
  • pow2 - Find the next power of 2 and use it as a mask

Code for these functions is available @ Bits.java on github and the unit tests are available @ BitsTest.java
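For readers who want something to experiment with straight away, here is a small sketch of these building blocks. It is my own illustration with assumed method names, not a copy of Bits.java.

```java
final class BitTricks {
    private BitTricks() {
    }

    // Counts 1 bits; each iteration clears the lowest set bit.
    static int countBits(int value) {
        int count = 0;
        while (value != 0) {
            value &= value - 1;
            count++;
        }
        return count;
    }

    // 1 when the number of 1 bits is odd, 0 when it is even.
    static int parity(int value) {
        return countBits(value) & 1;
    }

    static int set(int value, int bit) {
        return value | (1 << bit);
    }

    static int clear(int value, int bit) {
        return value & ~(1 << bit);
    }

    static int toggle(int value, int bit) {
        return value ^ (1 << bit);
    }

    static boolean isSet(int value, int bit) {
        return (value & (1 << bit)) != 0;
    }

    // Smallest power of 2 that is >= value (valid for value up to 2^30).
    static int nextPowerOfTwo(int value) {
        if (value <= 1) {
            return 1;
        }
        return Integer.highestOneBit(value - 1) << 1;
    }
}
```

The JDK already ships Integer.bitCount and Integer.highestOneBit; the hand-written countBits is here only to show the trick.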

Let's look at some real world problems now.

Customer daily active tracking
E-commerce companies keep important metrics such as the days on which a customer was active or did some business. These metrics become very important for building models that can be used to improve customer engagement. They are also useful for fraud or risk related use cases.
Investment banks also use such metrics on stocks/currencies for building trading models etc.

Using simple bit manipulation tricks, one month of daily activity (up to 31 days) can be packed into only 4 bytes, so storing a whole year needs only 48 bytes (12 months x 4 bytes).

Code snippet


Apart from compact storage, this pattern has good data locality because the whole thing can be read by the processor using a single load operation.
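A minimal sketch of the packing, assuming one int per month where bit (day - 1) marks an active day. MonthlyActivity is a hypothetical class name used for illustration.

```java
final class MonthlyActivity {
    // Bit 0 is day 1, bit 30 is day 31; a set bit means "active on that day".
    private int days;

    void markActive(int dayOfMonth) {
        days |= maskFor(dayOfMonth);
    }

    boolean wasActive(int dayOfMonth) {
        return (days & maskFor(dayOfMonth)) != 0;
    }

    // Number of active days in the month.
    int activeDayCount() {
        return Integer.bitCount(days);
    }

    // The 4-byte value that would be stored.
    int packed() {
        return days;
    }

    private static int maskFor(int dayOfMonth) {
        if (dayOfMonth < 1 || dayOfMonth > 31) {
            throw new IllegalArgumentException("day of month must be 1..31: " + dayOfMonth);
        }
        return 1 << (dayOfMonth - 1);
    }
}
```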

Transmission errors
This is another area where bit manipulation shines. Imagine you are building distributed storage block management software or a file transfer service. One of the things such a service requires is making sure that the transfer was done properly and no data was lost during transmission. This can be done using the bit parity (odd or even) technique, which adds a check bit so that the number of '1' bits is always odd or always even.
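As a sketch of even parity over a payload: the sender computes one check bit, and the receiver recomputes it and compares. ParityCheck and its method names are assumptions for illustration.

```java
final class ParityCheck {
    private ParityCheck() {
    }

    // Even parity: the check bit is chosen so that payload + check bit
    // together contain an even number of 1 bits.
    static int evenParityBit(byte[] payload) {
        int ones = 0;
        for (byte b : payload) {
            ones += Integer.bitCount(b & 0xFF);
        }
        return ones & 1;
    }

    // Receiver side: recompute and compare with the transmitted check bit.
    static boolean verify(byte[] payload, int parityBit) {
        return evenParityBit(payload) == parityBit;
    }
}
```

A single parity bit only detects an odd number of flipped bits; two flips cancel each other out, which is why real protocols usually layer stronger checks on top.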


Another way to do this type of verification is Hamming_distance, the number of bit positions in which two values differ. Code snippet for Hamming distance of integer values.
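A sketch of the idea for ints: XOR leaves a 1 exactly where the two values differ, so counting the 1 bits of the XOR gives the distance. The class name is an assumption.

```java
final class Hamming {
    private Hamming() {
    }

    // Number of bit positions in which a and b differ.
    static int distance(int a, int b) {
        return Integer.bitCount(a ^ b);
    }
}
```

A distance of 0 between the value sent and the value received means no bit was flipped.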



A very useful way to check data integrity with very little overhead.
Locks
Let's get into concurrency now. Locks are generally not good for performance, but sometimes we have to use them. Many lock implementations are very heavyweight and also hard to share between programs. In this example we will build a memory efficient lock: 32 locks can be managed using a single Integer.

Code snippet

This example uses the single bit setting trick along with AtomicInteger to make the code thread safe.
This is a very lightweight lock. As the example is related to concurrency, it will have some issues due to false sharing; these can be addressed with some of the techniques mentioned in the scalable-counters-for-multi-core post.
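A minimal sketch of 32 try-locks in one AtomicInteger, using a compare-and-set loop to flip a single bit. BitLocks is a hypothetical name, and the sketch has no ownership tracking or blocking: any thread can unlock, and tryLock simply returns false when the bit is already taken.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class BitLocks {
    // Bit i set means lock i is currently held.
    private final AtomicInteger bits = new AtomicInteger();

    boolean tryLock(int index) {
        int mask = maskFor(index);
        while (true) {
            int current = bits.get();
            if ((current & mask) != 0) {
                return false; // already held by someone
            }
            if (bits.compareAndSet(current, current | mask)) {
                return true;
            }
            // Another thread changed a different bit in between; retry.
        }
    }

    void unlock(int index) {
        int mask = maskFor(index);
        int current;
        do {
            current = bits.get();
        } while (!bits.compareAndSet(current, current & ~mask));
    }

    boolean isLocked(int index) {
        return (bits.get() & maskFor(index)) != 0;
    }

    private static int maskFor(int index) {
        if (index < 0 || index > 31) {
            throw new IllegalArgumentException("lock index must be 0..31: " + index);
        }
        return 1 << index;
    }
}
```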

Fault tolerant disk
Let's get into some serious stuff. Assume we have 2 disks and we want to keep a copy of the data so that we can restore it in case one of the disks fails. The naive way of doing this is to keep a backup copy of every disk, so if you have 1 TB then an additional 1 TB is required. Cloud providers like Amazon will be very happy if you use such an approach.
Just by using the XOR (^) operator we can keep the backup for a pair of disks on a single disk: the backup disk stores disk1 ^ disk2, and a lost disk is rebuilt by XORing the surviving disk with the backup. That is a 50% saving on backup storage expense.

Code snippet testing restore logic.

Disk code is available @ RaidDisk.java
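The restore logic can be sketched on plain byte arrays as follows. XorParity is a hypothetical helper for illustration and is not the RaidDisk.java implementation.

```java
final class XorParity {
    private XorParity() {
    }

    // Backup block = a XOR b, byte by byte.
    static byte[] parity(byte[] a, byte[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("blocks must have the same length");
        }
        byte[] result = new byte[a.length];
        for (int i = 0; i < a.length; i++) {
            result[i] = (byte) (a[i] ^ b[i]);
        }
        return result;
    }

    // Because (a ^ b) ^ b == a, the lost block is survivor XOR backup.
    static byte[] recover(byte[] survivor, byte[] backup) {
        return parity(survivor, backup);
    }
}
```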

Ring buffer
A ring buffer is a very popular data structure for async processing and for buffering events before writing to a slow device. A ring buffer is bounded, which helps in having a zero allocation buffer in the critical execution path, so it is a very good fit for low latency programming.
One of the common operations is finding the slot in the buffer for a write/read, and it is usually done with the mod (%) operator. The mod or divide operator is not good for performance because it stalls execution: the CPU has only 1 or 2 ports for processing divides, but it has many ports for bitwise operations.

In this example we will use a bitwise operator to compute mod; this is only possible if the divisor is a power of 2. I think it is one of the tricks that everyone should know.

x & (n-1)

If n is a power of 2 then 'x & (n-1)' gives the same result as 'x % n' for non-negative x, using a single AND instruction. This is so popular that it is used in many places; the JDK HashMap also uses it to find the slot in the map.
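A minimal single-threaded sketch of a ring buffer that rounds its capacity up to a power of 2 and uses the mask instead of %. The RingBuffer class below is an assumption for illustration; a concurrent version would need proper memory ordering on the sequences.

```java
final class RingBuffer<T> {
    private final Object[] slots;
    private final int mask;        // capacity - 1, valid because capacity is a power of 2
    private long writeSequence;
    private long readSequence;

    RingBuffer(int requestedCapacity) {
        if (requestedCapacity < 1 || requestedCapacity > (1 << 30)) {
            throw new IllegalArgumentException("capacity must be 1..2^30");
        }
        // Round up to the next power of 2 so that the mask trick works.
        int capacity = requestedCapacity == 1
                ? 1
                : Integer.highestOneBit(requestedCapacity - 1) << 1;
        this.slots = new Object[capacity];
        this.mask = capacity - 1;
    }

    int capacity() {
        return slots.length;
    }

    // Returns false when the buffer is full.
    boolean offer(T value) {
        if (writeSequence - readSequence == slots.length) {
            return false;
        }
        slots[(int) (writeSequence++ & mask)] = value; // same slot as sequence % capacity
        return true;
    }

    // Returns null when the buffer is empty.
    @SuppressWarnings("unchecked")
    T poll() {
        if (readSequence == writeSequence) {
            return null;
        }
        int index = (int) (readSequence++ & mask);
        T value = (T) slots[index];
        slots[index] = null;
        return value;
    }
}
```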



Conclusion
I have shared, at a very high level, what is possible with simple bit manipulation techniques.
Bit manipulation enables many innovative ways of solving problems. It is always good to have extra tools in the programmer's kit, and many of these tricks are timeless and applicable to every programming language.

All the code used in this post is available @ bits repo.