
Saturday, 10 November 2018

SQL is Stream


The Stream API in almost any language looks a lot like writing SQL.

map is Select columns
filter is Where
count is Count(1)
limit is LIMIT X
collect is fetching all the results on the client side

So it is very easy to map each function of the Streams API to some part of SQL.

Object relational mapping frameworks (Hibernate, MyBatis, JPA, TopLink, ActiveRecord etc.) give a good abstraction over SQL, but they add a lot of overhead, don't give much control over how the SQL is built, and many times you have to write native SQL anyway.


ORM never made writing SQL easy, and if you don't trust me then take a quick look at how the code ends up looking.

Sometimes I feel that engineers are writing more annotations than real algorithm!
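A typical JPA-style entity looks roughly like this (a minimal sketch using standard javax.persistence annotations; the entity and column names are made up for illustration):

import javax.persistence.*;

@Entity
@Table(name = "stocks_price")
public class StocksPrice {

    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Long id;

    @Column(name = "stock_symbol", nullable = false, length = 10)
    private String stockSymbol;

    @Column(name = "open_price")
    private Double openPrice;

    @Column(name = "volume")
    private Long volume;

    // getters and setters omitted - mostly boilerplate, hardly any algorithm
}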

To implement any feature we keep switching between the SQL API and the non-SQL API; this makes code hard to maintain, and many times it is not optimal either.

This problem can be solved by a library that is based on the Streams API and generates SQL; then we don't have to switch, and it becomes a unified programming experience.

With such a library testing also becomes easy, because the source of the stream can be swapped as needed: in the real environment it is the database, and in a test it is an in-memory data structure, as in the sketch below.
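For example (the StocksPrice constructor is made up here, and I am assuming the Where.GT predicate can be reused on a plain in-memory stream; if not, a lambda like r -> r.getVolume() > 1467200 does the same job in the test):

// Production: the stream is backed by the database table
long prodCount = stocksTable.stream()
                .filter(Where.GT("volume", 1467200))
                .count();

// Test: same pipeline, but the source is a plain in-memory list
List<StocksPrice> fixture = Arrays.asList(
                new StocksPrice("GOOG", 1108.5d, 1467300L),
                new StocksPrice("MSFT", 105.2d, 900000L));
long testCount = fixture.stream()
                .filter(Where.GT("volume", 1467200))
                .count();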

In this post I will share a toy example of how such a library could look.

Code Snippet

Stream<StocksPrice> rows = stocksTable.stream();
long count = rows
                .filter(Where.GT("volume", 1467200))
                .filter(Where.GT("open_price", 1108d))
                .count();

The above code generates:
Select Count(1) From stocks_price where volume > 1467200 AND open_price > 1108

Look at another example, with limit:

stocksTable.stream()
                .filter(Where.GT("volume", 1467200))
                .filter(Where.GT("open_price", 1108d))
                .limit(2)
                .collect(Collectors.toList());

Select stock_symbol,open_price,high_price,trade_date FROM stocks_price WHERE volume > 1467200 AND open_price > 1108.0 LIMIT 2

These APIs can also use code generation to give compile-time safety, like checking column names, types etc.
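For example, a generated companion class could expose typed column constants, roughly like this (StocksPriceColumns and its gt method are hypothetical, not part of the toy library above):

long count = stocksTable.stream()
                .filter(StocksPriceColumns.VOLUME.gt(1467200L))      // wrong column name or type = compile error
                .filter(StocksPriceColumns.OPEN_PRICE.gt(1108d))
                .count();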

Benefits 

The Streams API also gives some other benefits, like:
 - Parallel execution.
 - Joins between database data and non-database data can easily be done using map (see the sketch after this list).
 - Allows a pure streaming approach, which is good when dealing with huge data.
 - Opens up the option of generating native optimized queries, because multiple phases of the pipeline can be merged.
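For instance, each database row can be enriched with data from an in-memory map (the getStockSymbol getter and the sectorBySymbol map are made up for this example):

Map<String, String> sectorBySymbol = new HashMap<>();
sectorBySymbol.put("GOOG", "Tech");
sectorBySymbol.put("XOM", "Energy");

// Join: database rows + in-memory reference data
List<String> symbolWithSector = stocksTable.stream()
                .filter(Where.GT("volume", 1467200))
                .map(row -> row.getStockSymbol() + " / "
                        + sectorBySymbol.getOrDefault(row.getStockSymbol(), "Unknown"))
                .collect(Collectors.toList());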

This programming model is not new; it is very common in distributed computing frameworks like Spark, Kafka, Flink etc.

Spark Dataset is based on this approach: it generates optimized queries by pushing filters down to storage, reducing reads by looking at partitions, reading only the selected columns etc.

Conclusion

Database drivers must give a stream-based API, and this will help in reducing the dependency on ORM frameworks.
This is a very powerful programming model and opens up lots of options.

Code used in this post is available @ streams github repo.

Thursday, 24 September 2015

Performance tuning story

Recently I was doing performance tuning of an application's startup time; it was taking close to 30 minutes, and a cold restart was a real pain.

In this blog I will share the story of this performance tuning experience.

Current state of application
The application loads data from the database at startup time and keeps it in memory for fast response times, and to make things interesting all the loading happens using multiple threads :-)

Current loading logic is described below.

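In outline it was something like this (the Record type, the lock and the helper method names are illustrative, not the real application code):

private final ReentrantLock cacheLock = new ReentrantLock();
private final Map<Long, Record> cache = new HashMap<>();

// Each loader thread runs this for its share of record ids
void loadRecords(List<Long> ids) {
    for (Long id : ids) {
        cacheLock.lock();                              // lock taken per record
        try {
            Record record = queryRecordFromDb(id);     // one database query per record, inside the lock
            cache.put(id, record);                     // cache updated under the same lock
        } finally {
            cacheLock.unlock();
        }
    }
}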

By looking at the above logic it is clear that the lock and the per-record database query must be causing the problem, and profiling confirmed that.
The code went through a couple of rounds of improvement before it reached acceptable timing.

Round 1 - Remove nested query

In this round the per-record database queries were replaced with one query that brings all the data required for the records, and then the per-record requests were served from that master data set.

So after that change the code looked something like this.

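Roughly (again with illustrative names; the real code differs):

void loadRecords(List<Long> ids) {
    cacheLock.lock();                                  // lock still taken up front, as before
    try {
        // One query brings all the data required for these records
        Map<Long, Record> masterData = queryAllRecordsFromDb(ids);
        for (Long id : ids) {
            cache.put(id, masterData.get(id));         // per-record requests served from the master data set
        }
    } finally {
        cacheLock.unlock();
    }
}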

This gave a 30% improvement; that was a good starting point, with the small trade-off of some extra transient memory.

Round 2 - Reduce scope of lock

Since this code is multi-threaded, this time profiling showed a hotspot on the lock, and the way to avoid that is to either remove the lock or reduce its scope.

The scope of the lock was reduced, and this allowed the logic to be broken into 2 steps:

 - Read from database
 - Update cache

Earlier the database query was done after the lock was acquired; with the new approach that changed, and all parallel requests could query the database with no contention on the cache.

The code looked something like this.

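Roughly, with the database read pulled out of the lock (names still illustrative):

void loadRecords(List<Long> ids) {
    // Step 1 - read from database, now outside the lock:
    // parallel loader threads hit the database with no contention on the cache
    Map<Long, Record> masterData = queryAllRecordsFromDb(ids);

    // Step 2 - update cache, lock held only for the write
    cacheLock.lock();
    try {
        for (Long id : ids) {
            cache.put(id, masterData.get(id));
        }
    } finally {
        cacheLock.unlock();
    }
}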

This gave another 40% gain, with a little more trade-off of transient memory; but memory was not an issue, because an Oracle ResultSet only releases memory after it is closed, so memory-wise there is no significant difference.

A 70% improvement was great, but there was scope for more, so one more improvement was done to make it faster.

Round 3 - Single Writer Batch update
Now all the bottleneck was on the "write to cache" step because of multiple writers, and it was reduced by using a single writer doing batch updates to the cache.

The DB query readers & the cache writer were connected using a queue; after this change the code looked something like this.

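Something along these lines (resultQueue is a BlockingQueue and loadingInProgress is a flag; both, like the other names, are illustrative):

// Reader threads: query the database and hand the results to the queue
void loadRecords(List<Long> ids) {
    Map<Long, Record> masterData = queryAllRecordsFromDb(ids);
    resultQueue.addAll(masterData.entrySet());
}

// Single writer thread: drains the queue and updates the cache in batches
void cacheWriterLoop() throws InterruptedException {
    List<Map.Entry<Long, Record>> batch = new ArrayList<>();
    while (loadingInProgress) {
        batch.add(resultQueue.take());                 // wait for at least one entry
        resultQueue.drainTo(batch);                    // grab whatever else is already queued
        cacheLock.lock();                              // lock taken once per batch, not per record
        try {
            for (Map.Entry<Long, Record> e : batch) {
                cache.put(e.getKey(), e.getValue());
            }
        } finally {
            cacheLock.unlock();
        }
        batch.clear();
    }
}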

Now the lock was acquired only a few times and the maximum amount of data was written under each acquisition; this gave around a 25% gain.

With the above improvements the startup time was improved by 95%, and that was enough to stop further experiments.

Conclusion
 - Avoid making lots of small queries to the database in a loop.
 - Never do I/O or network calls while holding a lock.
 - Reduce the scope of locks.
 - Batch expensive operations.