
Sunday, 8 September 2019

Tracer bullet software development

Software development methodology is evolving very fast, and every team has found a version that works well in its current context.

Software development methodology is going through continuous improvement.

The Pragmatic Programmer book talks about the Tracer Bullet approach, and it comes down to feedback: the more quickly you get feedback, the less change is required to hit the target.

I have used the Tracer Bullet approach throughout my career with good success.

In this post I will share another approach, or mindset, of software development that can be used in some scenarios to build a better product and create value.

Explore or Discovery 

Many times, as a team, you have to find the next feature to build that will help increase adoption of the product.
I call this the Explore or Discovery phase, and it has very different goals and trade-offs. In this phase you have to move fast and cut some corners to get feedback. It will not have enough tests or documentation, and code quality will not be good.

The important thing about this phase is that you are actively collecting feedback on whether this is the next big idea your team will invest in.
You have to make sure you gain more than you lose, so timebox the discovery phase to keep track of the resources it consumes.

A successful discovery puts you in the Growth phase, and in many cases this will be steep growth. Every unsuccessful outcome still carries lots of signal for the next exploration.

You build to learn in this phase. Once you have learned, move to the next phase or start another discovery.

Growth Phase

This phase is the outcome of a successful discovery: you have found the next feature that the market or your target audience needs.
The trade-offs for this phase are very different from the Explore phase. You have to stabilize the feature and make changes based on feedback, so that users who have shown interest in the idea stay engaged. Users who have been on-boarded act like your sponsors, so keep them in the loop.

An interesting thing about this phase is that your team will be in a war-room type of situation, with everyone trying to get over hurdles and get the feature out.

A word of caution: this phase is very intense and demanding. It is the real "Sprint" phase, not the agile sprint! Putting in extra hours has good returns.
Another thing to watch out for is persistence in exploiting the new feature to the maximum; many times the team drops the ball and goes back to the discovery phase.

This phase also puts design pressure on the team; you can refer to the design-pressure-on-engineering-team post, which talks about it.

The successful outcome of this phase is the Expand phase. The team is exhausted after this phase but very motivated.

Expand Phase

Welcome to the phase that requires building software the way we learn in textbooks. This is the phase where engineering discipline is very important, because the solution has to be scalable, maintainable, reliable and so on.
Now you can go to management and ask for more funds to get servers, expand the team and so on, because this idea will generate some profit.

Don't build the product at this stage the way you did in the discovery phase, otherwise you will become a victim of your own success.

Conclusion
All of these phases have very different constraints and trade-offs, and it is very important to know that. The Expand phase can't be managed like the Explore phase, or vice versa. I have also found that the engineering disciplines are very different in each phase, so you need a team that is aligned with the mindset of the phase. If you get your team wrong, it can be really challenging to execute the phase.

It is fascinating to see that our industry keeps evolving and new development methodologies keep being found.

It is much more than Agile or Scrum.

I want to end with a quote:
As an engineering team we should do continuous exploration, and exploration is not a linear process.

Some other posts on software development that you might find interesting:

need-driven-software-development-using
broken-promise-of-agile
cargo-cult-innovation-center

I will be happy to learn about new ways of building software, so please share them!

If you like the post then you can follow me on twitter.

Saturday, 24 August 2019

frontend vs backend development

I am sure you have got into a discussion of frontend vs backend development, or met engineers who want to do only one type of development.



I have been a passionate engineer for more than a decade and have had the opportunity to see the full spectrum. I have done development at the extreme end of both sides.

In this post I will share some things about front-end engineering that are hidden and not discussed openly.

Let's start.

Back-end 
In this part of the world everyone understands that it is about data structures, algorithms, writing bare-metal code, handling scale and so on.


Front-end
Most of us think this part of the world is only about building an engaging and secure user interface, but I will say that is just one part of it.
As a frontend developer you are exposed to many engineering challenges. Let's discuss them.

Getting started
For any backend task you can write some quick code, or copy it from Stack Overflow, and run it straight from the IDE.

On the frontend, getting the code is just a small part. After that you need a web server or container to host the code, then you need to pick the browser or client you want to use, and only then does the code run.

This is just a small example, but it gives an idea of the extra steps required to see your code running.

Synchronous vs Asynchronous 
On a backend system you can choose whether code is sync or async, but on the frontend most operations have to be asynchronous, otherwise the end-user experience is very bad.

On the frontend you are exposed to this on day one, but on the backend it can take months or years before you get to a state where you start thinking about sync vs async.

If you have done any concurrent programming then you know how hard it is to coordinate async tasks.

Distributed Computing
Distributed computing has become so common that it is hard to think of a system that does not make distributed calls.

As a backend developer you are shielded from distributed computing, or exposed to it late, but on the frontend every call to get data is a remote call, so you have to be aware of the failures that can happen when a remote call is made and the slowness it adds.

Proper error handling is often treated as optional in backend systems, but on the frontend it is not optional, because the user will notice and complain.

The frontend is the last gate, so it has to handle all the errors that are thrown or suppressed by backend systems so that the end-user experience is smooth.

You experience distributed computing very early once you are on the frontend side.

Network 
We read in many textbooks that the "network is not reliable". As a backend engineer you don't get directly exposed to this, because libraries and frameworks handle it for you, but on the frontend you get first-hand experience of dealing with it and have to come up with a strategy.

All backend applications benefit from a fast network ("100 GBs network") because they run in a data center, but for a frontend application the data center is the end-user device, which will be a browser or handheld device.
Many times the network is dial-up speed (i.e. KBs) and the application has to work on a slow network.

Compute and In-memory 
This one is interesting. When a backend program is slow, the first option is to increase compute or memory, because elastic infrastructure allows that. On the end-user side there is no elasticity, so the first option is no option for our frontend friends.

The approaches taken to optimize the frontend are very creative and innovative compared to the backend. This also puts design pressure on the frontend application.

Algorithm 
Backend systems have more options for the algorithms that can be used to solve a problem. For example, disk-based algorithms are very common in data-intensive backend applications, but on the frontend this option is not available, or only in a very limited way, and you have to be very creative in how you use it.

I think many chapters of the algorithms book are not applicable to the frontend.

Patterns
On the frontend side the industry is inventing new patterns every day, while the backend side does not move at that pace. Examples are functional composition, incremental rendering, state management using immutable data structures, event-driven systems, chunking of requests, and handling late arrival of information due to a slow network.


Artifact/Packaging
Backend systems are never looked at through the lens of how big the jar/exe/dll is, but this is the first challenge to be solved on the frontend side, because the package must be small so that clients can download it quickly.
The network plays a role here. Remember 100 GBs vs KBs?

Requirements
This can be a little controversial, but in many cases frontends are built with fuzzy or no requirements, and requirements are added later. Requirements are a must for the backend!

Testing
This is the hardest part of the frontend. I will leave it for now because it needs a multi-part blog series of its own.

Conclusion

I know it might look like I am just trying to sell the idea that frontend development is more complex, but my point is that you become a better engineer when you move to the frontend.
If you are not doing any frontend work, find a way to do some, or learn about these hard problems from frontend engineers. In case they say "I don't think about these challenges", then educate and help them.






Saturday, 17 August 2019

JVM with no garbage collection

The JVM community keeps adding new garbage collectors, and one addition, introduced in JDK 11, is called Epsilon. It is a very special one: Epsilon only allocates memory and never reclaims it.


You might ask what the use is of a GC that does not perform any garbage collection. This type of garbage collector has special uses, and we will look into some of them.

Where can this shiny GC be used?
Performance Testing

If you are developing a solution that has tight latency requirements and a limited memory budget, this GC can be used to test the limits of the program.

Memory Pressure Testing
Want to know the exact transient memory requirement of your application? I find this useful if you are building a pure in-memory solution.

Benchmarking Algorithms
Many times we want to test the real performance of a cool new algorithm based on our understanding of Big O notation, but the garbage collector adds noise during testing.

Low Garbage
Many times we optimize an algorithm to reduce the garbage it produces, and a GC like Epsilon helps in scientific verification of the optimization.

How to enable epsilon GC

JVM engineers have taken special care that this GC is not enabled by default in production, so to use it we have to pass the JVM options below.

-XX:+UnlockExperimentalVMOptions -XX:+UseEpsilonGC -Xlog:gc


One question that might come to mind is what happens when memory is exhausted. The JVM will stop with an OutOfMemoryError.

Let's look at some code to test the GC.

How to know if epsilon is used in JVM process?

Java has a good management API that allows querying the GC currently in use. This can also be used to verify what the default GC is in different versions of Java.

Run the code with the options below
-XX:+UnlockExperimentalVMOptions -XX:+UseEpsilonGC VerifyCurrentGC

How does code behave when memory is exhausted?

I will use a small memory allocation program (MemoryAllocator) to show how the new GC works.

Running it with the default GC and requesting about 5 GB of allocation causes no issue (java -Xlog:gc -Dmb=5024 MemoryAllocator), and it produces the output below

[0.016s][info][gc] Using G1
[0.041s][info][gc] Periodic GC disabled
Start allocation of 5024 MBs
[0.197s][info][gc] GC(0) Pause Young (Concurrent Start) (G1 Humongous Allocation) 116M->0M(254M) 3.286ms
[0.197s][info][gc] GC(1) Concurrent Cycle
[0.203s][info][gc] GC(1) Pause Remark 20M->20M(70M) 4.387ms
[0.203s][info][gc] GC(1) Pause Cleanup 22M->22M(70M) 0.043ms
[1.600s][info][gc] GC(397) Concurrent Cycle 6.612ms
[1.601s][info][gc] GC(398) Pause Young (Concurrent Start) (G1 Humongous Allocation) 52M->0M(117M) 1.073ms
[1.601s][info][gc] GC(399) Concurrent Cycle
I was Alive after allocation
[1.606s][info][gc] GC(399) Pause Remark 35M->35M(117M) 0.382ms
[1.607s][info][gc] GC(399) Pause Cleanup 35M->35M(117M) 0.093ms
[1.607s][info][gc] GC(399) Concurrent Cycle 6.062ms

Let's switch to Epsilon and add a memory limit (java -XX:+UnlockExperimentalVMOptions -XX:+UseEpsilonGC -Xlog:gc -Xmx1g -Dmb=5024 MemoryAllocator)

[0.011s][info][gc] Resizeable heap; starting at 253M, max: 1024M, step: 128M
[0.011s][info][gc] Using TLAB allocation; max: 4096K
[0.011s][info][gc] Elastic TLABs enabled; elasticity: 1.10x
[0.011s][info][gc] Elastic TLABs decay enabled; decay time: 1000ms
[0.011s][info][gc] Using Epsilon
Start allocation of 5024 MBs
[0.147s][info][gc] Heap: 1024M reserved, 253M (24.77%) committed, 52640K (5.02%) used
[0.171s][info][gc] Heap: 1024M reserved, 253M (24.77%) committed, 103M (10.10%) used
[0.579s][info][gc] Heap: 1024M reserved, 1021M (99.77%) committed, 935M (91.35%) used
[0.605s][info][gc] Heap: 1024M reserved, 1021M (99.77%) committed, 987M (96.43%) used

Terminating due to java.lang.OutOfMemoryError: Java heap space

This particular run caused an OOM error, which is good confirmation that the program crashes once it has allocated 1 GB.

The same behavior holds for multi-threaded programs; refer to MultiThreadMemoryAllocator.java for a sample.

Unit tests are available to test the features of this special GC.

I think Epsilon will find more use cases and adoption in future, and it is definitely a good step to increase the reach of the JVM.

All the code samples are available in the GitHub repo.



Wednesday, 14 August 2019

Need driven software development using Mocks

There is an excellent paper on mocking frameworks by the jMock authors. The paper was written in 2004, 15 years ago, but it has many tips on building maintainable software systems.


In this post I will highlight key ideas from the paper, but I suggest you read the paper itself to get the big ideas behind mocking and programming practice.

Mock objects are an extension of test-driven development

Mock objects are useful when we start thinking about writing the test first, as they allow us to mock parts that are not developed yet. Think of it as a better way of building a prototype system.

Mock objects are less interesting as a technique for isolating tests from third-party libraries

This is a common misconception about mocks, and I have seen and written a lot of code using mocks this way. It was a real eye-opener, coming from the authors of a mocking framework.

Writing test is design activity

This is so true, but as engineers we often take shortcuts and throw away the best part of writing tests. Design that is driven by tests also gives insight into the real problem, and it leads to invention because the developer has to think hard about the problem and avoid over-engineering.

Coupling and cohesion 

As we start writing tests, we get a good idea of the coupling and cohesion decisions we are making. Good software has low coupling and high cohesion. This also leads to functional decomposition of tasks.
Another benefit of a well-designed system is that it does not violate the Law of Demeter; violations are one of the common problems that get introduced into a system unknowingly. Lots of microservices suffer from this anti-pattern.

Need driven development
Mocking requires explicit code and setup, so it is driven by the need of the test case. You don't code based on a forecast that some feature will be required in 6 months, which lets you focus on the needs of the customer. All the interfaces produced as a result of the tests are narrow and fit for purpose. This type of development is also called top-down development.

Quote from the paper
"""
We find that Need-Driven Development helps us stay focused on the requirements in hand and to develop coherent objects.
"""


Programming by composition

A test-first approach makes you think about the composability of components: everything is passed as a constructor argument or as a method parameter.
Once a system is built using this design principle, it is very easy to test or replace parts of it.
Mock objects help you think about composability, because some parts of the system have to be replaceable by mocks.

Mock test becomes too complicated
One observation in the paper is about the complexity of mock tests.
If the system design is weak, mocking will be hard and complicated. It amplifies problems such as coupling and poor separation of concerns. I think this is the best use of mock objects: getting feedback on the design and using it as a motivator to make the system better.

Don't add behavior to mock
As per the paper, we should never add behavior to a mock object, and if you feel the temptation to do that, it is a sign of misplaced responsibility.

If you like the post then you can follow me on twitter to be notified about random stuff that I write.


Thursday, 25 July 2019

Exception handling

In this post I will share how error handling is done and what options we have. Error handling is a complex topic :-)


Before going down the rabbit hole of exception handling, I will add some context from Wikipedia on what it is:
Exception handling is the process of responding to the occurrence, during computation, of exceptions – anomalous or exceptional conditions requiring special processing – often disrupting the normal flow of program execution. It is provided by specialized programming language constructs, computer hardware mechanisms like interrupts or operating system IPC facilities like signals.
In general, an exception breaks the normal flow of execution and executes a pre-registered exception handler. The details of how this is done depends on whether it is a hardware or software exception and how the software exception is implemented. Some exceptions, especially hardware ones, may be handled so gracefully that execution can resume where it was interrupted.
Source - https://en.wikipedia.org/wiki/Exception_handling 

A few things are highlighted here: "disrupting the normal flow of program execution", "pre-registered exception handler", and hardware or software exceptions.
That explains what error handling is, so I will not spend time on that.

One interesting thing mentioned is that two types of exception exist (hardware and software). Our hardware friends have handled theirs very gracefully, and it is on software engineers to do their part.

Software exceptions are the hard ones, and having so many programming languages makes it even harder.
I want you to refer to the simple-testing-can-prevent-most post, in which I try to explain the side effects of wrong error handling, which are purely the result of the exception handling pattern.


C way
I am sure you have seen this and thought "is this the right way?".
Code snippet of C error handling



This approach has several issues
 - You have to check for an error after calling every function that can fail.
 - The compiler gives no safety by forcing or indicating that an error can be raised at this point.
 - Error handling is completely optional.


Java Way

Then came Java with the mindset of "let me fix all the error handling", and it introduced checked and unchecked exceptions.
Look at the code snippet



This approach has even more issues
- Code is full of verbose error handling.
- The compiler forces you to handle checked exceptions, which often gets done in the wrong way (i.e. just log it or ignore it).
- Nothing meaningful is done apart from logging something and wrapping it in a RuntimeException to get past the compiler.
- Wrapping makes things worse because you start losing context on what caused the error.

Functional Programming Way
This world had to do better than the "imperative" world, and what did they do? Invented monads.


Things to consider when using this approach
- You have to learn the fancy technical jargon of category theory or monads.
- A function now returns one of two values, and you have to write slightly less verbose code to handle both paths.
- Performance suffers from the extra wrapping of values, and when you have millions of them it hits you very hard.
- There is no compile-time safety; the caller has the option to get around it by getting the value directly.
- I think this was an attempt to fix Optional/Maybe values, where you don't know why the value is not available.
- The stack trace is gone, and in some cases it is useful, especially when building a system that calls third-party libraries.


Go Lang Way
Go wanted to do better than C/Java and came up with a delegation (return the error to the caller) or killer (i.e. panic) approach.


This is an interesting approach, but
- With an err returned from every function call, the caller has to add error handling code everywhere.
- The trace is lost, so you have to be very careful to add all the context to the message so that recovery is possible.
- Panic is not good for a library or framework because it can't kill the process; it has to be the client's responsibility to decide what to do.


JavaScript/Python way
I will leave this for now.


There is no clear winner, and each language makes some trade-off. We don't like exceptions the Java way, and we don't like the Go way either; they are the two ends of the pendulum.

What could be good is giving the caller the option to decide which approach to use, whether Java style or Go style.
We need a better way of separating control flow from errors, because in some cases a default value on error is a good option, in others just cleanup or recovery, and in others sending the error to an upper layer for better handling.

The code in the catch block tells a lot about what the client wants, and that should decide which error handling pattern you use. I think it is more about education on what is right in a given context and then using that pattern.

I am happy to hear what you think about error handling and how it should be done.

Sunday, 28 April 2019

Cargo Cult - Innovation Center

Cargo cult is a serious problem in the software industry. I think the innovation center is an example of cargo cult.
In this post I will share my views on innovation centers.

Cartoon: man carrying ideas sees that door to Innovation Centre is closed.
Innovative Idea killer

People think an innovation center is a cool place to work, but many things, or almost everything, done at an innovation center fails.

Why is an innovation center opened?

It is hard to innovate in the current setup because of process, ceremony, approvals, permissions and so on. So what the company or team does is say "Let's open Innovation labs" or start an "Innovation initiative". Rather than fixing the real problems that are causing friction, this new innovation thing is started to feel better.
Teams are changing the way they behave, not how they think.

These labs are nothing more than expensive press release stunts, and they add no value.

No alignment with product road map
The lab does not align with the product road map or the big mission. It runs on a parallel side track that is never going to meet the main track.

The work is new, interesting and shiny, but it never scales, and finally the team asks "How do we launch this?"

Then there is the thinking of "New". It has to be a new thing, a new brand, a new experience, a new tech stack and so on. This new thing will never fit into the old (i.e. the product roadmap), so no one wants it.

Working for wrong customer
The team works for the board of directors or the executive who signs the cheque for these labs. It starts working for the wrong customer (i.e. executives or the press) and ignores the real customer.

Unfulfilled dream
You get all the creative people for these labs, and very soon they are left with an unfulfilled dream because nothing gets used by real customers. The team gets frustrated and burns out.
It becomes mental gymnastics that goes nowhere, and to make it worse, learnings and failures are not passed on to the real delivery team that would make it production ready.

Too much of freedom
Balance is missing in innovation labs. They always run in exploration or experiment mode and never get close to exploiting what they have learned.
Too much process makes you a slave, and no process or structure leads to free fall. Finding the right balance is important.


No credibility 
Very soon the innovation team loses all credibility and people ask questions like
 - Why are they doing this?
 - What are they producing?
 - Why don't they come and talk to the tech teams?
The innovation team's morale is crushed.

Idea development is not inclusive

Any product that is successful in the market needs 3 things: it should be feasible, desirable and profitable.
Feasible is where the tech team comes in and confirms that it is feasible to build the product.
Desirable is the design team confirming whether anyone wants it.
Profitable is the product team looking at it from the financial or brand-gain angle.

In most ideas driven by an innovation team, these 3 groups are not partners, and if they are, one of them dominates, so the needle does not move in the right direction.

Change innovation as spectacle into innovation as a strategy for building better products.
Keep innovating :-) 

Sunday, 10 February 2019

Adaptive scheduling of Spark Job using YARN API

In the last blog, poorman spark monitoring, I shared an approach to figure out how long a Spark job is waiting for resources.

This post covers some more details on how to be proactive when a Spark job is stuck due to resource constraints.

A little recap of Yarn.
Apache Yarn is the resource management and job scheduling layer of the Hadoop distributed processing framework.

MapReduce NextGen Architecture

Hadoop clusters are shared between teams, and for proper utilization teams or projects are each allocated some of the cluster's capacity.

One of the popular scheduling approaches for a multi-tenant cluster is the "Capacity Scheduler", which is based on queues.
Yarn allows you to define min and max resources for a queue, and queues are hierarchical. It looks something like below

Capacity Scheduler


One issue that can happen with the Capacity Scheduler is that your job is submitted to an overloaded queue and gets stuck in the Accepted state for a long time, although other queues have capacity that is left unused.
Another common issue is that the job starts running but does not get all the resources (cores/memory) it asked for, and runs forever because the queue is overloaded.

Yarn provides a REST API to query the state of the cluster, queues and applications, and that can be used to solve issues where resources are available in the cluster but the application is not using them :-)

Yarn API to build adaptive job submission
The Yarn API comes in very handy for solving both of the above issues. Some strategies that use it:

 - Submit the job to a queue that has capacity.
This strategy selects the queue at run-time and submits the application to the least loaded queue.

 - Move the job to a queue that has capacity.
This strategy monitors the job status, and if the job is not moving or gets stuck in the "Accepted" state, moves it to a queue that has some capacity.

An abstraction over the Yarn API provides the minimum details needed for adaptive job submission.

Once we have all the metrics required for making a decision, it becomes straightforward to submit or move the job.
Below code snippet tries to move the job based on a simple strategy of max wait time for an app in Accepted status.

Yarn exposes lots of metrics that can be used to build an adaptive system. You can refer to ResourceManagerRest for the full set of APIs.

A word of caution: be fair when you use this strategy, and don't take the whole cluster for yourself.


The code used in this post is available @ yarn github project

Sunday, 3 February 2019

Golang control statements

We are going to explore Go's control structures. This post does not cover every control statement, but you can refer to control-structures in Effective Go for all the details.

Refer to the index page for all the content written so far.

One thing that I like about Go's control structures is that they are very easy to understand.
The focus on readability is clearly visible.


If statement
The Go authors managed to remove the extra parentheses in the if statement. It looks something like

if x > 10 {
	fmt.Println("I am gt ", x)
} else {
	fmt.Println("I am lt ", x)
}

Another variation includes both initialization and condition

if value := time.Now().Weekday(); value == time.Sunday {
	fmt.Println("Yahoooo.. today is sunday")
} else {
	fmt.Println("Lets get back to work. I hate", value)
}

Switch Statement
Switch case has a few variations

Simple one
value := 10
switch value {
case 10:
	fmt.Println("Value is 10")
default:
	fmt.Println("Some other value than 10")
}

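With no expression
// Cases are evaluated top to bottom and the first match wins,
// so the narrower check has to come first.
switch {
case value >= 20:
	fmt.Println("Value is gte 20")
case value >= 10:
	fmt.Println("Value is gte 10")
}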

Switch with multiple conditions in a single case

specialValue := '@'
switch specialValue {
case '@', '!', '#':
	fmt.Println("This is special value")
default:
	fmt.Println("This is normal value")
}

Type switch
A type switch uses the variable.(type) expression, which is only valid inside a switch statement.

var t interface{}
t = "James"
switch t.(type) {
case int:
	fmt.Println("Int value", t)
case string:
	fmt.Println("String value", t)
}
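A common variation binds the value inside the type switch, so each case can use it with its concrete type. The sketch below uses a made-up describe function and also shows a plain type assertion with the "comma ok" form for checking a single type.

```go
package main

import "fmt"

// describe binds the concrete value to v, so each case can use it with
// its real type instead of the interface type.
func describe(t interface{}) string {
	switch v := t.(type) {
	case int:
		return fmt.Sprintf("int, doubled: %d", v*2)
	case string:
		return fmt.Sprintf("string of length %d", len(v))
	default:
		return fmt.Sprintf("unhandled type %T", v)
	}
}

func main() {
	fmt.Println(describe(21))
	fmt.Println(describe("James"))
	fmt.Println(describe(1.5))

	// A plain type assertion with the "comma ok" form checks a single
	// type without panicking when the type does not match.
	var t interface{} = "James"
	if s, ok := t.(string); ok {
		fmt.Println("assertion succeeded:", s)
	}
}
```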

Loops
Go has only one loop keyword (for), and it can be used for all purposes.

C/Java like
It has init, condition and post sections.

value := 0
for counter := 0; counter < 10; counter++ {
	value++
}

fmt.Println(value)

Just condition
value = 0
for value < 10 {
	value++
}

fmt.Println(value)

Infinite (with no condition)

for {
	value++
	if value > 10 {
		break
	}
}

Smart loops
This is useful when dealing with arrays, slices, maps and channels

days := []string{"Sunday", "Monday"}
for index, value := range days {
	fmt.Println("Index ", index, "Value ", value)
}

The range keyword is very powerful; it works with all the collection types.
Another thing I like about Go is that the compiler helps with a lot of common errors. For example, an unused variable is a compile error, so the example below fails to compile because index is not used.

days := []string{"Sunday", "Monday"}
for index, value := range days {
	fmt.Println("Value ", value)
}

It is possible to ignore a value by using "_", for example

days := []string{"Sunday", "Monday"}
for _, value := range days {
	fmt.Println("Value ", value)
}
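The post only shows range over a slice. As a small sketch of the other collection types, the example below ranges over a map and a channel; sortedKeys and sum are made-up helper names.

```go
package main

import (
	"fmt"
	"sort"
)

// sortedKeys ranges over a map. Map iteration order is not specified,
// so the keys are sorted before use.
func sortedKeys(m map[string]int) []string {
	keys := make([]string, 0, len(m))
	for k := range m {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return keys
}

// sum ranges over a channel and stops when the channel is closed.
func sum(ch <-chan int) int {
	total := 0
	for v := range ch {
		total += v
	}
	return total
}

func main() {
	fmt.Println(sortedKeys(map[string]int{"Sunday": 0, "Monday": 1}))

	ch := make(chan int, 3)
	for i := 1; i <= 3; i++ {
		ch <- i
	}
	close(ch) // without close, the range in sum would block forever
	fmt.Println(sum(ch))
}
```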

The samples used in this post are available @ 003-statement github repo

Saturday, 26 January 2019

Poorman Spark monitoring

Spark exposes lots of metrics that give insight into what is happening inside a Spark application, but sometimes you are looking for quick metrics on a Spark application.

In this post I will share examples of some metrics that can be collected quickly using a simple pattern.

How long is my Spark application waiting for resource allocation?
I have always felt the need for this metric when running in a shared cluster with limited capacity allocated to each user.
This metric is useful for knowing when a Spark job is stuck because the cluster is busy.

The pattern is very simple: start a timer thread that monitors Spark context creation and logs the elapsed time at regular intervals.

Code snippet for the monitor

Just start the timer before the SparkContext is created, using the code below

 monitoringExecutor.submit(newCallable(checkSparkContext))

How many records is a Spark job/stage processing?

This is based on the pattern that we need a distributed counter to track how many records are processed by a stage.
Spark has something called LongAccumulator that can be used for capturing metrics like this.

So we need a block of code that just logs the value of the accumulator and takes some action if it is not moving fast enough.
Code snippet for tracking records processed.

monitorRecordsProcessed is submitted for async execution, and processData in the map function increments the counter.

A note about accumulators: they are variables shared between the driver and executors. If an accumulator is written on the executor side, there is a chance of multiple or double writes due to retries of failed stages.
So take that into account when dealing with accumulators. They are very good as a debugging tool or for giving interactive feedback on processing, but they can contain some noise when jobs or stages are failing.

Spark uses accumulators to track internal stage metrics, and all of that is available in the Spark UI.

How do we get access to Spark internal metrics?

Now we are getting into rich man's monitoring. Let's look at an example that gives access to the internal accumulators and also exposes an API to get all the metrics during job execution.

The code below logs all the accumulators at stage level. The only rule is to give the accumulator a name so that it shows up alongside the internal Spark metrics.

Once the listener is defined, just add it to the SparkContext

sparkSession.sparkContext.addSparkListener(new StageAccumulatorListener)

This listener will start showing all the accumulators. Sample of logs

The record counter also shows up in this list because it was given a name.

Spark provides an API to access metrics during execution, and it can be used to build a proactive monitoring system.

All the code used in this post is available in the poormonitor GitHub repo



Sunday, 20 January 2019

Value of pass by value in GoLang

Now we are getting into some of the core concepts! Knowing this is very important for understanding the impact a Go program will have on the machine.

Everything is pass by value in Go, no matter what you pass. This also means that what you see is what you get.

Each goroutine (i.e. path of execution) gets a stack, which is contiguous memory. A goroutine needs the stack to do all of the allocation it requires. We will learn about goroutines later, but a goroutine is just like a thread, only much lighter.

As a goroutine executes a function, it takes a slice, or portion, of memory from the stack that was allocated. That portion is the function's stack frame.

Let's try to understand this with a simple example


func main() {
	counter := 0

	counter++
	fmt.Println("In main", counter)
	inc(counter)
	fmt.Println("After inc", counter)
}


Stack frame state when inc is executing

Stack Frame

A function can only read and write directly within its own stack frame, which is why function parameters are required.
In the example above, any change made by the inc function is local to its stack frame; if it wants to share the change with the caller, it has to return the value so that it can be copied into the caller's frame.
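A minimal sketch of the point above. The post does not show the body of inc, so the version here is an assumption, and incAndReturn is a hypothetical helper added only to illustrate returning the value to the caller.

```go
package main

import "fmt"

// inc receives a copy of the caller's value, so the increment
// only changes the copy that lives in inc's own stack frame.
// (Assumed body; the post only shows the call site.)
func inc(counter int) {
	counter++
	fmt.Println("In inc", counter)
}

// incAndReturn is a hypothetical variant that returns the new value
// so the caller can copy it back into its own frame.
func incAndReturn(counter int) int {
	counter++
	return counter
}

func main() {
	counter := 0

	counter++
	fmt.Println("In main", counter)
	inc(counter)
	fmt.Println("After inc", counter) // still 1: main's copy was never touched

	counter = incAndReturn(counter)
	fmt.Println("After incAndReturn", counter) // 2: the returned value was copied back
}
```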

Another interesting property of a stack frame is that it is reusable: for example, after inc completes execution, its stack frame is available to another function.
So allocation works like incrementing a pointer into the stack to give memory to a function, and once that function completes, the pointer is decremented to mark the memory as free.

Pass by value is required for safety and for being able to reason about code, something that is missing in many languages.

Let's explore how all this changes when a pointer, or the address of a variable, is passed to a function.

Let's try to understand how the stack frame looks when the code below is executed

func main() {
	counter := 0

	fmt.Println("Before pointer inc ", counter)
	incByPointer(&counter)
	fmt.Println("After pointer inc ", counter)
}

Stack frame using pointer

  
In the example above the parameter is still passed by value, but this time the value is an address.
The called function knows that it has received the address of a variable (&variable), and to change the value it has to use a different operation, the dereference (*variable).

The asterisk (*) operator allows a function to change a variable that lives outside its own stack frame; this variable can be on the heap or in the caller's stack frame.
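As a sketch, this is what the pointer version could look like as a complete program. The body of incByPointer is not shown in the post, so the one below is an assumption about what it does.

```go
package main

import "fmt"

// incByPointer receives a copy of an address (still pass by value)
// and uses * to write to the variable that address points to.
// (Assumed body; the post only shows the call site.)
func incByPointer(counter *int) {
	*counter++
}

func main() {
	counter := 0

	fmt.Println("Before pointer inc ", counter)
	incByPointer(&counter) // share the address of main's counter
	fmt.Println("After pointer inc ", counter)
}
```

Here the write goes through the address into main's stack frame, so the second Println sees the incremented value.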

Having a clear distinction between passing a value and passing the address of a value is very powerful, as it tells you which functions only read and which may write.
Anytime you see an address being taken (&) at a call site, it is a clear signal that the function may mutate that variable.

No magical modification is possible.

Having this clear distinction has a couple of advantages
 - The compiler can do escape analysis to determine what gets allocated on the stack vs the heap. This keeps the GC happy, because stack allocations are cheap while heap allocations carry GC overhead.
 - You decide when to copy a value vs share it. This is very useful for large values; you don't want to copy a 1 GB buffer into a function.
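The copy vs share choice can be sketched as below. The buffer type and both functions are hypothetical names made up for this illustration, and the 1 MiB array simply stands in for "a large value".

```go
package main

import "fmt"

// buffer is a hypothetical large value. An array field is copied
// in full whenever the struct is passed by value.
type buffer struct {
	data [1 << 20]byte // 1 MiB
}

// fillCopy works on its own copy of the whole buffer;
// the caller's buffer is left untouched.
func fillCopy(b buffer) byte {
	b.data[0] = 1
	return b.data[0]
}

// fillShared only receives a copy of the address,
// so it writes into the caller's buffer.
func fillShared(b *buffer) {
	b.data[0] = 1
}

func main() {
	var b buffer

	fillCopy(b)
	fmt.Println("after fillCopy:", b.data[0])

	fillShared(&b)
	fmt.Println("after fillShared:", b.data[0])
}
```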

Go gives developers the option to choose the trade-off rather than giving no control.

Let's look at one more example of how allocation works


func allocateOnStack() stock {
	google := stock{symbol: "GOOG", price: 1109}
	return google
}

func allocateOnHeap() *stock {
	google := stock{symbol: "GOOG", price: 1109}
	return &google
}

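Both of the functions above create a stock value, but look at the return types: one (allocateOnStack) returns a value and the other (allocateOnHeap) returns an address.
The compiler's escape analysis sees that the address returned by allocateOnHeap outlives that function's stack frame, so that value is allocated on the heap, while the value returned by allocateOnStack is simply copied to the caller and can stay on the stack.
So you decide what you want to throw at the GC vs keeping it happy.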
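To try this out, the two functions can be dropped into a small program like the sketch below. The stock type is not shown in the post, so its field types here are an assumption based on the literals used. Building with go build -gcflags=-m asks the compiler to print its escape analysis decisions, which is one way to check where each value ends up.

```go
package main

import "fmt"

// stock is assumed from the literals in the post; the field types are a guess.
type stock struct {
	symbol string
	price  int
}

// allocateOnStack returns the value itself, so it is copied to the caller.
func allocateOnStack() stock {
	google := stock{symbol: "GOOG", price: 1109}
	return google
}

// allocateOnHeap returns the address of a local variable, so the value
// has to outlive this function's stack frame.
func allocateOnHeap() *stock {
	google := stock{symbol: "GOOG", price: 1109}
	return &google
}

func main() {
	byValue := allocateOnStack()
	byAddress := allocateOnHeap()

	fmt.Println(byValue.symbol, byValue.price)
	fmt.Println(byAddress.symbol, byAddress.price)
}
```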

You might have questions about the stack, like how big is it?
Each goroutine starts with a 2 KB stack; it is deliberately small so that goroutines stay cheap.
For many cases that is enough, but if a program keeps putting memory pressure on the stack, the stack grows to meet the need, and only for that specific goroutine.
Stack growth has an allocation and copy cost; it is just like allocating a new, larger array and copying the values over from the previous one.
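Stack growth is invisible to the program, which the sketch below tries to show: a goroutine recurses far deeper than a small initial stack could hold, and the runtime grows that goroutine's stack as needed. The depth function and the padding array are made up for this illustration.

```go
package main

import "fmt"

// depth recurses n levels. The small array gives every frame some
// local data, so deep recursion needs far more stack than a
// goroutine starts with.
func depth(n int) int {
	var pad [128]byte
	if n == 0 {
		return int(pad[0])
	}
	return 1 + depth(n-1) + int(pad[0])
}

func main() {
	done := make(chan int)

	// Only this goroutine's stack grows; other goroutines are unaffected.
	go func() { done <- depth(100000) }()

	fmt.Println("reached depth", <-done)
}
```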

One nice thing about stack memory is that it is monitored by the GC, which will reduce the size of a goroutine's stack if only around 25% of it is in use.


Go gives you the power of compact memory layout using structs and efficient memory allocation using pass by value.

All the samples used in this post are available in the pointers GitHub repo