Are you ready: Error Handling


Thursday, 25 July 2019

Exception handling

In this post I will share how error handling is done and what options we have. Error handling is a complex topic :-)


I will add some context from Wikipedia on what exception handling is before going down the rabbit hole.

Exception handling is the process of responding to the occurrence, during computation, of exceptions – anomalous or exceptional conditions requiring special processing – often disrupting the normal flow of program execution. It is provided by specialized programming language constructs, computer hardware mechanisms like interrupts or operating system IPC facilities like signals.

In general, an exception breaks the normal flow of execution and executes a pre-registered exception handler. The details of how this is done depends on whether it is a hardware or software exception and how the software exception is implemented. Some exceptions, especially hardware ones, may be handled so gracefully that execution can resume where it was interrupted.

Source - https://en.wikipedia.org/wiki/Exception_handling

A few things highlighted here are "disrupting the normal flow of program execution", "pre-registered exception handler", and "hardware or software exception".
That explains what error handling is, so I will not spend more time on it.

One interesting thing mentioned is that two types of exception exist (hardware and software). Our hardware friends have handled theirs very gracefully, and it is on software engineers to do their part.

The software ones are the hard ones, and having so many programming languages makes it even harder.
I want you to refer to the simple-testing-can-prevent-most post, in which I try to explain the side effects of wrong error handling, which are purely a result of the exception handling pattern in use.

C way
I am sure you have seen this and thought, "is this the right way?"
Code snippet of C error handling

This approach has several issues:
- You have to check for an error after calling every function that can fail.
- The compiler gives no safety: nothing forces or even indicates that an error can occur at this point.
- Error handling is completely optional.

Java Way

Then came Java with the mindset "let me fix all of error handling", and it introduced checked and unchecked exceptions.
Look at the code snippet.

This approach has even more issues:
- Code is full of verbose error handling.
- The compiler forces you to handle checked exceptions, which often gets done the wrong way (i.e. just log it or ignore it).
- Nothing meaningful is done apart from logging something and wrapping it in a RuntimeException to get past the compiler.
- Wrapping makes things worse because you start losing context on what caused the error.
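The last two points can be sketched in Java. This is only a minimal illustration, not code from any real system: `readConfig`, `loadLosingContext`, and `loadKeepingContext` are hypothetical names, and `readConfig` always fails so that both paths are exercised.

```java
import java.io.IOException;
import java.io.UncheckedIOException;

class WrappingExample {
    // Hypothetical loader that always fails, standing in for real I/O.
    static String readConfig(String path) throws IOException {
        throw new IOException("cannot read " + path);
    }

    // Anti-pattern: wrap only to get past the compiler; the original
    // exception is dropped, so the caller cannot see what went wrong.
    static String loadLosingContext(String path) {
        try {
            return readConfig(path);
        } catch (IOException e) {
            throw new RuntimeException("config load failed");
        }
    }

    // Less harmful: the original exception travels along as the cause.
    static String loadKeepingContext(String path) {
        try {
            return readConfig(path);
        } catch (IOException e) {
            throw new UncheckedIOException("config load failed for " + path, e);
        }
    }

    public static void main(String[] args) {
        try {
            loadLosingContext("app.properties");
        } catch (RuntimeException e) {
            System.out.println("cause: " + e.getCause()); // cause: null
        }
        try {
            loadKeepingContext("app.properties");
        } catch (RuntimeException e) {
            // cause: java.io.IOException: cannot read app.properties
            System.out.println("cause: " + e.getCause());
        }
    }
}
```

Keeping the cause does not make the code less verbose, but at least the context of the original failure survives the wrapping.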

Functional Programming Way
This world had to do better than the "imperative" world, and what did they do? Invented monads.

Things to consider when using this approach:
- You have to learn the fancy technical jargon of category theory or monads.
- A call now gives two values, and you have to write slightly less verbose code to handle both paths.
- Performance issues due to the extra wrapping of values; when you have millions of them, it hits you very hard.
- No compile-time safety: the caller has the option to get around this by getting the value directly.
- I think this was an attempt to fix the Optional/Maybe value, with which you don't know why the value is not available.
- The stack trace is gone, and in some cases it is useful, especially when building a system that calls third-party libraries.

Go Lang Way
Go wanted to do better than C/Java and came up with the delegation (return the error to the caller) or killer (i.e. panic) approach.

This is an interesting approach, but:
- With err returned from every function call, the caller has to add error handling code.
- The trace is lost, so you have to be very careful to add all the context to the message so that recovery is possible.
- Panic is not good for a library or framework because it can't kill the process; it has to be the client's responsibility to decide what to do.

JavaScript/Python way
I will leave this for now.

There is no clear winner, and each language makes some trade-off. We don't like exceptions the Java way, and we don't like them the Go way either; they are the two ends of the pendulum.

What could be good is giving the caller the option to decide which approach to use; it could be Java style or Go style.
We need a better way of separating control flow from errors, because in some cases a default value on error could be a good option, or just cleanup/recovery, or sending the error to an upper layer for better handling.

So the code in the catch block tells a lot about what the client wants, and that should decide which error handling pattern you use. I think it is more about education on what is right in the context and using the appropriate pattern.
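As a small sketch of that idea in Java, here is one failing operation with two callers that want different things. All names (`fetchGreeting`, `greetingOrDefault`, `greetingWithCleanup`) are hypothetical, and the failure is simulated with an empty key.

```java
import java.io.IOException;

class CallerChoice {
    // Hypothetical remote call; an empty key simulates an outage.
    static String fetchGreeting(String key) throws IOException {
        if (key.isEmpty()) {
            throw new IOException("service unavailable");
        }
        return "hello " + key;
    }

    // Caller 1: a default value on error is good enough here.
    static String greetingOrDefault(String key) {
        try {
            return fetchGreeting(key);
        } catch (IOException e) {
            return "hello guest";
        }
    }

    // Caller 2: clean up locally, then send the error to the upper layer.
    static String greetingWithCleanup(String key, StringBuilder audit) throws IOException {
        try {
            return fetchGreeting(key);
        } catch (IOException e) {
            audit.append("rolled back;");
            throw e;
        }
    }

    public static void main(String[] args) {
        System.out.println(greetingOrDefault(""));  // hello guest
        StringBuilder audit = new StringBuilder();
        try {
            greetingWithCleanup("", audit);
        } catch (IOException e) {
            System.out.println(audit + " " + e.getMessage()); // rolled back; service unavailable
        }
    }
}
```

The operation is the same in both cases; only the catch block differs, and it is the catch block that documents what each client actually needs.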

Happy to hear what you think about error handling and how it should be done.

Tuesday, 23 October 2018

Simple Testing Can Prevent Most Critical Failures

Error handling is one of the hardest and most ignored parts of software development, and if the system is distributed it becomes even harder.

A nice paper has been written on this topic: Simple Testing Can Prevent Most Critical Failures.
Every developer should read this paper. I will try to summarize the key takeaways, but I suggest reading the paper for more details.

Distributed system outages are common, and some recent examples are:

- YouTube was down in October 2018 for a little over an hour
- Amazon was down during Prime Day in July 2018
- Google services like Maps, Gmail, and YouTube were down numerous times in 2018
- Facebook was also down, apart from the many data leak issues they are facing

This paper talks about catastrophic failures that happened in distributed systems like Cassandra, HBase, HDFS, Redis, and MapReduce.

As per the paper, most of the errors are due to two reasons:

- Failures happen due to a complex sequence of events
- Catastrophic errors are due to incorrect error handling
- I will add a third one, "ignoring of design pressure", which I wrote about in the design-pressure-on-engineering-team post

Example from HBase outage

1 - Load balancer transfers region R from Slave A to Slave B
2 - Slave B opens region R
3 - Master deletes the current ZooKeeper znode for region R after it is owned by Slave B
4 - Slave B dies
5 - Region R is assigned to Slave C, and Slave C opens the region
6 - Master tries to delete Slave B's znode on ZooKeeper, and because Slave B is down, the whole cluster goes down due to wrong error handling code

In the above example, the sequence of events matters for reproducing the issue.

HDFS failure when a block is not replicated.

In this example too, the sequence of events matters, and the bug in the system is exposed when a new data node starts.

The paper has many more examples.

Root cause of error

92% of the catastrophic failures happen due to incorrect error handling.
What this means is that the error was detected but the error handling code was not good. Does this sound like a lot of projects you have worked on?

1 - Errors are ignored

This is the reason for 25% of the failures; I think the number will be higher in many live systems.

An example of such an error:
catch (RebootException e) {
    log.info("Reboot occurred....");
}

Yes, this harmless-looking log statement is ignoring the exception, and it is a very common anti-pattern of error handling.

2 - Overcatching exceptions

This is also very common: having a generic catch block and bringing down the whole system.

catch (Throwable e) {
    cluster.abort();
}
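A hedged sketch of the difference, in plain Java: the names `Task`, `runOvercatching`, and `runNarrow` are hypothetical, and the returned strings merely stand in for real actions such as aborting a cluster or scheduling a retry.

```java
import java.io.IOException;

class NarrowCatch {
    // Hypothetical unit of work that may fail with a recoverable I/O error.
    interface Task {
        void run() throws IOException;
    }

    // Overcatch: every failure, recoverable or not, takes the drastic path.
    static String runOvercatching(Task task) {
        try {
            task.run();
            return "ok";
        } catch (Throwable t) {
            return "abort cluster";
        }
    }

    // Narrow catch: only the expected, recoverable error is handled here;
    // anything else propagates to a layer that can decide what to do.
    static String runNarrow(Task task) {
        try {
            task.run();
            return "ok";
        } catch (IOException e) {
            return "retry later";
        }
    }

    public static void main(String[] args) {
        Task flaky = () -> { throw new IOException("connection reset"); };
        System.out.println(runOvercatching(flaky)); // abort cluster
        System.out.println(runNarrow(flaky));       // retry later
    }
}
```

With the narrow version a transient network error no longer has the same consequence as a genuine bug, which is the distinction the generic catch block erases.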

3 - TODO/FIXME in comments

Yes, real distributed systems in production also have lots of TODO/FIXME comments in critical sections of code.

Some other examples of error handling:

} catch (IOException e) {
// will never happen
}

} catch (NoTransitionException e) {
/* Why this can happen? Ask God not me. */
}

try { tableLock.release(); }
catch (IOException e) {
    LOG("Can't release lock", e);
}

4 - Feature development is prioritized

I think all software engineers will agree with this. This is also called tech debt, and I can't think of a better example than Knight Capital's near-bankruptcy, which was due to config and experimental code.

Conclusion

All these errors are complex to reproduce, but better unit tests will definitely catch them. This also shows that the unit/integration tests done in many systems do not test scenarios like a service going down and coming back again, and how that impacts the system.
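A "service goes down and comes back" scenario can be tested without any framework by injecting the failure through a test double. The sketch below is an assumption about how such a test might look, not something taken from the paper: `Store`, `FlakyStore`, and `saveWithRetry` are hypothetical names.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

class FaultInjectionDemo {
    // Hypothetical dependency of the code under test.
    interface Store {
        void save(String value) throws IOException;
    }

    // Test double: fails the first `failures` calls, then recovers.
    static final class FlakyStore implements Store {
        final List<String> saved = new ArrayList<>();
        private int remainingFailures;

        FlakyStore(int failures) {
            this.remainingFailures = failures;
        }

        @Override
        public void save(String value) throws IOException {
            if (remainingFailures > 0) {
                remainingFailures--;
                throw new IOException("store is down");
            }
            saved.add(value);
        }
    }

    // Code under test: retries, and reports the last failure to the
    // caller instead of swallowing it once the attempts run out.
    static void saveWithRetry(Store store, String value, int maxAttempts) throws IOException {
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                store.save(value);
                return;
            } catch (IOException e) {
                last = e;
            }
        }
        throw new IOException("gave up after " + maxAttempts + " attempts", last);
    }

    public static void main(String[] args) {
        FlakyStore recovering = new FlakyStore(2);
        FlakyStore down = new FlakyStore(10);
        try {
            saveWithRetry(recovering, "order-1", 3);
            System.out.println(recovering.saved); // [order-1]
            saveWithRetry(down, "order-2", 3);
        } catch (IOException e) {
            System.out.println(e.getMessage());   // gave up after 3 attempts
        }
    }
}
```

The point of the double is that the error path runs on every build, rather than for the first time in production.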

Based on the above examples it may look like all these errors are due to Java checked exceptions, but it is no different in other systems like C/C++, which have no checked exceptions. Everything is unchecked there, and it is the developer's responsibility to check for errors at various places.

On a side note, a language with no static type checking, like Python, makes it very easy to write code that will break at runtime, and if you are really unlucky the error handling code itself will have a type error and will get tested in production.

Also, almost every product will have some static code analysis tool (FindBugs) integrated, but these tools do not give enough importance to such error handling anti-patterns.

Links to the issues mentioned in the paper
HDFS
MapReduce
HBase
Redis
Cassandra

Please share more anti-patterns you have seen in production systems.
Till then, happy unit testing.