Are you ready: Simple Testing Can Prevent Most Critical Failures

Tuesday, 23 October 2018

Simple Testing Can Prevent Most Critical Failures

Error handling is one of the hardest and ignored part of software development and if system is distributed then this becomes even harder.

Nice paper is written on Simple Testing Can Prevent Most Critical Failures topic.
Every developer should read this paper. I will try to summarized key take away from this paper but will suggest to read the paper to get more details about it.

Distributed system outage is common and some of the recent example are

Youtube was down on Oct,2018 for around 1+ hour
Amazon was down during Prime day on July,2018
Google services like Map,Gmail,Youtube were down numerous time in 2018
Facebook was also down apart from many data leak issues they are facing.

This paper talks about catastrophic failure that happened in distributed system like Cassandra, Hbase , HDFS, Redis, Map Reduce.

As per paper most of the errors are due to 2 reason

- Failure happens due to complex sequence of events
- Catastrophic error are due to incorrect handling
- I will include 3rd one on "ignoring of design pressure" which i wrote in design-pressure-on-engineering-team post

Example from HBase outage

1 - Load balancer Transfer region R from Slave A to Slave
2 - Slave B open region R
3 - Master delete current Zookeeper region R after it is owned by Slave B
4 - Slave B dies
5 - Region R is assigned to Slave C & Slave C open the region
6 - Master tries to delete Slave B znode on Zookeeper and because Slave b is down and whole cluster goes down due to wrong error handling code.

In above example sequence of event matters to reproduce issue.

HDFS failure when block is not replicated.

In this example also sequence of event and when new data node starts it exposes bug of system.

Paper has many more examples.

Root cause of error

92% of the catastrophic error happens due to incorrect error handling.
What this means is that error was deducted but error handling code was not good, does this sound like lots of project you have worked on !

1 - Error are ignored

This is reason of 25% of the failure, i think number will be high in many live system.

eg of such error
catch(RebootException e) {
log.info("Reboot occurred....")
}

Yes this harmless looking log statement is ignoring exception and is very common anti pattern of error handling.

2 - Overcatch exception

This is also very common like having generic catch block and bringing down the whole system

catch(Throwable e) {
cluster.abort()
}

3 - TODO/FIXME in comments

Yes real distributed system in production also has lots of TODO/FIXME in critical section of code.

Some other example of error handling

} catch (IOException e) {
// will never happen
}

} catch (NoTransitionException e) {
/* Why this can happen? Ask God not me. */
}

try { tableLock.release(); }
catch (IOException e) {
LOG("Can't release lock”, e);
}

4 - Feature development is prioritized

I think all the software engineers will agree to it. This is also called Tech Debt and i can't think of better example than Knight Capital bankruptcy which was due to config & experimental code.

Conclusion

All the errors are complex to reproduce but better unit test will definitely catch these, this also shows that unit/integration test done in many system is not testing scenario like service going down and coming back again and how it impacts system.

Based on above example it will look like all error are due to java checked exception but it is not different in other system like C/C++ which does not have checked but everything is unchecked , it is developer responsibility to check for it at various places.

On side note language with no type system like Python makes it very easy to write code that will break at runtime and if you are really unlucky then error handling code will have some type error and it will get tested in production.

Also almost all product will have some static code tool(findbugs) integration but these tools does not give more importance to such error handling anti pattern.

Link to issues mention in paper
HDFS
MapReduce
HBase
Redis
Cassandra

Please share about more anti pattern you have seen in production system.
Till then Happy unit testing.

Are you ready

Tuesday, 23 October 2018

Simple Testing Can Prevent Most Critical Failures

No comments:

Post a Comment