Are you ready: October 2018

Friday, 26 October 2018

Broken promise of Agile

AgileManifesto was written 17 years back(i.e 2001) and is it able to bring the change to industry ?

I would say yes but not is the way authors wanted.

Many consulting company made millions of $ but as software engineer i did not see the change.

How did Agile broke promise

I will put key things that authors wanted so we have some context to discuss about this

Individuals and interactions over processes and tools
Working software over comprehensive documentation

Customer collaboration over contract negotiation

Responding to change over following a plan

This looks so good :-)

Lets start with education industry

"Attend Agile training for 2 days and you are certified Scrum master or Agile Developer"

What did people learn after these workshop ? Standup, Planning Poker, retrospective , backlog grooming , JIRA and many more things.

One thing that is missing is "Agile mindset" , no one can teach or learn this in 2 days so it is big joke that you team went to expensive training and they are Agile Team.

Lets go over main items so see how industry see main goals of Agile

Individuals and interactions over processes and tools

Almost all the team got this wrong, they thought we need more tools JIRA was born, industry created big ceremony(backklog groming, standup, reto etc)
I think have too many things in JIRA tell only one thing that software quality is bad or team is too slow to delivery features or product team is creating shopping list of features that they never wanted.

Now it has gone to extend that you need multiple product owner for just grooming session and scrum masters for running retrospective meeting and to add more project manager to track story points/burn down etc.

Team is spending more time on Jira board rather than talking to team. We killed individual& interactions .

How many tools or process we added ? it is count less .

Working software over comprehensive documentation

People got confused with this and they said i need working software + document. Now nothing got done properly and some team will come and say we are AgileWater it mean we build software fast but also use documents that is required for waterfall.
Developer endup doing more non productive work.

Working Software was also take as team is allowed to release crap in production because we are Agile.
What was meant by working software was MVP not 20% or 40% developed item, feature is done or not done there is nothing like 50% when that feature is released to production.

Team is put in so much pressure to release that they end up taking shortcut and to address that "Tech Debt" drive is required.

When software team goes to product to get fund/approval to fix all the Tech Debt then team comes and say why did you develop & deploy crap piece of software.

Customer collaboration over contract negotiation

This is classic customer used Agile for blackmailing or question team credential to put last minute change. This gives them licences to change the requirement any time.
Not to miss commitment that is taken inform of Story Points and if team miss that then it is expected that they need to put extra hours.

Dev team is not different they use this as excuse to build bad quality of software.

Responding to change over following a plan

Agile never said that build feature without tests, architecture or no plan.
Planning is must and good architecture also but it should be Just enough to move in right direction and it is continuous architecture without end state.

Team took this like no design or architecture to response to change.

Conclusion

One of my favorite questions about sprint is

"How long is your sprint ?"
2/3 weeks ?

"Why 2 weeks , why not 1 month or 1 year ? who decided this ?"
It is written in Agile manifesto or my manager told this or i don't know other teams are doing this .

"How long ticket sits in backlog before making to production ?"
I don't know . Check with my TPM or if some one knows they will come and tell 1 year or 3 year"

Agile was about giving team freedom to choose sprint size based on when they are ready for feedback or customer is ready to give feedback
Sprint can 1 week or 6 months but key point is you should get the feedback after that and adjust.
If customer are not in the feedback loop then go back to waterfall .

Another thing was about Software Craftmanship.
Agile project has so many non developer like Project manager, Project manager, Scrum Master, Agile coach that they don't value Craftmanship , so we developer start new conference on this and get more disconnected.

Agile was written by developer for developer but now we are out and this place is taking by non-developers .

In Agile Conference ask the question "how many developer? "
You will see less hands :-( because they are in other craftman conference.

Agile project are about project management, dates, money , time.
Manager makes sure that plan is made by them and followed by team.

Today all project are agile but they still fail, over budget, never on time.

Any process like Agile has one hidden feedback that is called "Dissatisfaction" and you need respond to that change to become better.

Our Software industry has three inevitable things
- Degradation
- Dysfunction
- Expiry

Degradation -> Maintaining,Transformation
Dysfunction -> Innovation & Challenge
Expiry -> Creating & Starting over

Degradation ,Dysfunction & Expiry applies to people, project, team,process , strategy , organization.

Agile is no different identify the phase and create version 2 of process or find new one that works.

Tuesday, 23 October 2018

Simple Testing Can Prevent Most Critical Failures

Error handling is one of the hardest and ignored part of software development and if system is distributed then this becomes even harder.

Nice paper is written on Simple Testing Can Prevent Most Critical Failures topic.
Every developer should read this paper. I will try to summarized key take away from this paper but will suggest to read the paper to get more details about it.

Distributed system outage is common and some of the recent example are

Youtube was down on Oct,2018 for around 1+ hour
Amazon was down during Prime day on July,2018
Google services like Map,Gmail,Youtube were down numerous time in 2018
Facebook was also down apart from many data leak issues they are facing.

This paper talks about catastrophic failure that happened in distributed system like Cassandra, Hbase , HDFS, Redis, Map Reduce.

As per paper most of the errors are due to 2 reason

- Failure happens due to complex sequence of events
- Catastrophic error are due to incorrect handling
- I will include 3rd one on "ignoring of design pressure" which i wrote in design-pressure-on-engineering-team post

Example from HBase outage

1 - Load balancer Transfer region R from Slave A to Slave
2 - Slave B open region R
3 - Master delete current Zookeeper region R after it is owned by Slave B
4 - Slave B dies
5 - Region R is assigned to Slave C & Slave C open the region
6 - Master tries to delete Slave B znode on Zookeeper and because Slave b is down and whole cluster goes down due to wrong error handling code.

In above example sequence of event matters to reproduce issue.

HDFS failure when block is not replicated.

In this example also sequence of event and when new data node starts it exposes bug of system.

Paper has many more examples.

Root cause of error

92% of the catastrophic error happens due to incorrect error handling.
What this means is that error was deducted but error handling code was not good, does this sound like lots of project you have worked on !

1 - Error are ignored

This is reason of 25% of the failure, i think number will be high in many live system.

eg of such error
catch(RebootException e) {
log.info("Reboot occurred....")
}

Yes this harmless looking log statement is ignoring exception and is very common anti pattern of error handling.

2 - Overcatch exception

This is also very common like having generic catch block and bringing down the whole system

catch(Throwable e) {
cluster.abort()
}

3 - TODO/FIXME in comments

Yes real distributed system in production also has lots of TODO/FIXME in critical section of code.

Some other example of error handling

} catch (IOException e) {
// will never happen
}

} catch (NoTransitionException e) {
/* Why this can happen? Ask God not me. */
}

try { tableLock.release(); }
catch (IOException e) {
LOG("Can't release lock”, e);
}

4 - Feature development is prioritized

I think all the software engineers will agree to it. This is also called Tech Debt and i can't think of better example than Knight Capital bankruptcy which was due to config & experimental code.

Conclusion

All the errors are complex to reproduce but better unit test will definitely catch these, this also shows that unit/integration test done in many system is not testing scenario like service going down and coming back again and how it impacts system.

Based on above example it will look like all error are due to java checked exception but it is not different in other system like C/C++ which does not have checked but everything is unchecked , it is developer responsibility to check for it at various places.

On side note language with no type system like Python makes it very easy to write code that will break at runtime and if you are really unlucky then error handling code will have some type error and it will get tested in production.

Also almost all product will have some static code tool(findbugs) integration but these tools does not give more importance to such error handling anti pattern.

Link to issues mention in paper
HDFS
MapReduce
HBase
Redis
Cassandra

Please share about more anti pattern you have seen in production system.
Till then Happy unit testing.

Friday, 19 October 2018

When microservices becomes darkservices

Micro services is great and many company comes and talk about it on how it is used for scaling team, product etc

Microservices has dark side also and as a programmer you should about it before going on ride.
In this post i will share some of the myths/dark side about micro services

We needs lots of micro services

Before you create any new micro services think about distributed computing because most of the micro services are remote process. First define what "micro" means in problem context it could be lines of code , features or deployment etc

Naming micro services will be easy

Computer science has only 2 complex problem and one of them is "naming", very soon you will run out of options when you have 100s of them.

Non functional requirement can be done later

Suddenly non function requirement like ( latency, throughput, security, reliability etc) becomes very important from day one.

Polyglot programming/persistence or something poly...

Software engineer likes to try latest cutting edge tool so they get carried away by this myth that we can use any language or any framework or any persistence.

Think about skills and maintenance overhead required for poly.... thing that is added, if you have more than 2/3 things then it is not going to fit in head and you have to be on pager duty.

Monitoring is easy

This is one of the most ignored fact about micro services and come as afterthought.
For simple investigation you have to login to many machines , looks in logs , make sure you get the timing right on server etc.

Without proper monitoring tools you can't do this, you need ELK or DataDog type of things.

Read and writes are easy

This thing also get ignored now you are in distributed transaction world and it is not good place to be in and to handle this you need eventual consistent system or non available system.

Everything is secure

Now one service is talking to another services using API, so you need good auth system to make sure your system is secure. If you work in financial system then you will be spending more time in answering security related questions.

My service will be always up

That will never happen no matter how good programmer or infra you have, service will go down and now you are in Middleware land(Kafka,ActiveMq,ZeroMQ etc) to handle this , so that request can be queued while service was not available.

I can add break point to debug it

This is just not possible because now you are in remote process and don't know how many micro services are involved in single request.

Testing will be same

Testing is never same as monolithic, you need better automated test to get out of testing hell.

No code duplication

As you add more services, code sharing becomes hard because any change in some common code required good testing and to avoid that many team start code duplication.

JSON over HTTP

This is one of the biggest myth that all micro services must have Json over Http and it is user facing.

This has resulted in explosion of REST based API for every micro services and is the reason of why many system are slow because they used text based protocol with no type information.

One thing you want to take away from anti pattern of micro services is that rethink that do you really need Json/REST for every service or you can use other optimized protocol and encoding.

Versioning is my grandfather job

Since most of the micro services are remote process , so you have to come with request/response spec and have to manage version for backward compatibility.

Team communication remains same.

This is like hidden elephant in room with more services more team communication is required to keep them posted about what is current version, where it is running , what is broken etc.

You can have more silos because no one knows about whole system

Your product is of google/facebook/netflix scale

This is like buy lottery ticket that you are never going to win.

If you can't write decent modular monolithic then don't try micro services because it is all about getting correct coupling and cohesion. Modules should be loosely coupled and high cohesive.

No free lunch with micro services and if you get it wrong then you will be paying premium price :-)

Thursday, 18 October 2018

Setup SSL in Jetty

Have you faced issues when you have to quickly enable SSL and you got stuck with it :-(

You are not alone, i will share my pain and some learning.

I will share steps to enable SSL on jetty.

Warning: Use below instruction only for dev setup and for production contact your security expert !

Install jetty on your server

Setup some env variable for convenience like

export jetty_home=.../somejetty

export jetty_base = .../your_application_install_location

It is recommended to keep jetty base out side of jetty installation otherwise you will have classpath nightmare

Execute below command to create initial setup for SSL

java -jar $jetty_home/start.jar --add-to-startd=ssl jetty.base=$jetty_base

Once you run above command you will see something like below on console.

Add below line ${jetty.base}/start.d/ssl.ini

--module=https

Check ssl port(jetty.ssl.port) and change it accordingly

Add below line in ${jetty.base}/start.ini

jetty.ssl.port=port
Use same port as ssl.ini file.

Start the server

java -jar $jetty_home/start.jar jetty.base=$jetty_base

You are done :-) Jetty starts on ssl .

Magic Questions

- Which certificate is used by jetty ?

That is the magic, jetty ships with certificate that is already imported in keystore that jetty is using.

Jetty looks for keystore in $jetty_base/etc/keystore location.

- What is password of keystore

Key store password is $jetty_base/start.d/ssl.ini , but it is encrypted. You can use below command to get the password.
java -cp jetty-util-9.2.14.v20151106.jar org.eclipse.jetty.util.security.Password "OBF:1vny1zlo1x8e1vnw1vn61x8g1zlu1vn4"

it is "storepwd"

- How to see what is in key store ? run the below command and enter password
keytool --list -v -keystore keystore

If jetty gives some error like password is wrong or tampered then copy the keystore from $jetty_home/etc/keystore to $jetty_base/etc

It takes only 5 minutes to perform all the steps but only if you know otherwise it is day long frustration. Enjoy development with jetty.