Thursday, 1 April 2021

Creating Chaos

In software development, chaos engineering is the practice of running experiments against a system in order to build confidence in its ability to withstand unexpected conditions or changes in its environment.

First developed by Netflix in 2011 as part of its move to cloud infrastructure, its underlying principles have since been applied in many situations, but experiments typically include things such as:

  • Deliberately causing infrastructure failures, such as bringing down application servers or databases.
  • Introducing less favourable network conditions, such as increased latency, packet loss or errors in essential services such as DNS.
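As a minimal sketch of the second kind of experiment, the wrapper below injects random latency and errors into an arbitrary service call. The function name, default latency range and error rate are all illustrative, not taken from any real chaos tool:

```python
import random
import time


def flaky_call(func, latency_range=(0.1, 2.0), error_rate=0.2):
    """Wrap a service call, injecting random latency and occasional errors.

    latency_range (seconds) and error_rate are illustrative defaults,
    not values recommended by any particular tool.
    """
    def wrapper(*args, **kwargs):
        # Inject added latency before the real call goes through.
        time.sleep(random.uniform(*latency_range))
        # With probability error_rate, fail instead of calling the service.
        if random.random() < error_rate:
            raise ConnectionError("injected failure")
        return func(*args, **kwargs)
    return wrapper
```

In a test environment the same idea is often applied at the network layer instead (for example with Linux traffic-control tooling), so the application code itself stays untouched.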

In an attempt to automate these experiments, Netflix developed a tool called Chaos Monkey to deliberately tear down servers within its production environment. The guarantee that engineers would see these kinds of failures helped foster a culture of resilience and redundancy.
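The core of that idea can be sketched in a few lines. This is not Chaos Monkey's actual API; the function and callback names are hypothetical, and the real termination call (a cloud provider API) is left to the caller:

```python
import random


def terminate_random_instance(instances, terminate):
    """Pick one running instance at random and terminate it.

    instances: a list of instance identifiers.
    terminate: a caller-supplied callable (e.g. a cloud API call);
               supplied here so the sketch stays provider-agnostic.
    Returns the identifier of the terminated instance.
    """
    victim = random.choice(list(instances))
    terminate(victim)
    return victim
```

The point is less the code than the guarantee: if this runs on a schedule, no team can assume any individual server will stay up.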

We may not all be brave enough to run these experiments in our production environment, but if we choose to experiment in the safety of a test environment, what principles should we be following?

Steady State Hypothesis

A secondary advantage of chaos engineering is that it promotes metrics within the system. If you are to run automated experiments against your system, you must be able to measure their impact to determine how the system coped. If the observed behaviour was not ideal and changes are made, the metrics also act as validation that the situation has improved.

Before running an experiment you should define a hypothesis around what you consider the steady state of your system to be. This might involve error rates, throughput of requests or overall latency. As your experiment runs, these metrics will indicate whether your system is able to maintain this steady state despite the deterioration in its environment.
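A steady-state hypothesis can be expressed as a simple set of threshold checks over observed metrics. The metric names and thresholds below are examples, not a standard schema:

```python
def steady_state_ok(metrics, hypothesis):
    """Check observed metrics against a steady-state hypothesis.

    Both arguments are plain dicts; the keys used here (error rate,
    p99 latency, throughput) are illustrative examples.
    Returns (overall_ok, per_check_results).
    """
    checks = {
        "error_rate": metrics["error_rate"] <= hypothesis["max_error_rate"],
        "p99_latency_ms": metrics["p99_latency_ms"] <= hypothesis["max_p99_latency_ms"],
        "throughput_rps": metrics["throughput_rps"] >= hypothesis["min_throughput_rps"],
    }
    return all(checks.values()), checks
```

Returning the per-check results, not just a boolean, makes it easier to see which part of the hypothesis broke when an experiment fails.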

Vary Real World Events

It's important that the mechanisms you use to degrade the environment are representative of the real-world events your system might have to cope with. We are not looking to simulate an event such as a server failing; we are actually going to destroy one.

How you choose to make up the failures being introduced is likely to depend on the impact such an event could potentially have and/or the frequency at which you think it might occur.

The important consideration is that there should be some random element to the events. The reason for employing chaos engineering is to acknowledge that for any reasonably complicated system it is virtually impossible to accurately predict how it will react. Things you thought could not happen may turn out to be possible.
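Weighting random events by estimated frequency or impact, as described above, can be sketched with a weighted draw. The event names and weights here are made up for illustration:

```python
import random

# Illustrative catalogue of failure events and their relative weights,
# reflecting how often (or how much) we care about each one.
EVENTS = [
    ("kill_server", 0.5),
    ("add_latency", 0.3),
    ("dns_failure", 0.2),
]


def next_event(events):
    """Pick the next failure to inject, weighted by the judged
    likelihood or importance of each real-world event."""
    names, weights = zip(*events)
    return random.choices(names, weights=weights, k=1)[0]
```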

Automate Continual Experiments

As you learn to implement the principles of chaos engineering you may rely on manual experimentation as part of a test-and-learn approach. However, this can be an intensive process; the ultimate goal should be to run continual experiments by introducing a level of automation.

Many tools, including Chaos Monkey, now exist to aid this type of automation. Once you have an appreciation of the types of experiments you want to run, and are confident your system produces the metrics necessary to judge the outcome, these tools can be used to run experiments regularly and frequently.
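Tying the pieces together, one automated experiment follows a simple shape: confirm the steady state, inject a failure, re-measure, then roll back. The function below is a sketch of that loop, not any real tool's API; all four arguments are caller-supplied callables:

```python
def run_experiment(inject, rollback, measure, steady_state):
    """Run one automated chaos experiment.

    inject: introduce the failure (e.g. kill a server).
    rollback: restore the previous environmental conditions.
    measure: return the current metrics.
    steady_state: predicate judging whether metrics meet the hypothesis.
    """
    # Never start an experiment against a system that is already unhealthy.
    if not steady_state(measure()):
        return "aborted"
    inject()
    try:
        # The experiment passes if the steady state survives the failure.
        result = "passed" if steady_state(measure()) else "failed"
    finally:
        # Always restore the environment, even if measurement raised.
        rollback()
    return result
```

Guarding the start and guaranteeing the rollback mirrors the blast-radius concern discussed later: an experiment should never leave the environment worse than it found it.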

The principles of chaos engineering are finding new applications in many different aspects of software development, including system security: for example, deliberately introducing infrastructure that doesn't conform to security best practices in order to measure the system's response and its ability to enforce policy.

Not every system will lend itself to a chaos engineering approach. For example, an on-premise system, where servers are not as easily destroyed as they are in the cloud, may limit the options for running experiments. There also needs to be consideration of the size of the potential blast radius for any experiment, and a plan for returning to the previous environmental conditions should the system fail to recover.

Your system's reaction to many of the experiments you run will likely surprise you, in both good and bad ways. As previously stated, for a system of any reasonable complexity it is unrealistic to expect to have an accurate view of how it behaves under all possible conditions. The experiments you run are a learning exercise to fill in these gaps in your knowledge and ensure you are doing all you can to make sure your system performs the role your users need it to.
