Tuesday 26 June 2018

Black Swans and Antifragility


All of us who work in software development carry the scars of a disastrous update, outage or product launch. These experiences shape the way we approach our roles in the future: we become more attuned to possible catastrophe, plan strategies to deal with things going wrong and positively expect them to.

The Black Swan Theory, postulated by Nassim Nicholas Taleb, deals with the nature of unexpected events. Although it applies in a wider context than technology, there are nonetheless parallels with the kinds of events that we as IT professionals have to react to.

Within this theory a black swan is an event with the following properties:
  • It is an outlier, not expected, with past experience not pointing to its possibility.
  • It carries an extreme impact.
  • Human nature makes us concoct explanations for its occurrence after the fact, making it seem explainable and predictable in hindsight.

In software engineering, events that could be considered black swans include sudden increases in the load placed on a system, catastrophic hardware failure or a breach in security.

So are we doomed to suffer the consequences of these events, or can we architect our systems to cope with the aftermath?

Modularity and Weak Links

Black swan events have an increased, or at least more sustained, impact on complex systems.

Complexity breeds mistakes and makes solutions harder to envisage, so a strategy for combating complexity can help address both the causes and the effects of disastrous events.

Composition, breaking down a system's problem space into multiple simplified chunks, can work at many levels, from individual blocks of code to whole sub-systems.

Making these blocks swappable, with the ability to re-configure and re-organise them, not only allows functionality to be changed easily, it also allows non-functioning areas of a system to be quickly fixed or replaced.

The links between modules can transmit stress and failure as well as functionality. The weaker they are, the more easily they can be broken when necessary.
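One common way to make a link between modules deliberately breakable is the circuit breaker pattern. The sketch below is only an illustration and not a production implementation: the class name, the thresholds and the injectable clock are all assumptions made for the example.

```python
import time


class CircuitBreaker:
    """Wraps calls to a dependency and cuts the link after repeated failures.

    All names and default values here are hypothetical, chosen for illustration.
    """

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before cutting the link
        self.reset_after = reset_after    # seconds to wait before trying the link again
        self.clock = clock                # injectable so the behaviour can be tested
        self.failures = 0
        self.opened_at = None             # None means the link is intact

    def call(self, func, fallback):
        # While the link is cut, skip the dependency until the cool-off has passed.
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()
            # Cool-off is over: restore the link and start counting afresh.
            self.opened_at = None
            self.failures = 0
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # cut the link
            return fallback()
        self.failures = 0  # a success resets the failure count
        return result
```

A caller would wrap each request to the other module, for example `breaker.call(fetch_prices, use_cached_prices)`, where both function names are placeholders. While the link is cut the failing module is not called at all, so its problems are not passed on to the caller.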

Redundancy and Diversity

Black swan events related to hardware or service failure become more catastrophic when no alternative is available.

Redundancy and diversity are strategies for ensuring that alternatives exist so that service can continue. Discussions around redundancy and diversity can get caught up in slightly pedantic arguments, so for the purposes of this discussion let's try to simplify.

Redundancy can be viewed as having more than one of a particular resource, while diversity can be thought of as having more than one channel for the functionality.

Using the example of databases, redundancy would be achieved by having backups, mirroring or replication, whereas diversity might be achieved by using multiple cloud providers for hosting data.
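As a rough sketch of how the two ideas combine in code, the function below tries a list of data sources in order and returns the first answer it gets. The function name and the source labels are hypothetical. In practice each `fetch` would be a client for a primary database, a replica of it (redundancy) or a copy hosted with a different provider (diversity).

```python
def read_with_failover(key, sources):
    """Try each (name, fetch) pair in order and return the first success.

    `sources` is a list of (label, callable) pairs. The labels are only used
    for reporting which source answered or why each one failed.
    """
    errors = []
    for name, fetch in sources:
        try:
            return name, fetch(key)
        except Exception as exc:
            # Record the failure and move on to the next alternative.
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all sources failed: " + "; ".join(errors))
```

Ordering the list as primary, then replica, then the alternative provider means the more expensive alternatives are only touched when the cheaper ones have failed.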

Testing and Probability

No strategy for dealing with failure can be said to be fully implemented unless it has been proven to be effective via testing.

This verification can be as simple as ensuring that a database backup is valid and can be restored from. It can also be as sophisticated and automated as the Chaos Monkey techniques employed by companies such as Netflix.

There have been many instances of companies that believed they had plans to cover every eventuality being left floundering once disaster struck.

If we were to list all the possible disasters that could befall us, they would be numerous, each having a probability and a level of impact on our system.
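At the simple end of that scale, a backup check can be automated in a few lines. The sketch below uses SQLite from the Python standard library purely as a stand-in for whatever database you run. The function name is hypothetical, and comparing row counts is only one possible definition of "valid".

```python
import sqlite3


def backup_and_verify(source, backup_path, table):
    """Copy `source` to `backup_path`, then check that the copy opens and matches.

    `table` must be a trusted table name: it is interpolated into the SQL.
    """
    target = sqlite3.connect(backup_path)
    try:
        source.backup(target)  # sqlite3's online backup API
    finally:
        target.close()

    # Re-open the backup from disk, as a real restore would have to.
    restored = sqlite3.connect(backup_path)
    try:
        status = restored.execute("PRAGMA integrity_check").fetchone()[0]
        restored_rows = restored.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    finally:
        restored.close()

    source_rows = source.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    return status == "ok" and restored_rows == source_rows
```

Running a check like this on a schedule, and alerting when it returns False, turns "we have backups" into something that has actually been proven.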

These two factors need to be balanced along with the cost and effort of mitigation. Whenever protecting against a scenario is relatively straightforward, the protection should be implemented regardless of likelihood.

For everything else, the impact of the event should be balanced against its probability. In these scenarios don't be too quick to write off an event as improbable: if it would bring your system to its knees then it's worth considering having a strategy.

Trying to assess the potential failures in your system is a good way of evaluating your architecture and infrastructure, highlighting technical debt or areas for improvement. If at the same time you can develop strategies to reduce fragility then you will be able to sleep more easily in your bed.
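The balancing act described above can be written down as a small decision rule. This is only a sketch: the 1 to 5 scales, the thresholds and the example risks are all invented for illustration, and real figures would have to come from your own system.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    name: str
    probability: float    # rough likelihood over the period considered, 0.0 to 1.0
    impact: int           # 1 (minor annoyance) to 5 (system brought to its knees)
    mitigation_cost: int  # 1 (trivial) to 5 (major project)


def needs_strategy(risk, cheap=2, catastrophic=5, exposure_threshold=0.5):
    """Return True when the risk deserves a mitigation strategy.

    The three thresholds are hypothetical defaults, not recommended values.
    """
    # Straightforward protections are worth having regardless of likelihood.
    if risk.mitigation_cost <= cheap:
        return True
    # Never write off a catastrophic event just because it seems improbable.
    if risk.impact >= catastrophic:
        return True
    # Everything else: weigh impact against probability.
    return risk.probability * risk.impact >= exposure_threshold
```

For example, `needs_strategy(Risk("region-wide cloud outage", 0.01, 5, 4))` returns True because of its impact alone, whereas a minor, unlikely and expensive-to-fix problem such as `Risk("slow admin report", 0.05, 1, 4)` does not make the cut.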