Site Reliability Engineering (SRE) is the practice of applying software engineering principles to the management of infrastructure and operations.
Originating at Google in the early 2000s the sorts of things an SRE team might work on include system availability, latency and performance, efficiency, monitoring and the ability to deliver change.
Optimising these kinds of system aspects covers many different topics and areas, one of which is the management of toil.
Toil in this context is not work we don't particularly enjoy doing or don't find stimulating, it has a specific meaning defined by aspects other than our enjoyment of the tasks it involves.
What is Toil?
Toil is work that exhibits some or all of the following attributes.
It is Manual in nature, even if a human isn't necessarily doing the work it requires human initiation, monitoring or any other aspect that means a team member has to oversee its operation.
Toil is Repetitive, the times work has to be done may vary and may not necessarily be at regular intervals, but the task needs to be performed multiple times and will never necessarily be deemed finished.
It is Tactical meaning it is generally reactive, it has to be undertaken either in relation to something happening within the system for example when monitoring highlights something is failing or is sub-optimal.
It has No Enduring Value, this means it leaves the system in the same state as before the work happened. It hasn't improved any aspect of the system or eliminated the need for the work to happen again in the future.
It Scales with Service Growth. Some work items need to happen regardless of how much a system is used. This tends to be viewed as overhead and is simply the cost of having the system in the first place. Toil scales with system use meaning the more users you attract the greater the impact of the toil on your team.
Finally toil can be Automated, some tasks will always require human involvement, but for a task to be toil it must be possible for it to be automated.
What is Toils Impact?
It would be wrong to suggest that toil can be totally eliminated, having a production system being used by large numbers of people is always going to incur a certain amount of toil, and it is unlikely that the whole engineering effort of your organisation can be dedicated to removing it.
Also, much like technical debt, even if you do reach a point where you feel its eliminated the chances are a future change in the system will likely re-introduce it.
But also like technical debt the first step is to acknowledge toil exists, develop ways to be able to detect it and have a strategy for managing it and trying to keep it to a reasonable minimum.
Toils impact is that it engages your engineering resource on tasks that don't add to or improve your system. It may keep it up and running but that is a low ambition to have for any system
It's also important to recognise that large amounts of toil is likely to impact a teams morale, very few engineers will embark on their career looking to spend large amounts of time on repetitive tasks that lead to no overall value.
The Alternative to Toil
The alternative to spending time on toil is to spend time on engineering. Engineering is a broad concept but in this context it means work that improves the system itself or enables to to be managed in a more efficient way.
As we said previously completely eliminating toil is probably an unrealistic aim. But it is possible to measure how much time your team is spending on toil related tasks. Once you are able to estimate this then it is possible both to set a sensible limit on how much time is spent on these tasks but also measure the effectiveness of any engineering activities designed to reduce it.
This engineering activity might relate to software engineering, refactoring code for performance or reliability, automating testing or certain aspects of the build and deployment pipeline. It might also be more aimed at system engineering, analysing the correctness of the infrastructure the system is running on, analysing the nature of system failures or automating the management of infrastructure.
As previously stated we can view toil as a form of technical debt. In the early days of a system we may take certain shortcuts that at the time are manageable but as the system grows come with a bigger and bigger impact. Time spent trying to fix this debt will set you on a path for gradual system improvement, both for your users and the teams that work on the system.