Most applications make use of some kind of data store. To maintain the integrity of that store, many solutions rely on the concept of transactions to manage updates and modifications.
Intertwined with the concept of transactions is that of concurrency. Executing transactions sequentially may protect the data store, but the application would be severely hampered by the resulting lack of throughput.
To strike a balance between performance and consistency, several strategies have evolved for dealing with concurrently executing transactions.
ACID Transactions
Before we deal with concurrency we should first define the properties of a transaction. A transaction represents a set of instructions to run against the data store. Generally this will involve modifications to the data being stored, and it may result in data being returned to the caller.
For transactions to execute while maintaining the integrity of the data store, their implementation should follow certain rules. These rules are often characterised by the acronym ACID:
Atomicity: The instructions executed within a transaction may have side effects on the data being stored. The execution of a transaction should be atomic: either all of its side effects persist or none of them do. This leads to the concept of transactions being "rolled back" should an error occur during execution.
Consistency: All transactions should leave the data store in a consistent state. The definition of consistency will vary between data stores but transactions should not leave the data in an unknown or inconsistent state according to whatever rules may be in place for the data set in question.
Isolation: Transactions should not be aware of each other or otherwise interact. Neither the uncommitted changes made by a transaction nor any errors that occur within it should be visible to, or affect, other transactions.
Durability: Once a transaction has been successfully executed and "committed" to the data store, its effects should persist from that moment on, regardless of any subsequent errors that may happen within the data store.
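To make atomicity and rollback concrete, here is a minimal sketch using Python's built-in sqlite3 module. The accounts table, the account names and the make_bank, transfer and balances helpers are all hypothetical, invented purely for illustration. It relies on the sqlite3 connection acting as a context manager that commits when the block succeeds and rolls back when an exception escapes it.

```python
import sqlite3


def make_bank():
    """Create an in-memory database with two hypothetical accounts."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)]
    )
    conn.commit()
    return conn


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as one transaction; True if committed."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            (balance,) = conn.execute(
                "SELECT balance FROM accounts WHERE name = ?", (src,)
            ).fetchone()
            if balance < 0:
                # Both updates above are undone together by the rollback.
                raise ValueError("insufficient funds")
        return True
    except ValueError:
        return False


def balances(conn):
    """Return the current balances as a dict of name -> balance."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))
```

A transfer that would overdraw the source account returns False and leaves both balances exactly as they were, even though both UPDATE statements had already run inside the transaction.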
Concurrency Problems
So if we ensure that transactions follow the ACID rules, why do we need further strategies for dealing with concurrency? In practice, strict isolation is expensive to enforce and is often relaxed for the sake of throughput, so transactions running concurrently against the same data set can still inadvertently cause errors.
These problems can be intricate depending on the data being stored, but some examples include:
Lost Update: Two transactions both operate on the same data item, setting it to different values. The update from the first transaction is lost once the second transaction has executed.
Dirty Read: A transaction reads a value written by another transaction that has not yet committed. If that other transaction is then aborted and rolled back, the reader has acted on a value that was never actually committed.
Incorrect Summary: A transaction presents an incorrect summary of a data set because a second transaction alters values while the first transaction is creating the summary.
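The lost update problem can be reproduced with a short, deterministic sketch. Rather than using real threads, it writes out one unlucky interleaving of two hypothetical transactions, T1 and T2, by hand against a plain dictionary standing in for the data store. The key name and the amounts are assumptions chosen for illustration.

```python
def interleaved_updates():
    """T1 adds 5 and T2 subtracts 3, but both read before either writes."""
    store = {"stock": 10}
    t1_seen = store["stock"]      # T1 reads 10
    t2_seen = store["stock"]      # T2 reads 10, unaware of T1
    store["stock"] = t1_seen + 5  # T1 writes 15
    store["stock"] = t2_seen - 3  # T2 writes 7, overwriting T1's update
    return store["stock"]


def serial_updates():
    """The same two transactions executed one after the other."""
    store = {"stock": 10}
    store["stock"] = store["stock"] + 5  # T1 runs to completion: 15
    store["stock"] = store["stock"] - 3  # T2 then sees T1's result: 12
    return store["stock"]
```

The interleaved version ends at 7 rather than the 12 produced by serial execution: T1's addition has silently disappeared.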
Concurrency Control
The impact of the problems described in the previous section may vary between applications. This has led to a variety of strategies for reducing or eliminating their effect on the performance or correctness of the application. We won't go into the detail of their implementation here, but they broadly fall into three categories:
Optimistic: An optimistic strategy assumes the chance of transactions clashing is low. Checks on consistency and isolation are therefore delayed until just before a transaction is committed, allowing a high level of concurrency. If it turns out a problem has occurred then the transaction must be rolled back and executed again.
Pessimistic: A pessimistic strategy assumes the chance of errors is high, so a transaction is blocked from executing concurrently with others whenever there is a possibility that doing so could cause an error.
Semi-Optimistic: A semi-optimistic strategy attempts to find a middle ground where transactions that appear safe are allowed to execute concurrently but transactions that appear to carry a risk are blocked from doing so.
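One common way to realise the optimistic strategy is to attach a version number to each item and check it at commit time. The sketch below is a toy, single-threaded illustration under that assumption, not how any particular data store implements it. VersionedStore and update_with_retry are hypothetical names, and a real implementation would need the check-and-write step in commit to be atomic.

```python
class VersionedStore:
    """Toy key-value store that keeps a version number next to each value."""

    def __init__(self):
        self._data = {}  # key -> (value, version)

    def put_initial(self, key, value):
        self._data[key] = (value, 0)

    def read(self, key):
        """Return (value, version) without taking any lock."""
        return self._data[key]

    def commit(self, key, new_value, expected_version):
        """Write only if nobody else has committed since our read."""
        _, current_version = self._data[key]
        if current_version != expected_version:
            return False  # conflict detected: caller must retry
        self._data[key] = (new_value, current_version + 1)
        return True


def update_with_retry(store, key, fn, max_attempts=5):
    """Apply fn to the value, re-reading and retrying on conflict.

    Returns the number of attempts that were needed.
    """
    for attempt in range(1, max_attempts + 1):
        value, version = store.read(key)
        if store.commit(key, fn(value), version):
            return attempt
    raise RuntimeError("too many conflicts")
```

If two transactions read the same version, the first commit succeeds and bumps the version, so the second commit is rejected and must re-read before trying again. Nothing is blocked up front; the cost is paid only when a clash actually happens.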
Which strategy you choose is a balance between performance and consistency of data. An optimistic approach will provide higher performance by allowing a higher level of concurrency, but with the possible overhead of transactions needing to be re-executed if a problem does occur. A pessimistic strategy will offer higher protection against errors, but the blocking of concurrency, for example by table locking in a database, will reduce throughput.
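The pessimistic side of this trade-off can be sketched with an in-process lock standing in for a database lock. This is an analogy only: LockedCounter and run_workers are hypothetical names, and threading.Lock is not how a database implements table locking. It does show the same shape, though: every read-modify-write is serialised, so no update can be lost, but only one worker makes progress at a time.

```python
import threading


class LockedCounter:
    """Counter whose read-modify-write is guarded by a single lock."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def increment(self):
        # Only one thread may be inside this block at a time, so the
        # read and the write below cannot interleave with another thread's.
        with self._lock:
            current = self._value
            self._value = current + 1

    @property
    def value(self):
        with self._lock:
            return self._value


def run_workers(n_threads, n_increments):
    """Run n_threads workers, each incrementing a shared counter."""
    counter = LockedCounter()

    def work():
        for _ in range(n_increments):
            counter.increment()

    threads = [threading.Thread(target=work) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value
```

Because every increment waits for the lock, the final count always equals the number of threads multiplied by the number of increments, at the price of the workers queuing behind one another.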
The correct strategy will vary depending on the nature of your data set and the transactions you need to perform on it. Understanding the costs and benefits of each type of strategy may help you design your schema, or your approach to operating on the data, so that each strategy is used where it is most appropriate.
Sometimes it will be obvious that you are using the wrong strategy: you may see a high level of transactions being aborted and re-run, or you may be suffering from a lack of throughput because of an overly pessimistic approach. As with most things in software engineering, deciding which strategy is best for your needs can be a grey area, but being armed with the costs and benefits of the possible strategies will be invaluable in making that choice.