Sunday 20 September 2020

Transactions and Concurrency

 


The majority of applications make use of some kind of data storage solution. To maintain the integrity of that data, many of these solutions rely on the concept of transactions to manage updates and modifications.

Intertwined with the concept of transactions is that of concurrency: executing transactions sequentially may protect the data store, but the application would be severely hampered by the lack of throughput.

To strike a balance between performance and consistency, several strategies have evolved to deal with concurrently executing transactions.

ACID Transactions

Before we deal with concurrency, we should first define the properties of a transaction. A transaction represents a set of instructions to run against the data store. Generally this will involve modifications to the data being stored and may result in data being returned to the caller.

In order for transactions to be executed whilst maintaining the integrity of the data store their implementation should follow certain rules. These rules are often characterised by the acronym ACID:

Atomicity: The instructions being executed within the transaction may have side effects on the data being stored. The execution of a transaction should be atomic in nature meaning either all side effects persist or none of them. This leads to the concept of transactions being "rolled back" should an error occur during execution.

Consistency: All transactions should leave the data store in a consistent state. The definition of consistency will vary between data stores but transactions should not leave the data in an unknown or inconsistent state according to whatever rules may be in place for the data set in question.

Isolation: Transactions should not be aware of each other or otherwise interact. Any errors that may occur in a transaction should not be visible to or affect other transactions. 

Durability: Once a transaction has been successfully executed and "committed" to the data store, its effects should be persistent from that moment on, regardless of any subsequent errors that may occur within the data store.
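
To make atomicity and rollback concrete, here is a minimal sketch using Python's sqlite3 module. The table, column names and transfer scenario are purely illustrative; the point is that both updates persist together or not at all.

```python
import sqlite3

# Illustrative schema and data; the names here are made up for the example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER NOT NULL)")
conn.execute("INSERT INTO accounts (id, balance) VALUES (1, 100), (2, 50)")
conn.commit()

def transfer(conn, from_id, to_id, amount):
    # Atomicity: the connection's context manager commits both updates together,
    # or rolls both back if an exception is raised part-way through.
    with conn:
        conn.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, from_id))
        conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, to_id))

transfer(conn, 1, 2, 30)
print(conn.execute("SELECT id, balance FROM accounts ORDER BY id").fetchall())  # [(1, 70), (2, 80)]
```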

Concurrency Problems

So if we ensure that transactions follow the ACID rules, why do we need further strategies for dealing with concurrency? Despite the ACID rules it is still possible for transactions to inadvertently cause errors when they run concurrently against the same data set.

These problems can be intricate depending on the data being stored but some examples include:

Lost Update: Two transactions operate on the same data item, setting it to different values; the update from the first transaction is lost once the second transaction executes.

Dirty Read: A transaction reads a value written by another transaction that is later aborted, meaning the value it acted on is rolled back and was never actually committed.

Incorrect Summary: A transaction presents an incorrect summary of a data set because a second transaction alters values while the first transaction is creating the summary.
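
As a simple illustration of the lost update problem, the sketch below fakes two "transactions" working against a plain in-memory dictionary. The store and the sell_one helper are hypothetical; the interleaving of reads and writes is what matters.

```python
# Hypothetical in-memory "data store", used only to show the interleaving.
store = {"stock": 10}

def sell_one(snapshot):
    # Each "transaction" works from the value it read earlier.
    return snapshot - 1

# Both transactions read the same starting value.
read_a = store["stock"]   # 10
read_b = store["stock"]   # 10

# Transaction A commits its result, then transaction B overwrites it.
store["stock"] = sell_one(read_a)  # 9
store["stock"] = sell_one(read_b)  # 9 again: A's update has been lost

print(store["stock"])  # 9, even though two items were sold
```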

Concurrency Control

The impact of the problems described in the previous section will vary between applications. This has led to a variety of strategies for lessening or eliminating their effect on the performance or correctness of the application. We won't go into the detail of their implementation here, but they broadly fall into three categories:

Optimistic: An optimistic strategy assumes the chance of transactions clashing is low. Checks on consistency and isolation are therefore delayed until just before a transaction is committed, allowing a high level of concurrency. If it turns out a problem has occurred then the transaction must be rolled back and executed again.

Pessimistic: A pessimistic strategy assumes the chance of errors is high, so transactions are blocked from executing concurrently if there is a possibility they could cause an error.

Semi-Optimistic: A semi-optimistic strategy attempts to find a middle ground: transactions that appear safe are allowed to execute concurrently, while transactions that appear to carry a risk are blocked from doing so.
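
One common way to implement the optimistic approach is a version check at commit time: the update only succeeds if the row still has the version the transaction originally read. The sketch below assumes a hypothetical items table with a version column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, price INTEGER NOT NULL, version INTEGER NOT NULL)")
conn.execute("INSERT INTO items (id, price, version) VALUES (1, 100, 1)")
conn.commit()

def update_price(conn, item_id, new_price, expected_version):
    # Optimistic check: the write only applies if no one else has bumped the version
    # since this transaction read the row.
    with conn:
        cursor = conn.execute(
            "UPDATE items SET price = ?, version = version + 1 "
            "WHERE id = ? AND version = ?",
            (new_price, item_id, expected_version),
        )
    return cursor.rowcount == 1  # False means a conflict: re-read and retry

price, version = conn.execute("SELECT price, version FROM items WHERE id = 1").fetchone()
if not update_price(conn, 1, price + 10, version):
    print("conflict detected, transaction must be retried")
```

A pessimistic strategy would instead take a lock up front (for example a SELECT FOR UPDATE or a table lock in databases that support it), trading throughput for the guarantee that the conflict can never happen.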

Which strategy you choose is a balance between performance and consistency of data. An optimistic approach will provide higher performance by allowing a higher level of concurrency but with the possible overhead of transactions needing to be re-executed if a problem does occur. A pessimistic strategy will offer higher protection against errors but the blocking of concurrency, for example by table locking in a database, will reduce throughput.

Choosing the correct strategy will vary depending on the nature of your data set and the transactions you need to perform on it. Understanding the impacts and benefits of each type of strategy may help you develop your schema or approach to operating on the data to make the most of each approach where appropriate.

Sometimes it will be obvious you are using the wrong strategy: you may see a high level of transactions being aborted and re-run, or you may be suffering from a lack of throughput because of an overly pessimistic approach. As with most things in software engineering it can be a grey area to decide which strategy is best for your needs, but being armed with the costs and benefits of the possible strategies will be invaluable in enabling you to make a choice.

Sunday 13 September 2020

Anaemic Models and Domain Driven Design

 


Domain Driven Design (DDD) is an approach to software development where classes, including their methods and properties, are designed to reflect the business domain in which they operate.

Some would argue this was the whole motivation behind Object Oriented Programming (OOP) in the first place; however, many code bases don't actually take this approach. Either classes are written to reflect the internal structure of the code rather than the domain, or they tend to be operated upon rather than performing the operations themselves.

This has led to advocates of DDD defining anaemic models as a prevalent anti-pattern.

Anaemic Models

An anaemic model can be thought of as a class that acts purely as a container for data; typically it will consist solely of getters and setters, with no methods that perform operations on the underlying data.

The reason this is often seen as an anti-pattern is that the lack of inherent domain logic allows the model to be in an invalid state. Whether or not this invalid state is allowed to affect the system as a whole depends on other classes recognising it and either fixing it or raising appropriate errors.
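
A minimal sketch of the problem, using a hypothetical order line written in the anaemic style: nothing stops callers constructing or mutating it into a state the business would consider invalid, and the validation ends up living elsewhere.

```python
from dataclasses import dataclass

# Anaemic style: just fields, no behaviour, no invariants.
@dataclass
class OrderLine:
    product_code: str
    quantity: int
    unit_price: float

line = OrderLine(product_code="", quantity=-3, unit_price=9.99)  # silently invalid

# The business rules have to be policed by whichever other code happens to touch the model.
def is_valid(line: OrderLine) -> bool:
    return bool(line.product_code) and line.quantity > 0 and line.unit_price >= 0
```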

The fact that this logic exists in other classes also raises the possibility that the all-important business logic of the domain is scattered across the code base rather than being in one central place. This acts as a barrier to fully understanding the logic of the domain and can easily lead to bugs when the code base is refactored on the basis of these misunderstandings.

The driver behind these anaemic models is often strict adherence to principles such as the Single Responsibility Principle (SRP), driving a desire to separate the representation of data from the logic that acts upon it.

DDD's answer to these issues is to centralise both the storage of data and the logic that acts upon it in the same object.

Aggregate Root

An aggregate root acts as a collection of objects all bound by a common context within a domain. As an example, an aggregate root representing a customer might contain objects representing that customer's address, contact details, order history, marketing preferences and so on.

The job of the aggregate root is to ensure the consistency and validity of the domain by not allowing external objects to hold a reference to or operate on the data in its collection. External code that wishes to operate on the domain must call methods on the aggregate root where these methods will enforce the logic of the domain and not allow operations that would put the domain in an invalid state.
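
A sketch of what this might look like, assuming a made-up Customer aggregate: domain rules live inside the root's methods, and the internal collections are never handed out for external code to mutate.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Address:
    line_one: str
    postcode: str

class Customer:
    """Hypothetical aggregate root: external code goes through its methods, never its internals."""

    def __init__(self, customer_id: str, address: Address):
        self._id = customer_id
        self._address = address
        self._orders: list[str] = []

    def change_address(self, new_address: Address) -> None:
        # A domain rule enforced here rather than in the calling code.
        if not new_address.postcode:
            raise ValueError("a customer address must include a postcode")
        self._address = new_address

    def place_order(self, order_id: str) -> None:
        if order_id in self._orders:
            raise ValueError(f"order {order_id} has already been recorded")
        self._orders.append(order_id)

    @property
    def order_history(self) -> tuple[str, ...]:
        # Expose a read-only copy so callers cannot mutate the collection directly.
        return tuple(self._orders)
```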

As well as ensuring validity, the aggregate root also acts as central documentation for the domain and its associated business logic, aiding understanding of the domain and allowing safer refactoring.

No Anaemic Models?

So should a code base never contain anaemic models? There is a practicality argument indicating that they cannot always be avoided.

Strict adherence to domain rules does come at a price: hydration of these models from API responses or data storage is often complicated because they are not friendly to standard deserialisation or to instantiation directly from the results of a query. This often leads to the need for anaemic models to act as Data Transfer Objects (DTOs), bringing the data into an application before it is applied to a domain.
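
A small sketch of that split, with hypothetical names: a DTO shaped like the API response that deserialises trivially, which is then translated into a domain object that enforces its own rules.

```python
import json
from dataclasses import dataclass

# A plain bucket of fields shaped like the API response.
@dataclass
class CustomerDto:
    id: str
    postcode: str
    opted_in: bool

@dataclass
class MarketingPreference:
    """A small domain object with a rule the DTO deliberately knows nothing about."""
    customer_id: str
    opted_in: bool

    def __post_init__(self):
        if not self.customer_id:
            raise ValueError("a marketing preference must belong to a customer")

payload = '{"id": "c-123", "postcode": "AB1 2CD", "opted_in": true}'
dto = CustomerDto(**json.loads(payload))          # friendly to standard deserialisation
pref = MarketingPreference(dto.id, dto.opted_in)  # translated into the domain afterwards
```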

The important point here is that not all models are designed to represent a domain or have business rules associated with them; some models' role is simply to act as a bucket of information that will be processed or transformed into a domain at some later point. In these situations taking a DDD approach would add extra complexity with no benefit.

Recognising the difference between these types of models is key to choosing the right approach. This will come from a strong understanding of the domain in which your software operates, recognising the domain contexts this leads to, and ensuring these are implemented in the proper way. Other models in your code that exist purely to ease the flow of data through the system can be implemented in a more relaxed manner.

It is very easy for developers to become distant from the domain their code operates in; this doesn't make them bad engineers, but it does hinder their ability to properly model their business in the code base. Make an attempt to understand your business domain and you'll be surprised how much it improves your code.

Sunday 6 September 2020

Event Sourcing

 

Traditionally, data storage systems store the current state of a domain at a particular moment in time. Whilst this is a natural approach, in some complex systems with a potentially rapidly changing data set the loss of the history of changes to the domain can lead to a loss of integrity, or an inability to recover from errors.

Event sourcing is an approach to help solve this problem by storing the history of changes to the domain rather than just the current state.

This won't be suitable or advantageous in every situation, but it is a technique that can offer a lot of benefits when the ability to cope with large numbers of concurrent changes is of utmost importance.

The Problem

Most applications will use a Create, Retrieve, Update, Delete (CRUD) approach to storing the current state of the domain.

This leads to a typical workflow of retrieving data, performing some form of modification on that data and then committing it back to the data store. This will often be achieved by the use of transactions to try to ensure consistency in the data when it may be modified by multiple concurrent processes.

This approach can have downsides: the locking mechanisms used to protect data integrity can impair performance, transactions can fail if multiple processes are trying to update the same data set, and without any kind of auditing mechanism this can lead to data loss.

Event Sourcing Pattern

Event sourcing attempts to address these issues by taking a different approach to what data is stored. Rather than storing the current state of a domain and having consumers perform operations to modify it, an event sourcing approach creates an append-only store that records the events describing the changes to the domain model they relate to.

The current state of the domain is still available by aggregating all events on the domain model, but equally the state of the domain at any point during the accumulation of those events can also be determined. This allows the history of the domain to be walked back and forward, so the cause of any inconsistency can be determined and addressed.

The events will usually have a meaning within the context of the domain, describing an action that took place along with its associated data items and implications for the underlying domain. Importantly, they should be immutable, meaning they can be emitted by the associated area of the application and processed by the data store asynchronously. This leads to performance improvements and reduces the risk of contention in the data store.
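
A minimal sketch of the pattern, using a hypothetical shopping cart: immutable events are only ever appended to the log, and any state of the cart, current or historical, is derived by folding over them.

```python
from dataclasses import dataclass

# Hypothetical immutable events describing changes to a shopping cart.
@dataclass(frozen=True)
class ItemAdded:
    sku: str
    quantity: int

@dataclass(frozen=True)
class ItemRemoved:
    sku: str

event_log: list = []  # append-only store: events are only ever added, never modified

def append(event) -> None:
    event_log.append(event)

def replay(events) -> dict:
    """Derive the state of the cart by folding over the recorded events."""
    cart: dict[str, int] = {}
    for event in events:
        if isinstance(event, ItemAdded):
            cart[event.sku] = cart.get(event.sku, 0) + event.quantity
        elif isinstance(event, ItemRemoved):
            cart.pop(event.sku, None)
    return cart

append(ItemAdded("book-1", 2))
append(ItemAdded("pen-7", 1))
append(ItemRemoved("pen-7"))

print(replay(event_log))      # current state: {'book-1': 2}
print(replay(event_log[:2]))  # state part-way through history: {'book-1': 2, 'pen-7': 1}
```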

When to Use It

Event sourcing is most appropriate when your system would naturally describe its domain in terms of a series of actions or state changes, for example a series of updates to a shopping cart followed by a purchase. It is also an advantageous approach when your application has a low tolerance for conflicting updates in the domain causing inconsistency in the data set.

An event sourcing approach can itself come with downsides, so it is not appropriate for all applications. Most importantly, whenever the state of the domain is visualised the system will only be eventually consistent: the asynchronous nature of event processing means any view of the data might not yet reflect events that are still being dealt with. If your application needs to present an always accurate, real-time view of the data then this form of data store may not be appropriate.

It is also likely to add unwarranted overhead for an application with a simple domain model with a low chance of conflicting modifications causing a loss of integrity.

As with most patterns in software engineering, one size does not fit all. If you recognise the problems described here in the operation of your application then event sourcing may be an alternative approach you could benefit from. However, if your application's domain model is relatively simple and you aren't regularly dealing with problems caused by inconsistency in your data store then there is probably no reason to deviate from your current approach.