Transactions ensure that the database remains in a consistent state in spite of system failure and concurrent execution of transactions. However, it is common for databases to offer multiple “isolation levels” that do not prohibit all inconsistencies that arise from the concurrent execution of transactions. The ANSI SQL standard codifies several of these “isolation levels”, but fails to characterize anomalies that arise under modern isolation techniques. This article describes a more general model for understanding modern transaction isolation techniques along with specific examples and implementation notes for PostgreSQL.
A transaction’s active period is the time between the first statement in the transaction and commit or abort time. If two transactions have overlapping active periods then we say they are concurrent. For example, the following diagram depicts the execution of four transactions (T_1 \ldots T_4) where each statement executed is denoted by a node. T_1’s active period is from 0 to 2, and T_2’s active period is from 1 to 5. Overlapping active periods are highlighted. So, T_1 and T_2 are concurrent, as well as T_2 and T_3.
If an interleaving of transactions is equivalent to some nonconcurrent execution of those transactions (i.e. running them one at a time in some order), then we call that interleaving serializable.
To illustrate what can go wrong with modern isolation techniques, consider this concrete example that is possible at PostgreSQL’s default isolation level: Suppose that we have two transactions T_1 and T_2.
T_1 is defined as:
```
v := select amount from account where name = 'John';
update account set amount = amount + v where name = 'Jane';
update account set amount = 0 where name = 'John';
```
and T_2 is defined as:
```
v := select amount from account where name = 'John';
update account set amount = amount + v where name = 'Ted';
update account set amount = 0 where name = 'John';
```
then a possible interleaving of these two transactions is:
```
v_1 := select amount from account where name = 'John';         -- T_1
v_2 := select amount from account where name = 'John';         -- T_2
update account set amount = amount + v_1 where name = 'Jane';  -- T_1
update account set amount = amount + v_2 where name = 'Ted';   -- T_2
update account set amount = 0 where name = 'John';             -- T_1
update account set amount = 0 where name = 'John';             -- T_2
```
John’s original balance has been transferred successfully to two different people, creating money out of thin air due to a serialization anomaly. Let’s identify this serialization anomaly a bit more clearly:
T_1 and T_2 both first read the amount from John’s account — then, at a later point, they both update John’s account. T_1 observed the state of John’s account that exists prior to T_2’s modification. Thus, if there exists an equivalent serial execution of these transactions, it must be that T_1 precedes T_2. Likewise, T_2 observed the state of John’s account before T_1’s modification. Thus, T_2 precedes T_1 in any equivalent serial execution of the transactions. These constraints are contradictory — therefore, this interleaving is not equivalent to any serial execution of T_1 and T_2.
It is noisy to write out SQL to illustrate these different interleavings, so let’s establish a more compact notation.
A history models the interleaved execution of a set of transactions as a linear ordering of their actions (e.g. read, write, commit, abort) on specific rows.
Reads, writes, commits, and aborts are denoted as r, w, c, and a respectively. When a transaction T_1 reads row x we denote this as r_1(x). Likewise w_1(x) indicates that T_1 wrote to row x (inserts, updates, and deletes are all examples of writing to a row). A predicate-read (e.g. `select ... from ... where some_column > some_value`) where row x satisfies the predicate P is denoted r(x \in P).
Using our notation we could describe the above interleaving more succinctly as:
r_1(x) \; r_2(x) \; w_1(y) \; w_2(z) \; w_1(x) \; c_1 \; w_2(x) \; c_2
The ANSI-SQL standard defines four isolation levels by the “phenomena” that each level proscribes.
The “phenomena” are examples of histories that are not serializable — however, the standard defines these histories ambiguously. (Berenson et al. 1995) defines a “broad” and a “strict” interpretation of each ambiguous phenomenon. They point out a number of anomalous histories permitted by the strict interpretation that seem to go against the authors’ intentions and thus conclude that the ANSI standard authors must have intended the broad interpretation. The broad interpretation is defined as follows:
\begin{align*} P0 \; (\text{dirty write}): \quad &w_1(x) \ldots w_2(x) \ldots (c_1 \text{ or } a_1) \\ P1 \; (\text{dirty read}): \quad &w_1(x) \ldots r_2(x) \ldots (c_1 \text{ or } a_1) \\ P2 \; (\text{nonrepeatable read}): \quad &r_1(x) \ldots w_2(x) \ldots (c_1 \text{ or } a_1) \\ P3 \; (\text{phantom read}): \quad &r_1(x \in P) \ldots w_2(x) \ldots (c_1 \text{ or } a_1) \end{align*}
Isolation Level | P0 (Dirty Write) | P1 (Dirty Read) | P2 (Nonrepeatable Read) | P3 (Phantom Read) |
---|---|---|---|---|
Read Uncommitted | Not Possible | Possible | Possible | Possible |
Read Committed | Not Possible | Not Possible | Possible | Possible |
Repeatable Read | Not Possible | Not Possible | Not Possible | Possible |
Serializable | Not Possible | Not Possible | Not Possible | Not Possible |
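As an illustrative exercise (purely hypothetical code, not part of any database), a history in the compact notation above can be scanned mechanically for the P1 (dirty read) pattern — a read of a row with a pending write by another, unfinished transaction:

```python
# Hypothetical checker for the P1 (dirty read) pattern: scan a history,
# given as a list of (op, txn, row) actions, for w1(x) ... r2(x) occurring
# before transaction 1 commits or aborts.

def exhibits_p1(history):
    """history: list of tuples like ('w', 1, 'x'), ('r', 2, 'x'), ('c', 1, None)."""
    pending_writes = set()   # (txn, row) pairs written by a not-yet-finished txn
    for op, txn, row in history:
        if op == 'w':
            pending_writes.add((txn, row))
        elif op == 'r':
            # reading a row that has a pending write by another transaction
            if any(t != txn and r == row for t, r in pending_writes):
                return True
        elif op in ('c', 'a'):
            pending_writes = {(t, r) for t, r in pending_writes if t != txn}
    return False

# w1(x) r2(x) c1 c2 exhibits P1; w1(x) c1 r2(x) c2 does not.
print(exhibits_p1([('w', 1, 'x'), ('r', 2, 'x'), ('c', 1, None), ('c', 2, None)]))  # True
print(exhibits_p1([('w', 1, 'x'), ('c', 1, None), ('r', 2, 'x'), ('c', 2, None)]))  # False
```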
These definitions were meant to be general, permitting different implementations of isolation. Unfortunately, they (accidentally) proscribe optimistic implementations of transaction isolation. For example, PostgreSQL permits P1’s history but the reader does not observe a dirty value because PostgreSQL provides transaction isolation through multiversion-concurrency-control. Thus, the reader observes the row value that exists prior to the writer’s update.
An additional problem with this description of isolation is that merely stating some non-serializable histories that are proscribed at each level doesn’t give a very good understanding of what can go wrong.
Our current notion of a history cannot model multiversion histories because we have no way of expressing a row’s version. To remedy this we follow (Adya, Liskov, and O’Neil 2000) by augmenting our histories with a version order, \ll, a total order on the committed versions of each row. A row version is denoted x_i, where x indicates the row and i indicates the version. Updating a row is then modeled as installing its successor version: if some row version x_i exists and a transaction modifies that row, we model this as installing the successor version x_j, and we indicate that x_j is a successor of x_i with the notation x_i \ll x_j. The version order of each row x contains exactly one initial element, x_{\text{init}}, and at most one committed dead version, x_{\text{dead}}. The initial or “unborn” version conceptually exists for every possible row. We model inserts as installing the successor version of some unique unborn version. Similarly, we model deletion of a row x as installing the successor version x_{\text{dead}}. These “unborn” and “dead” versions are not visible (i.e. they cannot be observed by a read), but they simplify some concepts described later.
In the introductory example we saw that some interactions between concurrent transactions imply that one precedes the other. We can visualize these constraints as a serialization graph, that is, a directed graph where the nodes are transactions and the edges are constraints. An edge from T_i to T_j then indicates that in any equivalent serial execution of T_i and T_j, T_i must precede T_j. We say two histories are equivalent if they have equivalent serialization graphs. Thus, a history is serializable if and only if it’s equivalent to some nonconcurrent history, that is, its serialization graph is acyclic. Indeed, a cycle indicates a contradiction (e.g. T_i must precede T_j and T_j must precede T_i). We follow the model of transaction dependencies described in (Adya, Liskov, and O’Neil 2000).
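To make the acyclicity test concrete, here is a small illustrative sketch (not PostgreSQL code) that models a serialization graph as adjacency sets and detects cycles with a depth-first search. The edge list corresponds to the introductory example, where T_1 and T_2 each read John’s balance before the other’s update:

```python
# Illustrative sketch: model a serialization graph as {node: {successors}}
# and detect cycles with a three-color depth-first search.

def has_cycle(edges):
    """Return True if the directed graph given as {node: {successors}} has a cycle."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {n: WHITE for n in edges}

    def visit(n):
        color[n] = GRAY
        for m in edges.get(n, set()):
            if color.get(m, WHITE) == GRAY:
                return True          # back edge: cycle found
            if color.get(m, WHITE) == WHITE and visit(m):
                return True
        color[n] = BLACK
        return False

    return any(visit(n) for n in edges if color[n] == WHITE)

# r1(x) r2(x) w1(y) w2(z) w1(x) c1 w2(x) c2 yields constraints in both
# directions: T1 read x before T2's write (T1 -> T2) and vice versa.
graph = {"T1": {"T2"}, "T2": {"T1"}}
print(has_cycle(graph))  # True: the cycle means the interleaving is not serializable
```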
Write–read dependencies occur when a transaction T_2 reads a row version installed by some other transaction T_1. Thus, if there exists an equivalent serial execution of the transactions then T_1 must precede T_2. This constraint is denoted T_1 \xrightarrow{wr} T_2\text{.}
Read–write dependencies occur when a transaction T_1 reads a row version x_i and some other transaction T_2 installs the immediate successor, x_j. Thus, if there exists an equivalent serial execution of the transactions, T_1 must precede T_2. This constraint is denoted T_1 \xrightarrow{rw} T_2.
When a transaction executes a read or write based on a predicate P (e.g. `select ... from ... where column > some_value`), the system selects a row version for each tuple in P’s relations. The set of selected versions is called the version set of this predicate-based operation and is denoted \text{Vset}(P).
The version set defines the state that is observed to evaluate a predicate P. The predicate P is applied to the versions in \text{Vset}(P) to determine which tuples satisfy P.
We say a transaction T_i changes the matches of a predicate-based read observing a set of matches M if T_i installs a version x_i such that x_h immediately precedes x_i in the version order and exactly one of x_h and x_i is in M, i.e. (x_h \in M \land x_i \notin M) \lor (x_h \notin M \land x_i \in M).
Predicate write–read dependencies occur when a transaction T_2 executes a predicate-based read for some predicate P and another transaction T_1 changes the matches of P by installing x_i where x_i \in \text{Vset}(P) or \exists x_j \in \text{Vset}(P) such that x_i \ll x_j. Thus, if there exists an equivalent serial execution of the transactions, T_1 must precede T_2 as T_2 observed the database state that results from T_1’s effects. This is denoted T_1 \xrightarrow{wr_{pred}} T_2.
Note that this includes the case where T_1 deletes a row x (i.e. installs the successor version x_{\text{dead}}) if the immediate predecessor of x_{\text{dead}} matched P.
Predicate read–write dependencies occur when a transaction T_1 performs a predicate-based read for some predicate P and another transaction T_2 installs a row version x_i where \exists x_h \in \text{Vset}(P) such that x_h \ll x_i and x_i changes the matches of P. Thus, in any equivalent serial execution of the transactions, T_1 must precede T_2 as it observed the database state that exists prior to T_2’s effects. This is denoted T_1 \xrightarrow{rw_{pred}} T_2.
Note that this includes the case where T_2 inserts a row (i.e. installs the successor version to x_{\text{init}}) if the installed version matches P.
We are now prepared to model histories as serialization graphs. Here are example histories for a few anomalies, along with the isolation level that proscribes each:

Dirty write, proscribed by read uncommitted:

w_1(x_1) \; w_2(y_1) \; w_1(y_2) \; w_2(x_2) \; c_1 \; c_2

Dirty read, proscribed by read committed:

w_1(x_1) \; w_2(y_1) \; r_1(y_1) \; r_2(x_1) \; c_1 \; c_2

Nonrepeatable read, proscribed by repeatable read:

r_1(x_1) \; w_2(x_2) \; c_2 \; r_1(x_2) \; c_1

Phantom read, proscribed by serializable:

r_1(P \text{ such that } x_\text{unborn} \in \text{Vset}(P)) \; w_2(x_1) \; c_2 \; r_1(x_1 \in P) \; c_1
Seeking to remove the unnecessary constraints in the ANSI standard, (Adya, Liskov, and O’Neil 2000) proposes the following phenomena and isolation levels that don’t proscribe optimistic and multi-version concurrency implementations:
A history H exhibits phenomenon G0 if its serialization graph contains a directed cycle consisting entirely of write-dependency edges.
A history H shows phenomenon G1a if it contains an aborted transaction T_1 and a committed transaction T_2 such that T_2 has read some object (maybe via a predicate) modified by T_1. Phenomenon G1a can be represented using the following history fragments:
\begin{align*} & w_1(x_i) \ldots r_2(x_i) \ldots & (\text{$a_1$ and $c_2$ in any order}) \\ & w_1(x_i) \ldots r_2(x_i \in P) \ldots & (\text{$a_1$ and $c_2$ in any order}) \\ \end{align*}
A history H shows phenomenon G1b if it contains a committed transaction T_2 that has read a version of object x (maybe via a predicate) written by transaction T_1 that was not T_1’s final modification of x. The following history fragments represent this phenomenon:
\begin{align*} & w_1(x_i) \ldots r_2(x_i) \ldots w_1(x_j) \ldots c_2 \\ & w_1(x_i) \ldots r_2(x_i \in P) \ldots w_1(x_j) \ldots c_2 \end{align*}
A history H exhibits phenomenon G1c if its serialization graph contains a directed cycle consisting entirely of ww, wr, and wr_\text{pred} edges. Note that disallowing G1c implies that G0 is also disallowed.
A history exhibits G2\text{-item} if its serialization graph contains a directed cycle having one or more item-antidependency (rw) edges.
Isolation Level | Phenomena disallowed | Informal description (T_i can commit only if:) |
---|---|---|
PL-1 | G0 | T_i’s writes are completely isolated from the writes of other transactions |
PL-2 | G1 | T_i has only read the updates of transactions that have committed by the time T_i commits |
PL-2.99 | G1, G2\text{-item} | T_i is completely isolated from other transactions with respect to data items and has PL-2 guarantees for predicate-based reads |
PL-3 | G1, G2 | T_i is completely isolated from other transactions, i.e., all operations of T_i are before or after all operations of any other transaction |
PostgreSQL’s transaction isolation documentation follows the terminology of the 1992 ANSI SQL specification. Note that PostgreSQL differs in that it proscribes dirty reads at all isolation levels, and proscribes phantom reads at repeatable read.
Isolation Level | Dirty Read | Nonrepeatable Read | Phantom Read | Serialization Anomaly |
---|---|---|---|---|
Read Uncommitted | Not Possible | Possible | Possible | Possible |
Read Committed | Not Possible | Possible | Possible | Possible |
Repeatable Read | Not Possible | Not Possible | Not Possible | Possible |
Serializable | Not Possible | Not Possible | Not Possible | Not Possible |
PostgreSQL implements isolation through multiversion-concurrency-control (mvcc). When an update occurs, a new row version is installed rather than overwriting the existing version. Each statement is executed in the context of a snapshot that determines which row versions are visible. There can be at most one visible version of each row per snapshot.
A simplified sketch of the implementation is as follows: When a transaction begins it acquires a unique transaction id. A snapshot contains sufficient information to determine if a given transaction id was committed at the time the snapshot was taken. Row versions are tagged with the transaction id that created them (`xmin`) and, if applicable, the transaction id that deleted them (`xmax`). Thus, a transaction’s visible row version (if it exists) is the unique row version whose `xmin` was committed prior to the snapshot and whose `xmax` either does not exist or was not committed at the time of the snapshot.
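As a toy model of this visibility rule (the names `RowVersion` and `visible` are hypothetical, and PostgreSQL’s real checks additionally handle in-progress transactions, subtransactions, and hint bits):

```python
# Toy model of MVCC visibility: a snapshot is the set of transaction ids
# known to be committed when it was taken; a row version carries xmin
# (creating transaction) and optionally xmax (deleting transaction).

from dataclasses import dataclass
from typing import Optional

@dataclass
class RowVersion:
    value: str
    xmin: int                   # txid that created this version
    xmax: Optional[int] = None  # txid that deleted it, if any

def visible(version, committed_at_snapshot):
    """A version is visible iff its creator committed before the snapshot and
    it was not deleted by a transaction committed before the snapshot."""
    created = version.xmin in committed_at_snapshot
    deleted = version.xmax is not None and version.xmax in committed_at_snapshot
    return created and not deleted

# txid 10 inserted the row; txid 20 later updated it (delete old + insert new).
versions = [RowVersion("old", xmin=10, xmax=20), RowVersion("new", xmin=20)]

snapshot_before = {10}       # taken before txid 20 committed
snapshot_after = {10, 20}    # taken after txid 20 committed

print([v.value for v in versions if visible(v, snapshot_before)])  # ['old']
print([v.value for v in versions if visible(v, snapshot_after)])   # ['new']
```

Note that for each snapshot exactly one of the two versions is visible, matching the at-most-one-visible-version property described above.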
This series is a great resource if you want to understand PostgreSQL’s mvcc implementation in more detail.
To proscribe G0, PostgreSQL employs the traditional solution of long write locks[^1]. Indeed, this ensures that transactions are serialized with respect to their writes — i.e. for any two concurrent transactions T_1, T_2, all of T_1’s writes precede all of T_2’s writes or all of T_2’s writes precede all of T_1’s writes. Thus, cycles comprised entirely of ww dependencies cannot occur. Note that serializing writes ensures that row versions do not overlap, thus ensuring that for all snapshots, at most one row version is visible.
To proscribe G1, PostgreSQL does not use the traditional solution of short-term read locks; instead it relies upon mvcc. At read committed, a new snapshot is created for every statement — thus, only row versions committed prior to the statement are visible. This prevents aborted and intermediate reads, and serializes transactions with respect to their wr and wr_\text{pred} dependencies. Together with long write locks, this proscribes circular information flow.
Thus, cycles comprised entirely of ww, wr, and wr_\text{pred} dependencies are proscribed at read committed. However, any transaction dependency described above may occur between concurrent transactions. Some common anomalies at read committed are:
Read skew (T_1 observes an inconsistent view of x and y; nonrepeatable read is a degenerate form of read skew):

r_1(x_0) \; w_2(x_1) \; w_2(y_1) \; c_2 \; r_1(y_1) \; c_1

Lost update (T_1 reads x, T_2 updates x, then T_1, based on its earlier read value, updates x and commits, overwriting T_2’s update):

r_1(x_0) \; w_2(x_1) \; c_2 \; w_1(x_2) \; c_1
Note also that serializable histories need not agree with the commit order. For example, consider the history:

r_1(y_0) \; w_2(y_1) \; c_2 \; w_1(x_1) \; c_1

This history is equivalent to the serial history T_1 ; T_2, since its only dependency is T_1 \xrightarrow{rw} T_2. Thus it is serializable, despite the fact that T_2 commits before T_1.
PostgreSQL uses snapshot isolation for its repeatable read implementation. With snapshot isolation a transaction T_1 only sees row versions that were committed prior to T_1’s active period — thus, no effects from concurrent transactions are visible.
Additionally, if concurrent transactions attempt to write to the same row then the first to commit wins and the other transactions at repeatable read or higher are aborted with a “could not serialize access due to concurrent update” error. This is known as the “first committer wins” rule.
The “first committer wins” rule prevents the lost update anomaly — indeed, we cannot have write–write dependencies between concurrent transactions (i.e. T_1 would throw an exception at w_1(x_2)).
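The rule can be sketched as a commit-time check (an illustrative model only; PostgreSQL actually blocks the second writer on the row-level lock and aborts it when the first writer commits):

```python
# Illustrative model of the "first committer wins" rule: a transaction may
# commit only if no concurrent, already-committed transaction wrote to a
# row that the committing transaction also wrote.

class Transaction:
    def __init__(self, name):
        self.name = name
        self.writes = set()   # rows this transaction has written
        self.committed = False

def try_commit(txn, concurrent_committed):
    """Commit txn unless a committed concurrent transaction's write set
    overlaps txn's write set, in which case raise a serialization error."""
    for other in concurrent_committed:
        if txn.writes & other.writes:
            raise RuntimeError(
                f"{txn.name}: could not serialize access due to concurrent update")
    txn.committed = True

t1, t2 = Transaction("T1"), Transaction("T2")
t1.writes.add("x")   # both concurrent transactions write row x
t2.writes.add("x")

try_commit(t1, concurrent_committed=[])        # T1 commits first and wins
try:
    try_commit(t2, concurrent_committed=[t1])  # T2 conflicts with T1's write
except RuntimeError as e:
    print(e)   # T2 is aborted with a serialization error
```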
Note that an application needs to be prepared to receive these “failed to serialize” exceptions and react accordingly, usually by retrying the transaction. This is a departure from read committed, although deadlock exceptions can occur at any level.
Snapshot isolation is both stronger and weaker than the broad interpretation of repeatable read — it is stronger because it proscribes phantom reads, but it is weaker because it does not proscribe G2-item anomalies. Thus, snapshot isolation does not meet the standard for repeatable read as defined by (Berenson et al. 1995).
Snapshot isolation prohibits ww, wr, and wr_\text{pred} dependencies between concurrent transactions. This eliminates the cycle in the “read skew” example given previously, as T_1 continues to read y_0 despite T_2 installing the successor of y_0, y_1.
Although ww, wr, and wr_\text{pred} dependencies are prohibited between concurrent transactions, no attempt is made to track rw and rw_\text{pred} dependencies. Consider the following anomaly examples that are permitted at repeatable read:
A rw cycle is known as “write skew”. To gain a better understanding of how this might occur consider this example from (Fekete et al. 2005):
Suppose x and y are two rows containing integers, and we have the constraint that x + y \geq 0. Assume that initially x_0 = 70 and y_0 = 80. Under Snapshot Isolation, T_1 reads x_0 and y_0, then subtracts 100 from x, assuming it is safe because the two data items add up to 150. T_2 concurrently reads x_0 and y_0, then subtracts 100 from y, assuming it is safe to do so for the same reason. Each update is acting consistently, but SI results in the following history:
r_1(x_0, 70) \; r_2(x_0, 70) \; r_1(y_0, 80) \; r_2(y_0, 80) \; w_1(x_1, -30) \; c_1 \; w_2(y_2, -20) \; c_2
The final committed state violates the constraint that x + y \geq 0. This violation is not detected by the “first committer wins” rule since the transactions update separate rows, each under the assumption that the other row remained stable.
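This history can be replayed in a toy snapshot-isolation model (illustrative only): each transaction validates the invariant against its own snapshot, and because the write sets ({x} and {y}) are disjoint, the first-committer-wins check raises no error:

```python
# Replay of the write-skew example: both transactions read x=70, y=80 from
# their snapshots, each checks the invariant x + y >= 0 against its snapshot,
# and each updates a *different* row, so first-committer-wins sees no conflict.

db = {"x": 70, "y": 80}

snapshot_t1 = dict(db)  # both snapshots taken before either write commits
snapshot_t2 = dict(db)

# T1: withdraw 100 from x, judged safe against its snapshot (70 + 80 - 100 >= 0).
assert snapshot_t1["x"] + snapshot_t1["y"] - 100 >= 0
db["x"] = snapshot_t1["x"] - 100   # T1 commits

# T2: withdraw 100 from y, judged safe against its own (now stale) snapshot.
assert snapshot_t2["x"] + snapshot_t2["y"] - 100 >= 0
db["y"] = snapshot_t2["y"] - 100   # T2 commits; write sets {x} and {y} are disjoint

print(db)                      # {'x': -30, 'y': -20}
print(db["x"] + db["y"] >= 0)  # False: the invariant is violated
```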
Another example from (Fekete et al. 2005) yields some interesting observations. Consider the following history:

r_1(y_0) \; w_2(y_1) \; c_2 \; r_3(y_1) \; w_1(z_1) \; c_1 \; c_3

This history is serializable, equivalent to T_1; T_2; T_3, but if T_3 also reads an additional row, z:

r_1(y_0) \; w_2(y_1) \; c_2 \; r_3(y_1) \; r_3(z_0) \; c_3 \; w_1(z_1) \; c_1

then a cycle is formed and the history is no longer serializable. There are two notable things about this anomaly:
T_1 and T_2 are already committed when the cycle is formed, so it is necessary to continue tracking rw dependencies of transactions even after they commit. Indeed, rw dependencies must be tracked until all concurrent transactions have committed. This is one reason that long-running serializable transactions are expensive.
This cycle is formed by a read-only transaction: T_3. So, read-only transactions can cause anomalies under snapshot isolation. In this example, T_3 observed T_2’s changes, so it must succeed T_2, but it also observed z_0, so it must precede T_1. The constraint T_1 \xrightarrow{rw} T_2 contradicts the constraint T_2 \xrightarrow{wr} T_3 \xrightarrow{rw} T_1.
The important insight to take away here is that snapshot isolation prohibits ww, wr, and wr_\text{pred} dependencies between concurrent transactions, but permits rw and rw_\text{pred}.
Serializable prohibits all serialization anomalies. This affords the programmer a unique and valuable guarantee:
If every transaction in a set preserves consistency, then any history over that set preserves consistency.
Serializable is implemented in PostgreSQL following the algorithm described in (Cahill, Röhm, and Fekete 2008) which employs the proof from (Fekete et al. 2005) that any anomaly occurring under Snapshot Isolation contains a node that has an rw edge in and an rw edge out. This node is called a pivot transaction.
This enables the heuristic used by PostgreSQL to ensure serializability: for every transaction, PostgreSQL tracks whether it has a read–write conflict in and a read–write conflict out. If this “dangerous structure” is found, a transaction is aborted to prevent the pivot from being formed. Note that this heuristic allows for false positives — it will also abort transactions in serializable histories that happen to contain this “dangerous structure”. Thus, at serializable it is common to see errors stating that a transaction was “canceled on identification as a pivot”, meaning it had an rw conflict in and an rw conflict out.
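A minimal sketch of the pivot check (hypothetical and heavily simplified; PostgreSQL additionally considers commit ordering and other refinements to reduce false positives):

```python
# Minimal sketch of the SSI "dangerous structure" check: track, per
# transaction, whether it has an rw-antidependency edge in and one out.
# A transaction with both is a potential pivot and must be aborted.

rw_in = {}    # txn -> True if some rw edge points at it
rw_out = {}   # txn -> True if some rw edge leaves it

def record_rw(reader, writer):
    """reader read a version that writer replaced: reader --rw--> writer.
    Return the transaction to abort if a pivot would form, else None."""
    rw_out[reader] = True
    rw_in[writer] = True
    for txn in (reader, writer):
        if rw_in.get(txn) and rw_out.get(txn):
            return txn   # pivot: rw conflict in and rw conflict out
    return None

# T1 --rw--> T2 forms no pivot yet; adding T3 --rw--> T1 makes T1 the pivot.
print(record_rw("T1", "T2"))  # None
print(record_rw("T3", "T1"))  # T1
```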
In order to track rw conflicts, PostgreSQL needs to track which rows are read by serializable transactions. When a serializable transaction reads an object it takes a “predicate lock”[^2] on that object. These “predicate locks” are presented in the `pg_locks` view, tagged with the mode `SIReadLock`.
Read–write dependencies are monitored not only for rows but also for index pages; this is how predicate read–write dependencies are tracked. For example, if in processing the query `select ... where id > 100` a b-tree index on `id` is scanned, then an `SIReadLock` is acquired on each index page read. If a concurrent transaction inserts a row with `id = 101` then it will write to an index page on which a predicate lock is held, incurring a predicate read–write dependency between the two transactions.
Consequently, if the same set of index pages is being read and written concurrently then serialization conflicts may occur at serializable, even if each transaction is updating a distinct row.
If a number of transactions are concurrently modifying the same set of index pages and the application is reacting to serialization failures by retrying immediately, this can result in thrashing as the set of transactions keeps retrying and failing to serialize.
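The page-granularity effect can be illustrated with a toy model (the page size and id-to-page mapping here are hypothetical; real b-tree page boundaries depend on key size and page splits):

```python
# Toy model of page-granularity predicate locks: row ids map to index pages,
# a reader SIRead-locks the pages its scan touches, and a concurrent writer
# conflicts if it inserts into any locked page -- even when the reader never
# observed the writer's row.

PAGE_SIZE = 100  # assume ids 0-99 land on page 0, ids 100-199 on page 1, ...

def page_of(row_id):
    return row_id // PAGE_SIZE

def scan_pages(low, high):
    """Pages an index scan over ids in (low, high] would read and lock."""
    return {page_of(i) for i in range(low + 1, high + 1)}

locked = scan_pages(100, 150)        # e.g. select ... where id > 100 and id <= 150
print(page_of(101) in locked)        # True: a concurrent insert of id = 101
                                     # writes a locked page, flagging an
                                     # rw_pred dependency
print(page_of(250) in locked)        # False: a row on a different page
                                     # does not conflict with this scan
```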
Note also that serializable transactions are only serializable with respect to other serializable transactions, as rw dependencies are not tracked for transactions running at repeatable read or lower.
PostgreSQL allows for explicit locking where reads can be instructed to take a lock that blocks writes (`select ... for share`), and one might wonder what level of isolation is achievable by augmenting read committed with this approach. It is possible to meet the ANSI SQL repeatable read standard. Indeed, this is the traditional approach (long write locks, long data-item read locks). However, unlike PostgreSQL’s repeatable read implementation based on snapshot isolation, phantom reads are not proscribed. In effect, this serializes transactions with respect to ww, wr, and rw dependencies, but fails to serialize them with respect to wr_\text{pred} and rw_\text{pred} dependencies.