Post-commit hooks in MAAS

How to not change the world from inside a transaction

Near the end of Transactions in MAAS I mentioned post-commit hooks. These are a mechanism that MAAS uses for making changes to external systems once a database transaction has been committed.

Explain!

Database transactions in any piece of software are not guaranteed to be committed, be it because of bugs, errors, choice, or because the database rejects the transaction due to a serialisation conflict.

This makes it less than safe to change the outside world from within a transaction. Transactions can be rolled back, but things Out There often do not have that property.

Suppose I make an EatPizza RPC call to a pizza-eating robot’s web API within a transaction. That transaction later fails and is retried n times by MAAS. The robot would rupture and short-circuit from the n + 1 pizzas in its belly.

MAAS has to interact with many things outside the database that do not have database-like properties. When MAAS needs to perform such an interaction, it arranges for it to happen in a post-commit hook, which runs once the current transaction has been fully committed by PostgreSQL.
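To make the pizza problem concrete, here is a minimal sketch in plain Python. Nothing in it is MAAS code: `Robot`, `SerialisationConflict` and `run_with_retries` are hypothetical names, and the "transaction" is just a retry loop that pretends the first few commits fail. It only contrasts calling out from inside the transaction body with queueing the call until after a successful commit.

```python
import itertools


class SerialisationConflict(Exception):
    """Stands in for the database rejecting a transaction."""


class Robot:
    """Hypothetical external system; its state cannot be rolled back."""

    def __init__(self):
        self.pizzas_eaten = 0

    def eat_pizza(self):
        self.pizzas_eaten += 1


def run_with_retries(body, failures):
    """Run body(hooks) as a pretend transaction.

    The first `failures` attempts fail at commit time and are retried.
    Hooks queued by a failed attempt are discarded; hooks queued by the
    successful attempt run only after the simulated commit.
    Returns the number of attempts made.
    """
    for attempt in itertools.count():
        hooks = []
        try:
            body(hooks)
            if attempt < failures:
                raise SerialisationConflict()  # simulated commit failure
        except SerialisationConflict:
            continue  # roll back: drop this attempt's hooks and retry
        for hook in hooks:  # committed: now touch the outside world
            hook()
        return attempt + 1


# Unsafe: the RPC happens inside the transaction body, once per attempt.
greedy = Robot()
run_with_retries(lambda hooks: greedy.eat_pizza(), failures=3)

# Safer: the body only registers the call; it runs once, after commit.
careful = Robot()
run_with_retries(lambda hooks: hooks.append(careful.eat_pizza), failures=3)

print(greedy.pizzas_eaten, careful.pizzas_eaten)  # 4 1
```

With three failed attempts the first robot eats n + 1 = 4 pizzas, while the second eats exactly one.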

Detour into two-phase commit land

PostgreSQL supports two-phase commits through PREPARE TRANSACTION.

After this command, the transaction is no longer associated with the current session; instead, its state is fully stored on disk, and there is a very high probability that it can be committed successfully, …

This sounds ideal for implementing post-commit hooks. The problem is that those hooks can take an unbounded time to execute, during which that dormant transaction is hanging around, causing serialisation failures in other transactions.

Behaviour

This is roughly how MAAS’s post-commit hook mechanism works:

post_commit() will return a Deferred which will fire once the transaction commits.
If the transaction is aborted, all hooks will be cancelled.
If a savepoint is rolled back, the hooks registered within that savepoint will be cancelled.
A failure in a hook will result in all subsequent hooks being cancelled.
Hooks are tracked per-thread, the same as Django tracks database connections.
Hooks are fired in the order they are registered: first-in, first-out.
All hooks are called in the Twisted reactor and not in the originating thread.
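The rules above can be modelled with a small toy registry. This is a sketch under loud assumptions, not MAAS's implementation: `PostCommitHooks` and its method names are invented for illustration, hooks are plain callables with an optional `cancel` callback rather than Twisted Deferreds, and the per-thread tracking and reactor dispatch are left out entirely.

```python
class PostCommitHooks:
    """Toy model of the ordering and cancellation rules (hypothetical)."""

    def __init__(self):
        self._hooks = []       # (run, cancel) pairs, in registration order
        self._savepoints = []  # stack of indexes into self._hooks

    def add(self, run, cancel=lambda: None):
        """Register a hook; `cancel` is called if the hook is dropped."""
        self._hooks.append((run, cancel))

    def savepoint_enter(self):
        # Remember how many hooks existed before the savepoint began.
        self._savepoints.append(len(self._hooks))

    def savepoint_commit(self):
        # Hooks registered within the savepoint are kept.
        self._savepoints.pop()

    def savepoint_rollback(self):
        # Cancel only the hooks registered within the savepoint.
        start = self._savepoints.pop()
        cancelled, self._hooks = self._hooks[start:], self._hooks[:start]
        for _, cancel in cancelled:
            cancel()

    def abort(self):
        # Transaction aborted: every hook is cancelled.
        hooks, self._hooks = self._hooks, []
        self._savepoints.clear()
        for _, cancel in hooks:
            cancel()

    def commit(self):
        # Transaction committed: fire hooks first-in, first-out.
        hooks, self._hooks = self._hooks, []
        self._savepoints.clear()
        for index, (run, _) in enumerate(hooks):
            try:
                run()
            except Exception:
                # A failing hook cancels all subsequent hooks.
                for _, cancel in hooks[index + 1:]:
                    cancel()
                raise


log = []
hooks = PostCommitHooks()
hooks.add(lambda: log.append("first"))
hooks.savepoint_enter()
hooks.add(lambda: log.append("never"), cancel=lambda: log.append("cancelled"))
hooks.savepoint_rollback()
hooks.add(lambda: log.append("second"))
hooks.commit()
print(log)  # ['cancelled', 'first', 'second']
```

Note that the `cancel` callback here receives no reason for the cancellation, so a hook cannot tell a rollback from an earlier hook's failure.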

Why the reactor?

That last point sounds particularly weird. There are two reasons for it:

Post-commit hooks are most often used to make RPC calls. RPC in MAAS is all Twisted, so it’s convenient to already be in the reactor.
Database access is forbidden in the reactor, and attempting it will result in an exception (this is something we added for MAAS). This helps to prevent inadvertent post-commit scheduling of transactional code.
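The second reason can be illustrated with a thread-local guard. This is only a sketch of the idea and assumes nothing about how MAAS actually enforces the rule: `DatabaseAccessInReactor`, `mark_reactor_thread` and `get_connection` are hypothetical names, and the "connection" is a placeholder object rather than a real database handle.

```python
import threading


class DatabaseAccessInReactor(Exception):
    """Raised when database access is attempted from the reactor thread."""


_state = threading.local()


def mark_reactor_thread():
    """Flag the calling thread as the one running the (pretend) reactor."""
    _state.in_reactor = True


def get_connection():
    """Hand out a placeholder connection, but never to the reactor thread."""
    if getattr(_state, "in_reactor", False):
        raise DatabaseAccessInReactor(
            "Database access is forbidden in the reactor; defer to a thread.")
    return object()  # placeholder for a real database connection


results = {}


def worker():
    # Ordinary worker threads are allowed to use the database.
    results["worker"] = get_connection() is not None


mark_reactor_thread()  # pretend the main thread is running the reactor
try:
    get_connection()
except DatabaseAccessInReactor:
    results["reactor"] = "refused"

thread = threading.Thread(target=worker)
thread.start()
thread.join()

print(results)  # {'reactor': 'refused', 'worker': True}
```

A hook that runs in the reactor and touches the database by mistake fails loudly, rather than quietly running transactional code after the commit.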

Issues

1. Hooks, when cancelled, don’t know if they are being cancelled because the transaction failed to commit or because a previous hook failed.
2. If a post-commit hook must make a transactional change, it must defer it to a thread. This means a single logical request can briefly hold 2 database connections. Connections are a limited resource so this harms throughput.

To my knowledge neither of these has harmed us in practice. However, they are deficiencies and it’s worth appreciating them when working on MAAS’s code.

We could address #1 by propagating errors from failing hooks into subsequent hooks instead of cancelling. We haven’t done this yet to avoid complex and fragile dependencies arising between hooks, and to avoid over-logging.

#2 is harder to solve. It’s a symptom of the inside-out logic that MAAS has evolved with…

Inside-out logic, or: the pitfalls of building a distributed system with a web framework

MAAS is distributed: the region talks to the clusters, and vice-versa. The mechanism by which it does this looks like RPC but underneath it’s message-based and asynchronous. Communications between MAAS’s components thus have a low overhead: messages are small, are multiplexed over a minimal number of connections, and waiting for replies ties up few resources beyond a callback reference. Talk is cheap :)

In contrast, every web-service request coming into MAAS is handled by Django; Twisted hands over to Django via WSGI early on. This means that each request takes up a thread and a database connection.

This is fine when the request can be immediately serviced and a response sent, but MAAS is distributed and sometimes must wait for the information it needs to service a request. Waiting gets expensive when holding a database connection for an indeterminate time.

Since MAAS’s inception it has used Django. Much of MAAS’s logic has become wedded to Django’s way of doing things, which is entirely request / thread / database centric. Unfortunately this warps our thinking and limits how MAAS can scale.

Centrality of the database

A particular problem is the centrality of the database. In many web applications the database is the ultimate source of truth, but in a distributed system it can be a murkier mix of user-supplied and externally-observed information. The latter is like a cache, in that you must be prepared for it to be stale the moment you use it: reality is the ultimate source of truth.

MAAS is both distributed and has to deal with a lot of reality.

Concurrency is artificially limited

The lamprey-like coupling of a database connection to a request and thread limits concurrency within MAAS’s web-service to the number of connections we can open, regardless of whether those connections are in use or even needed.

Putting the outside outside again

MAAS needs to shrug off the database-centric approach. Its driving logic needs to think of messaging, distributed data, delays, failure and partial failure, retries, and recovery. The database will be one of the many components MAAS communicates with, no longer the ever-present companion to all web-service operations within MAAS¹.

Post-commit hooks are the way we get sane behaviour out of MAAS while we make that transition. They allow us to safely embed “do something external” code within existing database-centric code: inside-out logic.

Post-commit hooks are an imperfect stop-gap and, in time, will be removed from MAAS.

¹ The WebSocket service created for the new UI exhibits some of this new approach. It uses Django’s ORM but not its request/response framework. Most of the handlers do end up running in a thread, in a transaction, but this is still a significant improvement.