The way to run tests quickly (in MAAS)

Go parallel and deal with the mess for big wins

This post explains how I went about making MAAS’s test suite run much more rapidly. A previous post lays out the ground and explains why it was necessary.

The story is: split up the test suite and run those parts concurrently. As always, the Devil and the interesting bits are in the detail.

How MAAS tests itself

MAAS has several distinct test suites and test scripts, each running with a different environment (Python packages installed, configuration loaded, etc.). Historically this has been of more importance than it is today, so let’s consider them in just two groups:

Test suites/scripts that use a database:
- bin/test.region
Test suites/scripts that do not use a database:
- bin/test.cli
- bin/test.rack
- bin/test.testing
- bin/test.js

When someone refers to the MAAS test suite they mean the sum of the above. The canonical definition is in the project’s Makefile; see the test target.

With the exception of test.js these all run Nose. The region tests, run via test.region, also use the django-nose plug-in. This is okay but makes test.region feel different from the other test scripts so I was keen to remove this dependency.

Anyway, why consider them separately? The database tests required a much greater effort to get working in parallel than the non-database tests. Most of the rest of this post will be about them.

Following my Nose gets me nowhere

Nose will split up a test suite and run it in parallel with the --processes option but in practice this did not work for MAAS. Database tests broke hopelessly: concurrent tests were trying to use the same database and conflicting with one another. The tests run by test.rack failed in large numbers too, with errors about bad file descriptors and suchlike. I played around with this for a while until I was reasonably confident that I should instead invest my time elsewhere.

Creating a supervisor

I did look around for something off-the-shelf to do what was needed but came up short. Somewhat reluctantly I chose to create a bespoke supervisor to run the suites concurrently. At a minimum this would:

1. Spawn test scripts with the correct arguments.
2. Collect test results from each suite and report them to the user.
3. Do #1 and #2 in parallel then exit zero on success or non-zero on failure.

In retrospect this worked out well. The code to do this was not long or complex, and it suits the needs of MAAS well.

1. Spawning test scripts

Nose’s support for parallel testing uses multiprocessing and I suspect the problems I encountered with Nose were related to bad interactions between multiprocessing and… stuff in Django, Twisted, and MAAS itself. I planned instead to spawn a sub-process for each chunk of tests.

Now, multiprocessing does use sub-processes, but there’s a certain amount of magic going on behind the curtain. With persistence I might have been able to get everything working with multiprocessing, and thus with Nose, but I chose to cut my losses early on: I could see that decoupling the runner from the supervisor would be a better architectural choice — multiprocessing being a fairly close coupling — and I had in mind a good way to do this.

Spawning sub-processes is prosaic, and that’s the appeal here, to keep it simple. However, this choice alone was enough to get the non-database test suites behaving nicely together, or mostly so, even when split into smaller chunks.
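As a rough illustration of how little is involved, here is a minimal sketch using only the standard library. It is not the MAAS code: `spawn_chunk`, `run_chunks` and `DEMO_SCRIPT` are hypothetical names, and the demo script is a stand-in for a real test script that simply exits with the code it is given.

```python
import subprocess
import sys

# Stand-in for a real test script: exits with the code given as its argument.
DEMO_SCRIPT = [sys.executable, "-c", "import sys; sys.exit(int(sys.argv[1]))"]


def spawn_chunk(script, selectors):
    """Start one test script in its own process for one chunk of tests.

    `script` is the command as a list and `selectors` are the (hypothetical)
    arguments that name the tests belonging to this chunk.
    """
    return subprocess.Popen(
        [*script, *selectors],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
    )


def run_chunks(script, chunks):
    """Run every chunk in a separate process and return the exit codes."""
    # Start everything first so the processes run side by side.
    processes = [spawn_chunk(script, chunk) for chunk in chunks]
    codes = []
    for process in processes:
        process.communicate()  # Read all output, then wait for exit.
        codes.append(process.returncode)
    return codes
```

Each child gets a fresh interpreter, so nothing is shared with the supervisor except what is passed explicitly on the command line or through file descriptors.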

2. Streaming results

This one is, ostensibly, simple: use subunit.

There’s even a Nose plug-in: nose-subunit. Sadly it didn’t work.

Writing a new one is easy though because Nose’s plug-in API is fairly good. Take away the boilerplate and it reduces to:

class Subunit(nose.plugins.base.Plugin):
    def prepareTestResult(self, result):
        return subunit.TestProtocolClient(sys.stdout.buffer)

See the Subunit class in src/maastesting/noseplug.py for the whole thing.

Piggy-backing on stdout means that Nose or any test can corrupt the result stream by calling print() at the wrong moment, so I also added an option to use a different file descriptor.

Receiving the results in the supervisor was easy too thanks to subunit, but I had to work around a small bug:

server = subunit.TestProtocolServer(result, sys.stdout.buffer)
# Don't use TestProtocolServer.readFrom because it blocks until
# the stream is complete (it uses readlines).
for line in preader:
    server.lineReceived(line)
server.lostConnection()

preader is the read end of a pipe; the write end is passed into the subprocess in which the test script is exec’d.
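The pipe plumbing itself can be sketched with the standard library alone. This is an assumption-laden illustration rather than the supervisor's code: `CHILD` and `read_lines_from_child` are hypothetical, the child is a stand-in that writes two made-up lines, and `pass_fds` is only available on POSIX systems.

```python
import os
import subprocess
import sys

# Stand-in for a test script: writes two lines to the file descriptor
# whose number it receives as its first argument.
CHILD = (
    "import os, sys\n"
    "fd = int(sys.argv[1])\n"
    "os.write(fd, b'test: example\\nsuccess: example\\n')\n"
)


def read_lines_from_child():
    """Spawn a child holding the write end of a pipe; collect its lines."""
    read_fd, write_fd = os.pipe()
    process = subprocess.Popen(
        [sys.executable, "-c", CHILD, str(write_fd)],
        pass_fds=[write_fd],  # Keep the write end open in the child.
    )
    # Close our copy of the write end, otherwise the reader never sees EOF.
    os.close(write_fd)
    lines = []
    with os.fdopen(read_fd, "rb") as preader:
        for line in preader:  # Handles lines as they arrive.
            lines.append(line)
    process.wait()
    return lines
```

In the real supervisor each line would be handed to `server.lineReceived` instead of being appended to a list.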

3. Running concurrently

Here I used testtools’ ConcurrentTestSuite.

This runs each test in a suite in a separate thread. The supervisor constructs this suite and populates it with several stub tests, the number of which corresponds to the amount of concurrency desired, e.g. the number of CPU cores.

Each of these stub tests, running in its own thread, is not a test in the usual sense. In truth it’s a worker with a very simple run method:

def run(self, result):
    for test in iter(self.queue.get, None):
        if result.shouldStop:
            break
        else:
            test(result)

The queue it pulls from has been populated with callables. Each of these, when called, spawns a test script in a separate process and runs a small chunk of actual tests, all of which report into result.
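To see the worker pattern in isolation, here is a small sketch that uses plain threads in place of ConcurrentTestSuite. The `Result` class is a hypothetical stand-in for a TestResult, and putting one `None` sentinel on the queue per worker is an assumption about how the workers are told to stop.

```python
import queue
import threading


class Result:
    """Minimal stand-in for a TestResult: thread-safe and stoppable."""

    def __init__(self):
        self.shouldStop = False
        self.outcomes = []
        self._lock = threading.Lock()

    def record(self, outcome):
        with self._lock:
            self.outcomes.append(outcome)


def worker(tasks, result):
    # Same loop as the run method above: pull until the None sentinel.
    for task in iter(tasks.get, None):
        if result.shouldStop:
            break
        task(result)


def run_all(callables, concurrency):
    """Run every callable on a pool of `concurrency` worker threads."""
    tasks = queue.Queue()
    for task in callables:
        tasks.put(task)
    for _ in range(concurrency):
        tasks.put(None)  # One sentinel per worker.
    result = Result()
    threads = [
        threading.Thread(target=worker, args=(tasks, result))
        for _ in range(concurrency)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return result
```

Because the workers pull from a shared queue, a worker that finishes a quick chunk simply takes the next one, which evens out some of the variation in chunk run times.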

It’s hard to predict ahead of time how long each chunk of tests will take. Make the chunks too large and towards the end of a run some or most of the workers will sit idle. Make the chunks too small and the overhead of set-up and tear-down will swamp whatever gains are made.

I didn’t experiment much to find the best balance, but the code is amenable to this kind of experimentation. Right now, given a concurrency of 4, say, each test script would also be split into 4 chunks: 5 test scripts would mean 20 chunks in total. It’s imperfect, but it has worked well enough so far.
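The arithmetic can be sketched as follows. The round-robin split and the names `split_into_chunks` and `plan` are assumptions for illustration; the real splitting logic may well differ.

```python
def split_into_chunks(items, count):
    """Deal `items` round-robin into `count` chunks, dropping empty ones."""
    chunks = [items[i::count] for i in range(count)]
    return [chunk for chunk in chunks if chunk]


def plan(scripts, concurrency):
    """Pair each script with each of its chunks of selectors.

    `scripts` maps a script name to the list of selectors it should run.
    """
    return [
        (script, chunk)
        for script, selectors in scripts.items()
        for chunk in split_into_chunks(selectors, concurrency)
    ]
```

With 5 scripts that each have at least 4 selectors, `plan(scripts, 4)` yields the 20 chunks mentioned above; changing the second argument is all it takes to experiment with coarser or finer chunks.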

The ConcurrentTestSuite also ensures that results are recorded in a thread-safe way into result, a TestResult instance. We use that at the end to know if there have been any failures and thus set the exit code.
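Turning the final result into an exit code takes one line. The sketch below uses the standard library's `unittest.TestResult` as a stand-in for whichever result class the supervisor actually uses; `exit_code` and `run_demo` are hypothetical helpers.

```python
import unittest


def exit_code(result):
    """Return 0 if every recorded test passed, otherwise 1."""
    return 0 if result.wasSuccessful() else 1


def run_demo(passing):
    """Run one trivial test that passes or fails on request."""

    def check():
        assert passing, "deliberate failure"

    result = unittest.TestResult()
    unittest.FunctionTestCase(check).run(result)
    return exit_code(result)
```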

See src/maastesting/parallel.py for the code.

Database tests still break

I anticipated this: I hadn’t done anything to prevent conflicts between concurrently running tests trying to use the same database.

I had the following in mind when thinking about a fix:

- Cease use of django-nose, as mentioned earlier.
- Stop using Django’s deprecated fixture-based testing mode.
- Avoid using Django’s new migration-based testing mode because it’s incredibly slow.
- Stop using Django as much as possible.

MAAS uses Django’s ORM heavily and that will remain so. The same goes for Django’s form machinery. However, the MAAS region has long since outgrown Django, and testing everything via a Django-centric test runner is both inappropriate and slow. It’s also different to how tests are run in other parts of MAAS, making maintenance harder.

I built a mechanism based on testresources and postgresfixture:

- Model a PostgreSQL cluster as a test resource.
- Model a pristine database within that cluster as a test resource. This is created by running database migrations, just as for a new production database.
- Model test databases within the same cluster as test resources. These are cloned from the pristine database as needed.

Tests that need to use the database can declare that they need a test database as a resource. The testresources library then arranges for all dependencies to be created — or recreated, or reused — before creating the test database and configuring Django to use it.

Tests that wrap themselves in a transaction and roll back at the end leave a clean database, which is reused for the next test. Tests that commit leave a dirty database, which is torn down.

Tests that don’t need a database don’t get a database, and no upfront work is done.
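The reuse rules above can be modelled with a toy that never touches PostgreSQL. Everything in it is hypothetical: `DatabasePool` and its methods are illustrative names, "databases" are just strings, and the real mechanism is driven by testresources rather than by explicit calls like these.

```python
import itertools


class DatabasePool:
    """Toy model of reusing clean test databases and discarding dirty ones."""

    def __init__(self):
        self._names = ("test_%d" % n for n in itertools.count(1))
        self._current = None
        self.created = []
        self.dropped = []

    def acquire(self):
        # Created lazily: a run with no database tests creates nothing.
        if self._current is None:
            self._current = next(self._names)  # Stands in for a clone.
            self.created.append(self._current)
        return self._current

    def release(self, dirty):
        # A rolled-back test leaves the database clean, so keep it.
        # A committed test leaves it dirty, so drop it.
        if dirty and self._current is not None:
            self.dropped.append(self._current)
            self._current = None
```

Two clean tests in a row share one database, while a test that commits forces the next test to get a fresh clone.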

I added an optimisation: instead of running migrations into an empty database, the pristine database is first loaded from an initial.sql script that brings it up to date, or nearly so, and migrations then run on top of that. I remembered this trick from Jeroen Vermeulen who had done it years earlier, most likely for Launchpad’s test suite, though neither of us can remember precisely. This initial state is captured from a previous run of migrations and is checked into the MAAS source tree so everyone benefits. For me this cuts the creation of the pristine database from >50 seconds to ~2.5s, a speedup of more than 20 times. Migrations are slow.

The correctness of this shared cluster and shared pristine database arrangement relies heavily on the correctness of postgresfixture. Early on in this work I found it had a serious locking bug. Fortunately I am also the maintainer of postgresfixture, so that was fixed very quickly :)

See src/maasserver/testing/resources.py for the code.

It works!

There were a few more bits to be done:

- Get Nose to use a testresources.OptimisingTestSuite. This reorders tests to minimise the total cost of set-up and tear-down of those test databases.
- Split up the test suites further (bin/test.rack completes much quicker than bin/test.region for example) to make better use of multiple cores. This alone is worth a post.

and a lot of finessing, but it worked!

Many tests were failing, though not in huge numbers, and it was already clear that the new runner gave a superb improvement in overall run time. The speedup was linear up to roughly 6 concurrent processes, i.e. at a concurrency of 6 I was seeing a run time about one sixth of that with no concurrency. Above that the benefits still accrued but less impressively: constant overheads like the cluster and pristine database lifecycle had become significant.
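A simple model shows why constant overheads eventually flatten the curve. It assumes, purely for illustration, that the test work divides evenly between processes and that the overhead is a fixed number of seconds; `run_time` and `speedup` are hypothetical helpers, not measurements from MAAS.

```python
def run_time(serial_work, overhead, concurrency):
    """Estimated wall-clock time: fixed overhead plus evenly divided work."""
    return overhead + serial_work / concurrency


def speedup(serial_work, overhead, concurrency):
    """How many times faster than a run with no concurrency."""
    baseline = run_time(serial_work, overhead, 1)
    return baseline / run_time(serial_work, overhead, concurrency)
```

With no overhead the speedup equals the concurrency. With any fixed overhead it falls short, and the shortfall grows as the per-process share of the work shrinks towards the size of the overhead.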

I set about fixing the failing tests and improving overall stability until it could run the full test suite to completion without failure 10 times in a row. On 3rd February 2017 the MAAS lander was switched over to bin/test.parallel and it has been in use ever since.

In truth we didn’t actually change the lander: we changed the test target in Makefile and landed that, so this new runner became the default for both the lander and for developers running make test too.

Since then…

I have since updated test.parallel to:

- Accept selectors so it can be used as a front-end for the individual test scripts. For example, to run all the model tests in parallel:
```
$ make bin/test.parallel
$ bin/test.parallel src/*/models
```
- Collect coverage information from the test processes it spawns. There are Makefile targets to combine this information and produce reports.
- Have configurable concurrency. Initially it was clamped at the number of CPU cores, minus 2.
- Emit not just human-readable results but also subunit and JUnit XML so it can be integrated with, say, Jenkins.

I have also fixed a number of spuriously but infrequently failing tests. Running in parallel exposed some bad assumptions and bad interactions in the tests. Fortunately fixing these tests often resulted in better application code too.

Something I’ve found useful is repurposing the parallel test runner to help reproduce these Heisenbug-like failures. By adding the following to the test case:

scenarios = [("#%d" % i, {}) for i in range(1000)]

then using test.parallel with a selector:

$ bin/test.parallel --subprocess-per-core \
>   src/path/to/test.py:TestClass.test_spurious

you can very quickly run a large number of iterations — 1000 here — of the same test, raising the chances of triggering the failure. This is especially interesting when the failure only arises when running concurrently with other tests.
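For clarity, this is roughly where that line sits in a test case. The class and test body are placeholders, and it is an assumption that a scenario-aware loader such as testscenarios is what multiplies the test; plain unittest ignores the attribute.

```python
import unittest


class TestSpurious(unittest.TestCase):
    # 1000 empty scenarios, named "#0" to "#999". A scenario-aware loader
    # (assumed here to be testscenarios) would run each test once per entry.
    scenarios = [("#%d" % i, {}) for i in range(1000)]

    def test_spurious(self):
        self.assertTrue(True)  # Placeholder for the flaky test body.
```

The empty dict means each scenario sets no attributes on the test, so every iteration is identical apart from its name.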

Last words

This work was done over the course of several months as an accumulation of many smaller changes. Meanwhile I was concentrating the majority of my time on new MAAS features. I did the work slowly. Each incremental change was valuable in its own right. I didn’t ask permission to do it but neither did I hide it, so I must thank my immediate manager for letting me complete this.