February 23rd, 2017, 18:43 UTC

MAAS Python Testing

The importance of running tests quickly: Here, MAAS, and everywhere

MAAS began on 16th January 2012 and has been developed continually ever since, accumulating many, many unit tests along the way. Since almost the beginning we’ve had a landing robot that runs those unit tests before merging a new branch into trunk.

At the time of writing it runs 14337 tests, and that number grows daily. Until recently the landing robot would take over an hour to test and merge each branch.

This is too slow, for reasons I’ll explain below, and it is where my journey to fix it began.

My laptop — and, until recently, sole development machine — was slower than the landing robot. My new and much better specified desktop machine still takes more than 30 minutes to get through the test suite.

Hence landing a change takes a long time. If the robot hits a test failure I am probably immersed in the next piece of work before I find out. But running the test suite locally can take an age too, posing the same problem, and raising the question: why bother running the whole test suite locally? Just let the robot do it, hope for the best, and pick up any pieces whenever there’s time.

What does this all add up to?

  • Long queues, because each branch takes so long to process. When many developers have something to land, each must wait in line for a slot with the lander.

  • Not running the full test suite locally means that lurking failures will instead be found by the landing robot, at which point I will need to fix my branch and submit at least once more. This means longer queues and a longer overall time to land for each branch.

  • Long queues mean the delta between my branch and trunk may have changed by the time it reaches the front of the queue. My branch may now conflict, or may no longer exhibit the tested behaviour because of a nearby change. I will need to fix my branch and resubmit, contributing to longer queues and a longer overall time to land.

  • A branch may be rejected by the landing robot for reasons outside of my control, such as conflicts (both textual conflicts in the source and logical conflicts in the application) or spurious failures.

    This kind of failure cannot be prevented by running the full test suite locally, but the penalty imposed — one slot of lander time — is carried by everyone with a branch in the queue. The larger that slot the larger the cost to the team.

  • Long waits for code to land mean lower team velocity. We all have to wait longer before being able to build on each other’s work.

You can see that each of these feeds into the others: it’s a vicious spiral. This situation can develop very gradually, over years, making it one of the more insidious problems the MAAS team had to face. I don’t know how to measure or estimate the overall effect, but I hazard that an increase in test run time results in a development penalty many multiples larger.
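To make the shape of that spiral a little more concrete, here is a toy model in Python. It is only a sketch, not a measurement: the function name, the spurious-failure rate, the "drift" rate at which trunk changes invalidate a waiting branch, and the queue length are all made-up assumptions, not numbers from MAAS. It assumes every attempt waits behind the same number of branches, and that the chance of rejection grows linearly with time spent waiting.

```python
def expected_minutes_to_land(slot_minutes, queue_ahead,
                             spurious_rate=0.05, drift_per_hour=0.03):
    """Toy estimate of the mean time for one branch to land.

    All parameters are hypothetical illustration values:
      slot_minutes   -- lander time consumed by each attempt
      queue_ahead    -- branches already waiting in front of mine
      spurious_rate  -- chance an attempt fails for unrelated reasons
      drift_per_hour -- extra chance of rejection per hour spent queued,
                        standing in for trunk moving underneath my branch
    """
    wait = queue_ahead * slot_minutes
    # Rejection chance grows with time in the queue; capped below 1.
    p_reject = min(0.95, spurious_rate + drift_per_hour * wait / 60.0)
    # Attempts until success follow a geometric distribution.
    attempts = 1.0 / (1.0 - p_reject)
    # Each attempt costs the wait plus my own slot.
    return attempts * (wait + slot_minutes)


if __name__ == "__main__":
    for slot in (60, 15):
        minutes = expected_minutes_to_land(slot, queue_ahead=4)
        print(f"{slot}-minute slot: about {minutes:.0f} minutes to land")
```

Even with these gentle assumptions, cutting the slot to a quarter cuts the expected time to land by more than a quarter, because shorter waits also mean fewer rejections and resubmissions. The model leaves out the human costs, such as context switching and skipped local test runs, which I suspect are larger still.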

We’ve addressed this in a big way in MAAS recently, reducing the robot’s landing time to 15 minutes, of which the tests themselves take less than 11 minutes. On my new development machine I can run the full test suite in less than 6 minutes, or the time it takes to make a cup of tea.

The next post explains how I did this.