Porting MAAS to Python 3: The (More) Technical Bits
The MAAS team’s mission was to port MAAS from Python 2.7 to Python 3.5.
This post has a rundown of how we prepared, how we did the port itself, and what we learned. It follows on from Porting MAAS to Python 3 which gives an overview of the port. They’re both written from an engineering perspective, but this post contains a lot more technical detail.
What we did to prepare
We used a few features of Python 2.7 that are meant to help prepare for a port to Python 3, and we also devised one of our own. At the top of every module and script we:
- imported unicode_literals, absolute_import, and print_function from __future__,
- selected new-style classes by default with __metaclass__ = type,
- forbade the use of str with str = None.
The last of these forced the use of bytes or unicode and so brought decisions
about encoding and decoding to the fore. Sadly it couldn’t prevent implicit
coercion between these types.
With an irritating tenacity I would also recommend the use of
dict.view{keys,values,items} in code reviews.
In Python 2.7 these methods exhibit the closest behaviour to Python 3’s
dict.{keys,values,items}. They’re also converted cleanly by 2to3 whereas,
for example, dict.keys() is converted to list(dict.keys()) and
dict.iterkeys() is converted to iter(dict.keys()).
The intermediate lists that arise are somewhat wasteful and very often
unnecessary, but it’s hard for 2to3 to know this because the dict fixer
doesn’t consider the context (and doing so may be a task more suited to a
human in any case). Using the dict.view* variants gives it a hint.
Unfortunately old habits die hard, and I ended up spending a lot of time
manually reverting these kinds of changes from 2to3’s patches.
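To make the difference concrete, here is a minimal Python 3 sketch of a view versus the list(...) wrapper that 2to3 emits for a plain dict.keys() call. The dictionary contents and the count_ready helper are invented for illustration and are not taken from MAAS.

```python
# Illustrative data only; not taken from MAAS.
nodes = {"node-a": "ready", "node-b": "deployed"}

# In Python 3, keys() returns a live view of the dict, not a list.
names = nodes.keys()
nodes["node-c"] = "new"  # The view reflects this addition.

# What 2to3 emits for a Python 2 dict.keys() call: a snapshot list.
snapshot = list(nodes.keys())
nodes["node-d"] = "broken"  # The snapshot does not see this addition.


def count_ready(mapping):
    """Iterate over values() directly; no intermediate list is needed."""
    return sum(1 for state in mapping.values() if state == "ready")
```

When the result is only iterated over or tested for membership, the list(...) wrapper can usually be dropped. It is needed when the dict is mutated during iteration or when the result is indexed.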
The process
- A bug in 2to3 meant that all __future__ imports needed to be reformatted onto a single line (see reformat-future-imports-on-single-line.py):

      bzr ls --kind=file --recursive --versioned --null | \
        xargs -r0 python python3/reformat-future-imports-on-single-line.py

- The str = None lines also needed to be removed (see remove-str-equals-none-shim.py):

      bzr ls --kind=file --recursive --versioned --null | \
        xargs -r0 python python3/remove-str-equals-none-shim.py

- We converted MAAS’s code directory by directory, but worked with patches instead of getting 2to3 to write directly:

      2to3 --nofix=callable src/${subcomponent} > \
        python3/fix-${subcomponent}.diff

- We reviewed patches to sanity check them, and to remove unnecessary conversions. We committed each patch again, then applied it:

      patch -p0 < python3/fix-${subcomponent}.diff

- The __metaclass__ = type lines and all remaining shims were next to go (see remove-all-shims.py):

      bzr ls --kind=file --recursive --versioned --null | \
        xargs -r0 python python3/remove-all-shims.py

- We got the tests passing, committing as we went. Problematic tests were skipped like so:

      @skip("PYTHON3-TEMPORARY-DURING-PORT")
We did this last step for tests that depended on code that had not yet been ported. Instead of pushing that work onto the stack we would just skip the tests and move on. Later on we revisited these tests (which, marked distinctively, were easy to find) and got them all working.
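As a rough illustration of the skip marker in context, the sketch below uses unittest.skip from the standard library. That is an assumption: the post does not say which skip decorator MAAS imported. The test case and its method names are hypothetical.

```python
import unittest

# A distinctive reason string makes the skipped tests easy to grep for later.
SKIP_REASON = "PYTHON3-TEMPORARY-DURING-PORT"


class TestExample(unittest.TestCase):
    """Hypothetical test case; not taken from MAAS."""

    @unittest.skip(SKIP_REASON)
    def test_depends_on_unported_code(self):
        self.fail("Never reached while the skip marker is in place.")

    def test_already_ported(self):
        self.assertEqual(7 // 2, 3)


def run_suite():
    """Run the example tests and return the unittest.TestResult."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestExample)
    result = unittest.TestResult()
    suite.run(result)
    return result
```

Searching the tree for the reason string then lists every test that still needs revisiting.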
Observations
On the usefulness of annotations
Python 3.5 has the typing
module in the standard library, the use of which results in quite readable
type annotations. This was more
useful than I expected.
I was often trying to keep in mind many disparate parts of the code base and I found it was much more convenient to have type information in the function signature rather than in the docstring, or discernible only from reading the code or call-sites.
I started to pine for tooling to enforce those annotations. Duck-typing doesn’t mean that anything goes: arguments still need to look and quack like the duck you’re expecting.
ABCs and this new and related
typing module make it possible to describe the ducks you’re looking for. It
seems a shame not to take full advantage of it.
I could not get mypy to install. From what I
can tell, this is the big boss of type annotations in Python. It can
statically analyze your program and discover typing mistakes. But I didn’t
have time to figure out what was wrong and learn how to use it. Another day.
However, I would settle for checking annotations at run-time if I could get it
working quickly, so I put together the short typecheck module.
By decorating functions and methods with @typecheck.typed and adding
annotations I could make type-related issues shallower, by which I mean that
the code would crash closer to where the problem originated. This made an
immediate difference, especially when unravelling byte/Unicode string issues.
This approach is imperfect and simplistic, sure. There’s none of that uncanny magic you get with, say, Haskell, where a program that merely compiles actually stands a good chance of doing what you meant, first time. But it is valuable all the same; it is another layer of defence.
Annotations and typecheck combined also remove the need to document the
types of function arguments and return values, and with it the risk of that
documentation being out of date, a state towards which documentation rapidly decays.
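The typecheck module itself is not reproduced in this post. To give an idea of what a run-time checker of this kind can look like, here is a minimal sketch. It is not the MAAS implementation: the decorator below is a hypothetical stand-in that only understands plain classes as annotations and ignores anything else, including typing generics and *args/**kwargs.

```python
import functools
import inspect


def typed(func):
    """Check plain-class annotations at call time (illustrative sketch only)."""
    signature = inspect.signature(func)
    empty = inspect.Parameter.empty
    variadic = (inspect.Parameter.VAR_POSITIONAL, inspect.Parameter.VAR_KEYWORD)

    def check(label, value, expected):
        # Skip missing annotations and anything that is not a plain class.
        if expected is empty or not isinstance(expected, type):
            return
        if not isinstance(value, expected):
            raise TypeError("%s: expected %s, got %s" % (
                label, expected.__name__, type(value).__name__))

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        bound = signature.bind(*args, **kwargs)
        for name, value in bound.arguments.items():
            parameter = signature.parameters[name]
            if parameter.kind not in variadic:
                check("argument %r" % name, value, parameter.annotation)
        result = func(*args, **kwargs)
        check("return value", result, signature.return_annotation)
        return result

    return wrapper


@typed
def repeat(text: str, times: int = 1) -> str:
    """Hypothetical example function."""
    return text * times
```

Calling repeat(b"ab") raises TypeError at the call boundary rather than somewhere deeper in the code, which is the "shallower" failure described above.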
The Big One: Byte and Unicode strings
Almost all difficulties in this port were caused by Python 2’s automatic
coercion of byte strings into Unicode strings and vice-versa. That one
language feature has a lot of sloppy code to answer for. It has also made it
hard for even the most systematic developer to live free from the shadow of
UnicodeError and its spawn.
It is cold-sweat-inducing to realise that the following Python 2 code, which works fine:
import json
from urllib2 import urlopen
response = urlopen('http://example.com/')
data = json.load(response)
is actually complete bollocks because it disregards the encoding of the
response (i.e. the charset in the Content-Type header).
Python 3 forces you to fix this, but the temptation is to do something like:
data = json.loads(response.read().decode("utf-8"))
which is a different class of bollocks because, although UTF-8 is common, it still disregards the encoding of the response. So, Python 3 gives us a big shove in the right direction but can’t yet magically fix faulty reasoning.
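One way to respect the declared encoding is sketched below, under stated assumptions. The helper names and the UTF-8 fallback are choices made for this example, not something the post prescribes. The charset is read from the Content-Type value using the standard library's email.message.Message, which is also the base class of the headers object on a urllib.request response.

```python
import json
from email.message import Message


def charset_from_content_type(content_type, default="utf-8"):
    """Return the charset parameter of a Content-Type value, or a default.

    The fallback to UTF-8 is an assumption made for this sketch.
    """
    message = Message()
    message["Content-Type"] = content_type
    return message.get_content_charset(default)


def load_json(body, content_type):
    """Decode a bytes body using its declared charset, then parse it as JSON."""
    charset = charset_from_content_type(content_type)
    return json.loads(body.decode(charset))
```

With a real response from urllib.request.urlopen, response.headers.get_content_charset() gives the same information without building a Message by hand.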
We used unicode_literals in our code. In code reviews we would check for
correct encoding and decoding. We forbade the use of str. These things
helped, I am sure, but I expected far fewer surprises from our own code; you
might even say I was shocked at how much coercion between bytes and
unicode was going on once Python 3 was there to coax it out.
Fixing these issues was, at a guess, over half of the work required to port MAAS.
In retrospect I wish there had been a way to disable automatic coercion in
Python 2 although I suspect it would have been unworkable in practice; that’s
Python 3’s big feature after all. A more selective Unicode-only literals
feature with a corresponding unicodeonly type (and a converse bytesonly
type) that Python would never automatically coerce to a byte string might have
been a workable way to improve the sorry string story in Python 2.
Sorting disparate types
Python 3 doesn’t allow ordering comparisons between different types unless they explicitly support it. However, one important part of MAAS relied on sorting mixed types.
MAAS’s Web API publishes a description document; a blob of JSON that describes all the objects and calls available. The CLI client downloads this once and refers back to it when generating sub-commands and options. When the server’s API is updated we need to detect that the client is working from an outdated description.
To do this, the server renders a canonical representation of the description
document and calculates an SHA1 hash from it. This is included in the
description that the client downloads, and the server also sends it in an
X-MAAS-API-Hash header in every HTTP response. The client can compare the
server’s hash with the local hash; if they differ, the API has changed.
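The hash check can be sketched roughly as follows. This is an illustration only: json.dumps with sort_keys=True stands in for MAAS's own canonical renderer, which is not shown in this post, and the function names are hypothetical. Only the X-MAAS-API-Hash header name and the use of SHA1 come from the description above.

```python
import hashlib
import json


def describe_hash(description):
    """Hash a stand-in canonical rendering of a JSON-serialisable description."""
    canonical = json.dumps(description, sort_keys=True, separators=(",", ":"))
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()


def is_outdated(local_hash, response_headers):
    """Return True if the server reports a different API hash than ours."""
    server_hash = response_headers.get("X-MAAS-API-Hash")
    return server_hash is not None and server_hash != local_hash
```

Because the rendering is canonical, two descriptions with the same content produce the same hash regardless of the order in which their keys were inserted.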
Rendering the canonical representation is where the problem lies. We want to ensure a consistent ordering, and we had relied on Python 2’s built-in rules for a few types:
None < Numeric/Boolean < String < Tuple
We reproduced this by creating wrappers — KeyCanonicalNone,
KeyCanonicalNumeric, KeyCanonicalString, and KeyCanonicalTuple — that
sort correctly with respect to one another. A function, key_canonical, wraps
disparate objects according to type, and can be used with sorted:
sorted(disparate_objects, key=key_canonical)
This solved our problem and we were back in business.
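The wrapper classes are not listed in this post. A simplified stand-in that produces the same ordering is sketched below. It swaps the wrapper classes for plain tuples whose first element is a rank, so it illustrates the idea and is not the MAAS code.

```python
from numbers import Real


def key_canonical(obj):
    """Return a sort key giving None < numeric/boolean < string < tuple.

    Simplified sketch: uses (rank, value) tuples instead of wrapper classes.
    """
    if obj is None:
        return (0,)
    elif isinstance(obj, Real):  # Includes bool, int and float.
        return (1, obj)
    elif isinstance(obj, str):
        return (2, obj)
    elif isinstance(obj, tuple):
        # Recurse so that tuples with mixed contents also compare safely.
        return (3, tuple(key_canonical(item) for item in obj))
    else:
        raise TypeError("Cannot compute a canonical key for %r" % (obj,))


disparate_objects = [("b", 1), "z", 3, None, True, "a", (None, 2), 0.5]
ordered = sorted(disparate_objects, key=key_canonical)
# ordered == [None, 0.5, True, 3, "a", "z", (None, 2), ("b", 1)]
```

The rank decides comparisons between different kinds of object, so Python 3 only ever compares values of compatible types.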
Things that 2to3 misses
I’m a Bad Person because these are bugs and I didn’t capture enough context at the time to be able to report them, nor have I tried to reproduce them since:
- string.letters is not automatically changed to string.ascii_letters.
- Imports of __builtin__ are changed to builtins, but references to __builtin__ are not updated.
- Imports of urllib2 are changed to urllib.*, but some references are missed.
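For reference, these are the Python 3 spellings that such leftovers need to be changed to by hand. The snippet is a generic illustration rather than code from MAAS, and the random_label helper is hypothetical.

```python
import builtins  # Python 2: __builtin__
import random
import string
import urllib.error  # Python 2: urllib2.URLError, urllib2.HTTPError
import urllib.request  # Python 2: urllib2.urlopen, urllib2.Request


def random_label(length=8):
    """Build a random label from ASCII letters (Python 2: string.letters)."""
    return "".join(random.choice(string.ascii_letters) for _ in range(length))


# References must use the new module name too, not only the import line.
original_open = builtins.open
url_error = urllib.error.URLError
urlopen = urllib.request.urlopen
```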
Miscellaneous
- Not all of twisted.conch has been ported. This means that we can no longer support the little-known introspect service in MAAS. It’s a niche service for developer-driven debugging and it’s not enabled by default, so we dropped it.
- 2to3 converts things like isinstance(thing, (bytes, unicode)) to isinstance(thing, (bytes, str)), but it’s likely that we only want either str or bytes in Python 3.
- sudo_write_file conflated its core mission (writing a file as another user via sudo) with encoding the file content. I changed it to instead raise TypeError if the given content is not a byte string, so that encoding must be done by the caller.
- atomic_write also conflated its mission: it expected text content and silently encoded it as UTF-8. It will now raise TypeError if the content is not a byte string; again, encoding must be done by the caller.
- TFTP paths are always byte strings. Other paths are often, but not always, represented as Unicode strings. This caused some difficulty.
- Integer division: we had to change many expressions like a / b into a // b to ensure integer results.
- When testing web interactions, content coming from Django is always a byte string. We used django.conf.settings.DEFAULT_CHARSET to decode. Strictly, however, we should have checked the Content-Type header.
- Python 3 cushions us by wrapping sys.std{in,out,err} in io.TextIOWrapper instances, but when forking processes you are presented with the underlying reality: byte streams. The question arises: which character encoding should we use? The LANG and LC_* environment variables typically coordinate these kinds of understandings between processes. A new select_c_utf8_locale() function was created to select the C.UTF-8 locale. For cooperating applications this will mean we can reliably use UTF-8.
- Command-line arguments given to subprocess’s functions should be Unicode strings, and it will encode them as appropriate. I did check further: that bit is implemented in C, but the result is very similar to calling os.fsencode().
- Python 3 has only new-style classes. Classes that explicitly inherit from object can be amended to implicitly inherit from object instead.
- No one seems to have paid any attention to the years of deprecation warnings about Exception.message. I conclude that deprecation warnings are more useful as retrospective justifications for breaking someone else’s application than they are useful in getting that person to update their application in time.
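The bytes-only contract described above for sudo_write_file and atomic_write can be sketched as follows. This is a hypothetical, simplified atomic_write, not the MAAS implementation: the argument order, the temporary-file naming, and the use of os.replace are all assumptions made for the example.

```python
import os
import tempfile


def atomic_write(content, filename):
    """Write bytes to filename atomically; refuse anything that is not bytes.

    Simplified sketch: encoding text is deliberately left to the caller.
    """
    if not isinstance(content, bytes):
        raise TypeError("Content must be bytes, got: %r" % (content,))
    directory = os.path.dirname(os.path.abspath(filename))
    # Create the temporary file next to the target so the rename is atomic.
    fd, temp_name = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as stream:
            stream.write(content)
        os.replace(temp_name, filename)
    except BaseException:
        if os.path.exists(temp_name):
            os.unlink(temp_name)
        raise
```

A caller holding text now has to choose an encoding explicitly, for example atomic_write(text.encode("utf-8"), path), which is where that decision belongs.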
That’s it
I hope you find this useful when making your own plans. Good luck!