Porting MAAS to Python 3: The (More) Technical Bits
The MAAS team’s mission was to port MAAS from Python 2.7 to Python 3.5.
This post has a rundown of how we prepared, how we did the port itself, and what we learned. It follows on from Porting MAAS to Python 3, which gives an overview of the port. Both are written from an engineering perspective, but this post contains a lot more technical detail.
What we did to prepare
We used a few features of Python 2.7 that are meant to help prepare for a port to Python 3, and we also devised one of our own. At the top of every module and script we:
- imported unicode_literals, absolute_import, and print_function from __future__,
- selected new-style classes by default with __metaclass__ = type,
- forbade the use of str with str = None.
The last of these forced the use of bytes or unicode and so brought decisions about encoding and decoding to the fore. Sadly it couldn’t prevent implicit coercion between these types.
With an irritating tenacity I would also recommend the use of dict.view{keys,values,items} in code reviews.
In Python 2.7 these methods exhibit the closest behaviour to Python 3’s dict.{keys,values,items}. They’re also converted cleanly by 2to3 whereas, for example, dict.keys() is converted to list(dict.keys()) and dict.iterkeys() is converted to iter(dict.keys()).
The intermediate lists that arise are somewhat wasteful and very often unnecessary, but it’s hard for 2to3 to know this because the dict fixer doesn’t consider the context (and doing so may be a task more suited to a human in any case). Using the dict.view* variants gives it a hint.
Unfortunately old habits die hard, and I ended up spending a lot of time manually reverting these kinds of changes from 2to3’s patches.
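To make the difference concrete, here is a small Python 3 sketch (the inventory dictionary and its contents are invented for illustration). It shows that a keys view stays in step with the dictionary, while the list() wrapper that 2to3 adds takes a one-off copy that is often not needed.

```python
# Python 3: dict.keys() returns a view, not a list.
inventory = {"node-a": 1, "node-b": 2}

keys = inventory.keys()        # a live view onto the dictionary
inventory["node-c"] = 3        # the view reflects this later change

# What 2to3 emits for Python 2's d.keys(): an independent snapshot.
snapshot = list(inventory.keys())
inventory["node-d"] = 4        # the snapshot does not see this one

# Iteration, membership tests and len() all work directly on the view,
# so the intermediate list is usually unnecessary.
total = sum(inventory[name] for name in inventory.keys())
has_node_d = "node-d" in keys
```

The list() copy only earns its keep when the code indexes into the result or mutates the dictionary while iterating over it.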
The process
- A bug in 2to3 meant that all __future__ imports needed to be reformatted onto a single line (see reformat-future-imports-on-single-line.py):
    bzr ls --kind=file --recursive --versioned --null | \
      xargs -r0 python python3/reformat-future-imports-on-single-line.py
- The str = None lines also needed to be removed (see remove-str-equals-none-shim.py):
    bzr ls --kind=file --recursive --versioned --null | \
      xargs -r0 python python3/remove-str-equals-none-shim.py
- We converted MAAS’s code directory by directory, but worked with patches instead of getting 2to3 to write directly:
    2to3 --nofix=callable src/${subcomponent} > \
      python3/fix-${subcomponent}.diff
- We reviewed patches to sanity check them, and to remove unnecessary conversions. We then committed each patch again and applied it:
    patch -p0 < python3/fix-${subcomponent}.diff
- The __metaclass__ = type lines and all remaining shims were next to go (see remove-all-shims.py):
    bzr ls --kind=file --recursive --versioned --null | \
      xargs -r0 python python3/remove-all-shims.py
- We got the tests passing, committing as we went. Problematic tests were skipped like so:
    @skip("PYTHON3-TEMPORARY-DURING-PORT")
We did this last step for tests that depended on code that had not yet been ported. Instead of pushing that work onto the stack we would just skip the tests and move on. Later on we revisited these tests (which, marked distinctively, were easy to find) and got them all working.
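As an illustration of that skipping convention, here is a sketch using the standard library's unittest (an assumption made to keep the example self-contained; the test class and method names are invented). A distinctive skip reason lets the runner report exactly which tests are still waiting on the port.

```python
import unittest

# A distinctive marker is easy to grep for once the port is further along.
PORT_MARKER = "PYTHON3-TEMPORARY-DURING-PORT"


class TestExample(unittest.TestCase):

    @unittest.skip(PORT_MARKER)
    def test_depends_on_unported_code(self):
        self.fail("This would fail until its dependencies are ported.")

    def test_already_ported(self):
        self.assertEqual(7 // 2, 3)


# Run the tests and collect the reasons for any skips.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestExample)
result = unittest.TestResult()
suite.run(result)
skipped_reasons = [reason for _test, reason in result.skipped]
```

Searching the tree for the marker string then gives the list of tests to revisit.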
Observations
On the usefulness of annotations
Python 3.5 has the typing module in the standard library, the use of which results in quite readable type annotations. This was more useful than I expected.
I was often trying to keep in mind many disparate parts of the code base and I found it was much more convenient to have type information in the function signature rather than in the docstring, or discernible only from reading the code or call-sites.
I started to pine for tooling to enforce those annotations. Duck-typing doesn’t mean that anything goes: arguments still need to look and quack like the duck you’re expecting.
ABCs and this new and related typing module make it possible to describe the ducks you’re looking for. It seems a shame not to take full advantage of it.
I could not get mypy to install. From what I can tell, this is the big boss of type annotations in Python. It can statically analyze your program and discover typing mistakes. But I didn’t have time to figure out what was wrong and learn how to use it. Another day.
However, I would settle for checking annotations at run-time if I could get it working quickly, so I put together the short typecheck module.
By decorating functions and methods with @typecheck.typed and adding annotations I could make type-related issues shallower, by which I mean that the code would crash closer to where the problem originated. This made an immediate difference, especially when unravelling byte/Unicode string issues.
This approach is imperfect and simplistic, sure. There’s none of that uncanny magic you get with, say, Haskell, where a program that merely compiles actually stands a good chance of doing what you meant, first time. But it is valuable all the same; it is another layer of defence.
Annotations and typecheck combined also replace the need for documenting the types of function arguments and returned values, and remove the risk of that documentation being out of date, a state towards which documentation rapidly decays.
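To give a flavour of run-time annotation checking, here is a minimal sketch. It is not the typecheck module described above: the typed decorator below is a guess at the general idea, and it only understands ordinary classes such as bytes and str, not typing constructs like List[int]. The render function is invented for the example.

```python
import functools
import inspect
import typing


def typed(func):
    """Check annotated arguments and the return value with isinstance().

    Hypothetical sketch: only ordinary classes are supported, and
    annotations on *args and **kwargs are ignored.
    """
    signature = inspect.signature(func)
    hints = typing.get_type_hints(func)
    variadic = (inspect.Parameter.VAR_POSITIONAL, inspect.Parameter.VAR_KEYWORD)

    def check(name, value):
        expected = hints.get(name)
        if isinstance(expected, type) and not isinstance(value, expected):
            raise TypeError(
                "%s: %s should be %s, got %s" % (
                    func.__name__, name, expected.__name__,
                    type(value).__name__))

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        bound = signature.bind(*args, **kwargs)
        for name, value in bound.arguments.items():
            if signature.parameters[name].kind not in variadic:
                check(name, value)
        result = func(*args, **kwargs)
        check("return", result)
        return result

    return wrapper


@typed
def render(name: str, payload: bytes) -> str:
    return "%s=%s" % (name, payload.decode("ascii"))
```

Calling render("key", "value") now fails at the call boundary with a TypeError naming the payload argument, rather than later with an AttributeError from deep inside the function.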
The Big One: Byte and Unicode strings
Almost all difficulties in this port were caused by Python 2’s automatic coercion of byte strings into Unicode strings and vice-versa. That one language feature has a lot of sloppy code to answer for. It has also made it hard for even the most systematic developer to live free from the shadow of UnicodeError and its spawn.
It is cold-sweat-inducing to realise that the following Python 2 code, which works fine:
import json
from urllib2 import urlopen
response = urlopen('http://example.com/')
data = json.load(response)
is actually complete bollocks because it disregards the encoding of the response (i.e. the charset in the Content-Type header).
Python 3 forces you to fix this, but the temptation is to do something like:
data = json.loads(response.read().decode("utf-8"))
which is a different class of bollocks because, although UTF-8 is common, it still disregards the encoding of the response. So, Python 3 gives us a big shove in the right direction but can’t yet magically fix faulty reasoning.
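One way to respect the declared charset is sketched below. The load_json_body helper is hypothetical, and falling back to UTF-8 when the header names no charset is an assumption of this sketch, not something the response guarantees. It uses email.message.Message from the standard library to parse the Content-Type value.

```python
import json
from email.message import Message


def load_json_body(body, content_type, default_charset="utf-8"):
    """Decode a JSON body using the charset from its Content-Type header.

    Hypothetical helper: `body` is the raw bytes of the response and
    `content_type` is the header value, e.g. "application/json; charset=...".
    """
    header = Message()
    header["Content-Type"] = content_type
    # Returns the charset parameter in lower case, or the default if absent.
    charset = header.get_content_charset(default_charset)
    return json.loads(body.decode(charset))


# A Latin-1 response that a hard-coded UTF-8 decode would choke on.
latin1_body = '{"name": "caf\u00e9"}'.encode("iso-8859-1")
data = load_json_body(latin1_body, "application/json; charset=ISO-8859-1")
```

With a real urllib.request response the headers object is itself a Message subclass, so response.headers.get_content_charset() gives the same answer without re-parsing.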
We used unicode_literals in our code. In code reviews we would check for correct encoding and decoding. We forbade the use of str. These things helped, I am sure, but I expected far fewer surprises from our own code; you might even say I was shocked at how much coercion between bytes and unicode was going on once Python 3 was there to coax it out.
Fixing these issues was, at a guess, over half of the work required to port MAAS.
In retrospect I wish there had been a way to disable automatic coercion in Python 2, although I suspect it would have been unworkable in practice; that’s Python 3’s big feature after all. A more selective Unicode-only literals feature with a corresponding unicodeonly type (and a converse bytesonly type) that Python would never automatically coerce to a byte string might have been a workable way to improve the sorry string story in Python 2.
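For contrast, this short sketch shows what Python 3 does when byte and Unicode strings meet (the variable names are invented for the example). Where Python 2 would quietly attempt an ASCII conversion, Python 3 refuses and leaves the decision to the programmer.

```python
text = "caf\u00e9"            # a Unicode string: 4 characters
data = text.encode("utf-8")   # a byte string: 5 bytes

# Mixing the two types is an error in Python 3; nothing is coerced.
try:
    combined = data + text
except TypeError as error:
    coercion_error = error
else:
    coercion_error = None

# Conversions have to be explicit, with a named encoding.
round_trip = data.decode("utf-8")
```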
Sorting disparate types
Python 3 doesn’t allow sorting of different types unless they explicitly support it. However, one important part of MAAS uses this.
MAAS’s Web API publishes a description document; a blob of JSON that describes all the objects and calls available. The CLI client downloads this once and refers back to it when generating sub-commands and options. When the server’s API is updated we need to detect that the client is working from an outdated description.
To do this, the server renders a canonical representation of the description document and calculates an SHA1 hash from it. This is included in the description that the client downloads, and the server also sends it in an X-MAAS-API-Hash header in every HTTP response. The client can compare the server’s hash with the local hash; if they differ, the API has changed.
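The hashing scheme can be sketched roughly as follows. This is not MAAS's code: the api_hash function, the sample description, and the use of json.dumps with sort_keys=True as the canonical rendering are all assumptions made for illustration, and that rendering only works here because every dictionary key is a string.

```python
import hashlib
import json


def api_hash(description):
    """Return a SHA-1 hex digest of a canonical rendering of `description`.

    Hypothetical sketch: sorted keys and fixed separators make the output
    independent of dictionary insertion order.
    """
    canonical = json.dumps(
        description, sort_keys=True, separators=(",", ":"))
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()


# An invented description document, as the server might publish it.
server_description = {"doc": "Example API", "handlers": ["nodes", "zones"]}
# The client's cached copy, built in a different key order.
client_description = {"handlers": ["nodes", "zones"], "doc": "Example API"}

client_is_current = api_hash(client_description) == api_hash(server_description)

# After the server gains a handler the hashes no longer match.
server_description["handlers"].append("tags")
client_is_stale = api_hash(client_description) != api_hash(server_description)
```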
Rendering the canonical representation is where the problem lies. We want to ensure a consistent ordering, and we had relied on Python 2’s built-in rules for a few types:
None < Numeric/Boolean < String < Tuple
We reproduced this by creating wrappers (KeyCanonicalNone, KeyCanonicalNumeric, KeyCanonicalString, and KeyCanonicalTuple) that sort correctly with respect to one another. A function, key_canonical, wraps disparate objects according to type, and can be used with sorted:
sorted(disparate_objects, key=key_canonical)
This solved our problem and we were back in business.
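The same ordering can be approximated without wrapper classes. The sketch below is a different technique from the one described above: instead of KeyCanonical* wrappers it maps each object to a (rank, value) tuple, and it is an illustration, not the MAAS implementation.

```python
def key_canonical(obj):
    """Return a sort key giving None < numbers/booleans < strings < tuples.

    Sketch only: each object becomes a (rank, value) pair, so objects of
    different types are ordered by rank and never compared directly.
    """
    if obj is None:
        return (0, 0)
    if isinstance(obj, (bool, int, float)):
        return (1, obj)
    if isinstance(obj, str):
        return (2, obj)
    if isinstance(obj, tuple):
        # Tuples may themselves contain mixed types, so recurse.
        return (3, tuple(key_canonical(item) for item in obj))
    raise TypeError("Cannot sort %r canonically" % (obj,))


disparate_objects = [("b", 1), "a", 3, None, True, (None, "a")]
ordered = sorted(disparate_objects, key=key_canonical)
```

Sorting disparate_objects directly with sorted() raises TypeError in Python 3; with the key function it comes out as None, True, 3, "a", (None, "a"), ("b", 1).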
Things that 2to3 misses
I’m a Bad Person because these are bugs and I didn’t capture enough context at the time to be able to report them, nor have I tried to reproduce them since:
- string.letters is not automatically changed to string.ascii_letters.
- Imports of __builtin__ are changed to builtins, but references to __builtin__ are not updated.
- Imports of urllib2 are changed to urllib.*, but some references are missed.
Miscellaneous
- Not all of twisted.conch has been ported. This means that we can no longer support the little-known introspect service in MAAS. It’s a niche service for developer-driven debugging and it’s not enabled by default, so we dropped it.
- 2to3 converts things like isinstance(thing, (bytes, unicode)) to isinstance(thing, (bytes, str)), but it’s likely that we only want either str or bytes in Python 3.
- sudo_write_file conflated its core mission (writing a file as another user via sudo) with encoding the file content. I changed it to instead raise TypeError if the given content is not a byte string, so that encoding must be done by the caller.
- atomic_write also conflated its mission: it expected text content and silently encoded it as UTF-8. It will now raise TypeError if the content is not a byte string; again, encoding must be done by the caller.
- TFTP paths are always byte strings. Other paths are often, but not always, represented as Unicode strings. This caused some difficulty.
- Integer division: we had to change many expressions like a / b into a // b to ensure integer results.
- When testing web interactions, content coming from Django is always a byte string. We used django.conf.settings.DEFAULT_CHARSET to decode. Strictly, however, we should have checked the Content-Type header.
- Python 3 cushions us by wrapping sys.std{in,out,err} in io.TextIOWrappers, but when forking processes you are presented with the underlying reality: byte streams. The question arises: which character encoding should we use? The LANG and LC_* environment variables typically coordinate these kinds of understandings between processes. A new select_c_utf8_locale() function was created to select the C.UTF-8 locale. For cooperating applications this will mean we can reliably use UTF-8.
- Command-line arguments given to subprocess’s functions should be Unicode strings, and it will encode them as appropriate. I did check further: that bit is implemented in C, but the result is very similar to calling os.fsencode().
- Python 3 has only new-style classes. Classes that explicitly inherit from object can be amended to implicitly inherit from object instead.
- No one seems to have paid any attention to the years of deprecation warnings about Exception.message. I conclude that deprecation warnings are more useful as retrospective justifications for breaking someone else’s application than they are useful in getting that person to update their application in time.
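The bytes-only contract described for sudo_write_file and atomic_write can be sketched like this. The atomic_write_bytes function below is hypothetical and is not MAAS's atomic_write; it assumes the common approach of writing to a temporary file in the same directory and then renaming it into place.

```python
import os
import tempfile


def atomic_write_bytes(content, filename):
    """Write `content` to `filename` atomically; `content` must be bytes.

    Hypothetical sketch: encoding is the caller's job, so text is
    rejected with TypeError rather than silently encoded.
    """
    if not isinstance(content, bytes):
        raise TypeError(
            "Content must be bytes, got: %s" % type(content).__name__)
    directory = os.path.dirname(os.path.abspath(filename))
    # The temporary file lives beside the target so the rename stays
    # on one filesystem.
    fd, temp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as stream:
            stream.write(content)
        os.replace(temp_path, filename)
    except BaseException:
        if os.path.exists(temp_path):
            os.unlink(temp_path)
        raise
```

A caller holding text now has to choose an encoding explicitly, for example atomic_write_bytes(text.encode("utf-8"), path), which puts that decision where it can be reviewed.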
That’s it
I hope you find this useful when making your own plans. Good luck!