Title: Adding Python 3 Compatibility to Python 2 Code

Author: Bob Schmidt

Date: Wed, 04 March 2020 23:05:11 +00:00

Summary: Silas S. Brown explains how to cope with the differences.

Body:

When Python 3 was new, its pace of change was fairly quick, and as most of us didn’t want to spend too long rewriting our code to adapt to every new release, we carried on using the far more stable Python 2. Now that Python 2 is being thrown out of GNU/Linux distributions, we’re finally having to convert all our code to Python 3 (unless we want to compile Python 2 in our home directories and just hope no more security issues arise, although that approach is not possible in every situation). Python’s ‘2to3’ tool does not help with everything (I don’t use it, as in my case it did more harm than good to my code). Since I have a lot of legacy Python code and I’d rather work with ‘stable intermediate forms’, I have been trying to convert as much of it as possible to work on both Python 2 and Python 3 from the same codebase. But this dual compatibility has caveats of its own.

Byte-strings

In Python 2, the default string type is a byte-string, and Unicode strings are something else. But a Unicode string containing only ASCII will compare as equal to the same ASCII in a byte-string, and the index operator [] on a string will give a string of length 1 in both byte and Unicode strings. In Python 3, however, the default string type is now Unicode (and the representation for byte string-literals is not compatible with all versions of Python 2), and more subtly a Unicode string containing only ASCII will not be considered equal to its equivalent byte-string, and the index operator [] on a byte-string gives an integer: if you want a string of length 1 then you’d better convert it into slice notation, i.e. s[i:i+1] instead of s[i]. Since the slice-notation version behaves identically in Python 2 and Python 3, I suggest converting all single-index operators to that, plus making sure as much as possible of your code will work regardless of whether it’s given byte-strings or Unicode-strings as input, using type if necessary to determine the type of its input. But remember str means different things on the two platforms; a quick way of checking if we’re on Python 3 is to check if type("")==type(u"").
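As a minimal sketch of the indexing and comparison differences (the variable names are illustrative only), the slice form gives a byte-string of length 1 on both versions:

```python
# True on Python 3, where "" and u"" are the same type.
is_python3 = type("") == type(u"")

# A byte-string on both versions, built without a b"" literal.
data = u"abc".encode("ascii")

# data[1] is the string "b" on Python 2 but the integer 98 on Python 3,
# whereas the slice is a byte-string of length 1 on both.
second = data[1:2]

# Compare bytes with bytes: on Python 3, second == u"b" would be False.
same = (second == u"b".encode("ascii"))
```

Building the byte-string with .encode rather than a byte-string literal is only one way of avoiding the literal syntax; it is used here so that the sketch does not depend on which Python 2 release is in use.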

Code that mentions encode('utf-8') or decode('utf-8') will particularly need attention (and even more so if other character sets are in use). I also find it useful to define some small helper functions to ‘make sure this thing is a byte-string’ (calling .encode if it’s Unicode) or ‘make sure this thing is a Unicode-string’ (calling .decode if it’s a byte-string) – sometimes these are best done in such a way that non-string objects can be passed through unchanged. String operations like .replace (and the regex library) can work on both Unicode strings and byte-strings, but they’ll fault if there’s inconsistency between their parameters (e.g. b.replace(x,y) where b is a byte-string and x and y are Unicode strings will fail), so those ‘make sure this thing is a’ helper functions can be especially useful for porting regex-related code.
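Such helpers might look something like the following sketch. The names to_bytes and to_unicode are hypothetical (they are not from any library), and UTF-8 is assumed as the default encoding:

```python
def to_bytes(s, encoding="utf-8"):
    """Return s as a byte-string; other objects pass through unchanged."""
    if type(s) == type(u""):
        return s.encode(encoding)
    return s

def to_unicode(s, encoding="utf-8"):
    """Return s as a Unicode string; other objects pass through unchanged."""
    if type(s) == type(u"".encode("ascii")):
        return s.decode(encoding)
    return s

# Force all three values to the same type before calling .replace
raw = to_bytes(u"caf\u00e9 au lait")
result = raw.replace(to_bytes(u"lait"), to_bytes(u"th\u00e9"))
```

Because anything that is not a string is returned unchanged, the helpers can be wrapped around values such as None or integers without special-casing them at every call site.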

Another thing to be aware of is that file I/O (and stdin, stdout and stderr) might or might not be done in UTF-8 by default: it depends on your system’s locale. When you have the luxury of a GNU/Linux system that’s set to UTF-8 by default, it’s easy to forget that the Microsoft Windows platform has an annoying habit of setting locale charset to something other than UTF-8, and even some Linux-based environments (such as containers) use the ‘C’ locale instead, in which case Python 3’s I/O (when not done in binary mode) will fault on anything that isn’t ASCII. To work around this from inside your script (i.e. if setting up the right environment variables before Python runs is not an option), the easiest way is probably to write code like Listing 1. Obviously you should do this only if you know for sure that the input and output really should be in UTF-8 and the system’s locales are simply not set up properly.

import sys
if type("")==type(u""): # Python 3+
  import codecs
  # Make sure stdin and stdout are set to UTF-8,
  # even if the system's locales don't have
  # UTF-8.
  stdin=codecs.getreader("utf-8")(
    sys.stdin.buffer)
  stdout=codecs.getwriter("utf-8")(
    sys.stdout.buffer)
  # Keep references to the original streams.
  old_stdin, sys.stdin = sys.stdin, stdin
  old_stdout, sys.stdout = sys.stdout, stdout

Listing 1

Numbers

In Python 2, division of two integers is an integer operation, just as it is in C. But in Python 3, dividing two integers with / produces a floating-point number, and if you want the integer result then you must ask for it explicitly (with the // operator, which Python 2.2 and later also accept). This likely means many of the divisions in your code will need some attention. Also the L suffix for long integers has been removed; if you want compatibility with early versions of Python 2 (which required L) and also Python 3, you’ll probably have to reach these numbers by multiplying up or similar, and may also have to detect the Python version at runtime and go down different branches as appropriate.
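A short sketch of divisions that behave the same on both versions follows. The values are arbitrary, and the last line assumes a Python 2 recent enough to promote overflowing integer arithmetic to long automatically:

```python
# // is floor division on both Python 2 and Python 3.
pages = 17 // 5

# Plain 17 / 5 gives 3 on Python 2 but 3.4 on Python 3;
# converting one operand to float first gives 3.4 on both.
average = float(17) / 5

# A large constant reached by multiplying up, avoiding the L suffix
# (which is a syntax error on Python 3).
four_gib = 65536 * 65536
```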

Standard output and error

Python 3 of course makes print into a function which requires parentheses (and I still don’t understand why that change gets more attention than the byte-strings change, but perhaps I do more work with Unicode than most English developers do). print with parentheses will also work in Python 2, but if supplied more than one argument, it will make its arguments look like a tuple, which is probably not what you want. The approach that is compatible with both versions is to restrict print to one argument and use format strings or construct the string manually (but remember to account for Unicode string / byte string differences in Python 3); also of note is that Python 2 code containing print by itself for a blank line will need to be written as print() in Python 3, or print("") for compatibility with both versions.
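For instance, a single pre-formatted argument prints identically on both versions (the variable names here are just for illustration):

```python
name, count = "widgets", 3

# One argument, built with a format string: prints "widgets: 3".
# print(name, count) would print the tuple ('widgets', 3) on Python 2.
message = "%s: %d" % (name, count)
print(message)

# A blank line on both versions; a bare print() shows "()" on Python 2.
print("")
```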

You might prefer to use sys.stdout, and/or sys.stderr for the ‘standard error’ stream (which is a separate stream if your program’s standard output has been redirected to a file or pipe). But another difference between Python 2 and Python 3 is that, in Python 3, sys.stderr is buffered in the same way as sys.stdout is, i.e. the output won’t happen until you call sys.stderr.flush() or output a newline. If this matters, you might need to add some calls to sys.stderr.flush() that are unnecessary (but harmless) in Python 2.
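One way to keep the flushing in a single place is a small wrapper such as this sketch, in which the function name progress is hypothetical:

```python
import sys

def progress(text):
    """Write text to standard error and make sure it appears immediately."""
    sys.stderr.write(text)  # with no trailing newline this may stay buffered
    sys.stderr.flush()      # needed on Python 3, harmless on Python 2

progress("Working...")
progress(" done\n")
```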

Reading and writing from files in Python 3 automatically converts to/from Unicode strings; if you want bytes, you must either open the file in binary mode (rb or wb) or else use the file’s .buffer member (which is not present on Python 2, so you’ll have to write an if-else branch depending on the Python version). Note that holding on to .buffer alone is not enough: you must keep a reference to the file itself, not just its buffer, or you may find the buffer has been closed when the file object is garbage-collected.
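The binary-mode route can be sketched as follows. The temporary file path is purely illustrative, and the text-mode read assumes Python 2.6 or later so that io.open is available (on Python 3 it is the same function as the built-in open):

```python
import io
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "example.txt")
payload = u"caf\u00e9\n".encode("utf-8")

# Binary mode reads and writes byte-strings on both versions.
with open(path, "wb") as f:
    f.write(payload)
with open(path, "rb") as f:
    data = f.read()

# io.open with an explicit encoding returns Unicode strings on both
# versions, regardless of the system's locale.
with io.open(path, encoding="utf-8") as f:
    text = f.read()
```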

Library changes

There are too many standard library changes between Python 2 and Python 3 to list them all here. In some cases it’s just a matter of importing a different module, and you can have if-else branches in your imports to maintain compatibility with both versions. For example, commands.getoutput now needs to be subprocess.getoutput, thread now needs to be _thread, and various HTML-related and urllib-related libraries may need importing differently. But there are other libraries with more substantial changes, e.g. the email module works completely differently in Python 3 (my IMAP-processing code is still stuck in Python 2 for this reason); some usage of StringIO might need to be BytesIO on Python 3 (and now imported from io); some exceptions have been renamed and might need assigning for compatibility; and version 6 of the third-party Tornado library has completely changed the way it does callbacks and IOLoop (although I managed to make Web Adjuster compatible with both versions by writing some fancy decorators).
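For the simple renames, one possible pattern is to try the Python 3 location first and fall back to the Python 2 one. This is a sketch rather than a complete list of affected modules:

```python
try:
    from subprocess import getoutput  # Python 3
except ImportError:
    from commands import getoutput    # Python 2

try:
    import _thread as thread          # Python 3
except ImportError:
    import thread                     # Python 2

try:
    from io import BytesIO            # Python 3, and Python 2.6+
except ImportError:
    from StringIO import StringIO as BytesIO  # older Python 2

buf = BytesIO()
buf.write(u"caf\u00e9".encode("utf-8"))
```

Catching ImportError rather than testing the version number means each branch is chosen by what is actually available, which also copes with backported modules.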

Some built-in functions are also no longer available in Python 3, so you might have to write things like:

  try: unichr # Python 2
  except NameError: unichr,xrange = chr,range # Python 3

to keep your code compatible. Also, some things that used to return lists now return iterators, and if you want a list you must explicitly ask for one, so for example you can no longer say:

  Unicode_Greek_letters = range(0x3b1,0x3ca) + range(0x391,0x3aa) # wrong

you’ll have to say list(range()) instead. Most notably, .items() no longer returns a list: some Python 2 code will assume that it does, and will assume that the dictionary from which it was taken may be changed without adverse effect on the .items() list it has (this is now likely to raise an exception if used in a loop), so you may wish to wrap all use of .items() in list() to help port this.
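A brief sketch of both points, using a made-up dictionary:

```python
counts = {"a": 1, "b": 0, "c": 2}

# list() takes a snapshot, so deleting from the dictionary inside the
# loop is safe on both versions. Without list(), Python 3 raises
# RuntimeError because the dictionary changed size during iteration.
for key, value in list(counts.items()):
    if value == 0:
        del counts[key]

# range() results must likewise be made into lists before concatenating.
greek = list(range(0x3b1, 0x3ca)) + list(range(0x391, 0x3aa))
```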

Also the sort() functions and methods have changed: they no longer take comparison functions, only key functions. Python 2 sort() can also take key=, so if you can rewrite all your comparison functions as key functions, i.e. functions that return the ‘equivalent value’ of a single item for sorting purposes, then you can write this in a way that’s compatible with both 2 and 3.
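As an illustration with made-up data, a case-insensitive comparison function can be replaced by a key function, and several criteria can be combined by returning a tuple:

```python
words = ["pear", "Fig", "apple", "Kiwi"]

# Python 2 only: words.sort(lambda a, b: cmp(a.lower(), b.lower()))
# Both versions: the key function returns the value to sort each item by.
words.sort(key=lambda w: w.lower())

# Several criteria as a tuple: longest first, then alphabetical.
by_length = sorted(words, key=lambda w: (-len(w), w.lower()))
```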

There are many other subtle changes, and you will need to test the code carefully in both versions of Python before considering it compatible with both. But the above changes were the most important ones to make in my code so far.

Summary

The most likely places that will need amending are:

Anywhere where Unicode is converted to/from UTF-8, or where files are written/read
Any [] index operators that might be applied to byte strings (use slices for maximum compatibility)
Any use of .replace or re.sub (make sure it’s all the same type)
Any divisions (should we take the integer?)
print and import statements
Any writes to sys.stderr (do we need to flush?)
Any use of .items() (does it need to be put into a list() now?), and sort() with comparison function

As always, good test coverage is the most important thing, and you may have to go through several iterations before it works.

Silas S. Brown is a partially-sighted Computer Science post-doc in Cambridge who currently works in part-time assistant tuition and part-time for Oracle. He has been an ACCU member since 1994.
