Title: The Path of Least Resistance

Author: Bob Schmidt

Date: 04 February 2020 18:47:24 +00:00 or Tue, 04 February 2020 18:47:24 +00:00

Summary: Pythonâ€™s modules and imports can be overwhelming. Steve Love attempts to de-mystify the process.

Body:

There is more to Python than scripting. As its popularity grows, people naturally want to do more with Python than just knock out simple scripts, and are increasingly wanting to write whole applications in it. Itâ€™s lucky, then, that Python provides great facilities to support exactly that need. It can, however, be a little daunting to switch from using Python to run a simple script to working on a full-scale application. This article is intended to help with that, and to show how a little thought and planning on the structure of a program can make your life easier, and help you to avoid some of the mistakes that will certainly make your life harder.

We will examine modules and packages, how to use them, and how to break your programs into smaller chunks by writing your own. We will look in some detail at how Python locates the modules you import, and some of the difficulties this presents. Throughout, the examples will be based on the tests, since this provides a great way to see how modules and packages work. As part of that, weâ€™ll get to explore how to test modules independently, and how to structure packages to make using them straightforward.

A Python program

We begin with that simple, single file script, because it introduces some of the ideas that weâ€™ll build on later. Listing 1 is a simple tool to take CSV data and turn it into JSON.

import csv
import json
import sys

parser = csv.DictReader( sys.stdin )
data = list( parser )
print( json.dumps( data, sort_keys=True,
  indent=2 ) )

Listing 1

This program takes its input from stdin, and parses it as comma separated values. The DictReader parser turns each row of the CSV data into a dict.

Next, the entire input is read by creating a list of the dictionaries.

Lastly, the list is transformed into JSON, and the result (pretty) printed to stdout.

The code itself is not really what this article is about. Whatâ€™s more interesting are the import statements at the top of the program. All 3 of these modules are part of the Python Standard Library. The import statement tells the Python interpreter to find a module (or a package, but weâ€™ll get to that) and make its contents available for use.

The fact that these are standard library modules means theyâ€™re always available if you have Python installed. The implication here is that you can share this program with anyone, and theyâ€™ll be able to use it successfully if they have Python installed (and itâ€™s the right version, and youâ€™ve told them how to use it).

Thatâ€™s all well and good for such a simple program, but sooner or later (sooner, hopefully!) someone will ask â€œWhere are the tests?â€, followed quickly by â€œAnd what about error handling?â€ Error handling is left as an exercise, but testing allows us to explore more features of Pythonâ€™s support for modules.

Modularity 0.0

The basic Python feature for splitting a program into smaller parts is the function. It may seem overkill for our tiny script with only 3 lines of working code, but one of the side-effects (bumph-tish!) of creating a function is that with a little care, you can make a unit-test to exercise it independently. â€˜Independentlyâ€™ has multiple levels of meaning here: independent of other functions in your program, independent of the file system or other external components like databases, independent of user input â€“ and, by the way â€“ screen output. Your unit-tests should work as if the computer is disconnected from the outside world â€“ no keyboard, no mouse, no screen, no network, no disk.

Why is this important? Partly because disk and network access is slow and unreliable, and you want your tests to run quickly, and partly because your tests might be run by some automated system like a Continuous Integration system that might not have access to the same things you do on your workstation.

Our program so far doesnâ€™t lend itself to being easily and automatically tested, so weâ€™ll start with factoring out the code that parses CSV data into a list of dictionaries. Listing 2 shows an example.

import csv
import json
import sys
def read_csv(input):
  parser = csv.DictReader(input)
  return list(parser)

data = read_csv(sys.stdin)
print(json.dumps(data, sort_keys = True, 
  indent = 2))

Listing 2

The function still uses a DictReader, but instead of directly using sys.stdin, it passes the argument it receives from the calling code.

The calling code now passes sys.stdin to the function and captures the result. The printing to screen remains the same.

A simple first test for this might be to check that the function doesnâ€™t misbehave if the input is empty. Although Python has a built-in unit-testing facility, there are lightweight alternatives, and the examples in this article all use pytest. Pytest will automatically discover test functions with names prefixed with test, in files with names prefixed with test_. For this example, assume the program is in a file called csv2json.py, and the test file is test_csv2json.py, containing Listing 3.

from csv2json import read_csv
def test_read_csv_accepts_empty_input():
  result = read_csv(None)
  assert result == []

Listing 3

The first line imports the read_csv function from our csv2json script. The import name csv2json is just the name of the file, without the .py extension.

To start with, we write a test that we expect to fail, just to ensure weâ€™re actually exercising what we think weâ€™re exercising. csv.DictReader works with iterable objects, but passing None should certainly cause an error.

Once again, itâ€™s the import thatâ€™s interesting, because it shows that any old Python script is also a module that can be imported. There is nothing special about code in Listing 2 to make it a module. A Python module is a namespace, so names within it must be unique, but can be the same as names from other namespaces without a clash. The syntax for importing shown here indicates that we only want one identifier from the csv2json module. Alternately, we could use

  import csv2json

and then explicitly qualify the use of the function with the namespace:

  result = csv2json.read_csv( None ).

We have a function, and a test with which to exercise it. Running that test couldnâ€™t be easier. From a command prompt/shell, in the directory location of the test script, run pytest [pytest].

But wait! Whatâ€™s this? (Figure 1 shows the output from pytest.)

==================== ERRORS ====================
____ ERROR collecting test_csv2json.py ____

test_csv2json.py:1: in <module>
    from csv2json import read_csv
csv2json.py:9: in <module>
    data = read_csv( sys.stdin )
csv2json.py:7: in read_csv
    return list( parser )
â€¦â€‹

    "pytest: reading from stdin while output is captured!  Consider using `-s`."
E   OSError: pytest: reading from stdin while output is captured!  Consider using `-s`.

Figure 1

It looks like the test failed, but not in the way we expect. It demonstrates another aspect of Pythonâ€™s import behaviour: importing a module runs the module. Thatâ€™s obviously not what we intended. What we need is some way to indicate the difference between running a python program, and importing its contents to a different program. Sure enough, Python provides a simple way to do this.

Each Python module in a program has a unique name, which you can access via the global __name__ value. Additionally, if you invoke a program, then its name is always __main__. Listing 4 shows this in action.

import csv
import json
import sys
def read_csv(input):
  parser = csv.DictReader(input)
  return list(parser)
if __name__ == '__main__':
  data = read_csv(sys.stdin)
  print(json.dumps(data, sort_keys = True,
    indent = 2))

Listing 4

Importing a module runs all the top-level code â€“ which includes the definition of functions. The code within a function (or class) is syntax checked, but not invoked

When the script is run, the test for __name__ will fail if itâ€™s being run as a result of an import of the module.

The function is invoked explicitly if the script is being run rather than imported.

Running the test now produces the output shown in Figure 2.

=================== FAILURES ===================
____ test_read_csv_accepts_empty_input() ____

    def test_read_csv_accepts_empty_input():
>       result = read_csv( None )

test_csv2json.py:4:
 _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
csv2json.py:6: in read_csv
    parser = csv.DictReader( input )
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <csv.DictReader object at 0x000001E882CD94A8>, f = None, fieldnames = None, restkey = None, restval = None, dialect = 'excel'
args = (), kwds = {}

    def __init__(self, f, fieldnames=None,
                 restkey=None, restval=None,
                 dialect="excel", *args, **kwds):
        self._fieldnames = fieldnames   
          # list of keys for the dict
        self.restkey = restkey          
          # key to catch long rows
        self.restval = restval          
          # default value for short rows
>       self.reader = reader(f, dialect, *args,
                      **kwds)
E       TypeError: argument 1 must be an iterator

Figure 2

This time, the test has failed in the way we expected: DictReader is expecting an iterator. We can now alter the test to something we expect to pass, as shown here:

  def test_read_csv_accepts_empty_input():
    result = read_csv( [] )
    assert result == []

And sure enough, it does.

  ==== test session starts ====
  platform win32 -- Python 3.7.2, pytest-5.3.2, 
  py-1.8.0, pluggy-0.13.1
  rootdir:
  collected 1 item
  
  test_csv2json.py .      [100%]
  
  ==== 1 passed in 0.04s ====

The next steps would be to add more tests, and some error handling, and factoring out the JSON export with its own tests, too. These are left as exercises.

What have we learned?

Simple Python modules are just files with Python code in them
Importing a module runs all the top-level code in that module
Python modules all have a unique name within a program, which is accessed via the value of __name__
The entry point of the program is a module called __main__
Unit tests are a great way of exercising your code, but theyâ€™re also great for exercising the structure of the code.

Namespaces

You may need to explicitly qualify function calls with namespacing, if, for example, you were importing another module or package that defined a read_csv() function. A full qualification includes the name of the package:

import textfilters.csv2json
def test_read_csv_accepts_empty_input():
  result = textfilters.csv2json.read_csv( [] )
  assert result == []

It may be, however, that just the module name provides sufficient uniqueness in the name, and so a compromise is possible:

from textfilters import csv2json

def test_read_csv_accepts_empty_input():
  result = csv2json.read_csv( [] )
  assert result == []

Donâ€™t confuse this general idea with Pythonâ€™s Namespace Packages [NamespacePkg], which are used to split packages up across several locations.

Packages

Having satisfied the requirement for tests, we may wish to embellish our little library to be a bit more general. One obvious thing might be to extend the functionality to handle other data formats, and perhaps to be able to convert in any direction. It wouldnâ€™t necessarily be unreasonable to just add a new method to the module called json2csv¹. However, if we want to add new text formats, the combinations become unwieldy, and would introduce an unnecessary amount of duplication.

The basis of the library is to take some input and parse it into a Python data structure, which we can then turn into an export format. Having gone to the trouble of separating the inputs and outputs, we can extend the idea and use a different module for each text format. Python provides another facility to bundle several modules together into a package.

As weâ€™ve already explored, a Python module can be just a file containing Python code. A Python package is a module that gathers sub-modules (and possibly sub-packages) together in a directory², along with a special file named __init__.py. For the time being, this can be just an empty file, but we will revisit this file in a later article.

Weâ€™ll begin by just adding our existing module to a package called textfilters. Our source tree now should look something like this:

  root/
   |__ textfilters/
          |__ __init__.py
          |__ csv2json.py
   |__ test_csv2json.py

The import statement in test_csv2json.py now needs to change as shown here:

from textfilters.csv2json import read_csv 

def test_read_csv_accepts_empty_input():
  result = read_csv( [] )
  assert result == []

The import statement line at the top says â€˜import the read_csv definition from the csv2json module in the textfilters packageâ€™.

Note that the portion of the import line after the import keyword defines the namespace. In this example, the namespace consists only of the function name, and needs no further qualification when used.

So far, so good. Whichever method of importing the function you use, running the test works, and (hopefully!) your tests still pass.

The next step is to split the functions into their separate modules. So far, we have two text formats to deal with: CSV and JSON. It makes some sense, therefore, to handle all the CSV functionality in a module called csv, and all the JSON in a module called json. Our original CSV function â€“ read_csv â€“ now has a name with some redundancy, duplicating as it does the name of the new module. I also decided that â€˜readâ€™ and â€˜writeâ€™ werenâ€™t really accurate names, implying some kind of file-system access, and decided upon input and output. Listing 5 shows the content of the new csv module called csv.py.

import csv
from io
import StringIO
def input(data):
  parser = csv.DictReader(data)
  return list(parser)
def output(data):
  if len(data) > 0:
    with StringIO() as buffer:
      writer = csv.DictWriter(buffer, data[
        0].keys())
      writer.writeheader()
      for row in data:
        writer.writerow(row)
      return buffer.getvalue()
  return ""

Listing 5

As previously, itâ€™s not really the implementation thatâ€™s interesting about this, itâ€™s that first import statement. Remember, this is a file called csv.py, and so its module name is csv. It should be clear from the code that the intention is to import the built-in csv module, and that is indeed what will probably happen, due to a number of factors. Itâ€™s time to talk about how Python imports modules.

The search for modules

Importing a module (and remember, packages are modules too) requires the Python interpreter to locate the code for that module, and make its code available to the program by binding the code to a name. You canâ€™t usually give an absolute path directly, but Python has a number of standard places in which it attempts to locate the module being imported.

Python keeps a cache of modules that have already been imported. Where an imported module is a sub-module, the parent module (and its parents) are also cached. This cache is checked first to see if a requested module has already been imported.

Next, if the module was not found in the cache, the paths in the system-defined sys.path list are checked in order. This sounds simple enough, but this is probably where the consequences of giving our own module the same name as a built-in one will become apparent. The contents of sys.path are initialized on startup with the following:

The directory containing the script being invoked, or an empty string to indicate the current working directory in the case where Python is invoked with no script (i.e. interactively).
The contents of the environment variable PYTHONPATH. You can alter this to change how modules are located when theyâ€™re imported.
System defined search paths for built-in modules.
The root of the site module (more on this in a later article).

Letâ€™s consider the directory layout containing our package:

  root/
   |__ textfilters/
          |__ __init__.py
          |__ csv.py
          |__ ...
   |__ test_filters.py

If you invoke a python script, or run the Python interpreter, in the root/textfilters/ directory, the statement import csv will find the csv module in that directory first. Note that the current working directory is not used if a Python script is invoked. Correspondingly, invoking a Python script, or running the Python interpreter in any other location would result in import csv importing the built-in csv module.

Back to our csv.py module, the statement import csv would indeed be recursive if you were to invoke the script directly, or any other script in the same directory that imported it. However, the usual purpose of a package is for it to be imported into a program, and the reason for making a directory for a package is to keep the packageâ€™s contents separate from other code.

To demonstrate this, letâ€™s have a look at how the test script looks now itâ€™s using the textfilters package in Listing 6.

from textfilters
import csv
def test_read_csv_accepts_empty_input():
  result = csv.input([])
  assert result == []

Listing 6

The import statement is explicitly requesting the csv module from the textfilters package.

As long as the textfilters package is found, then this script will use the csv module within it, and will never import the built-in module of the same name. In this instance, the invoked Python script is test_filters.py, and the search path will have its directory at the head of sys.path. The textfilters package is found as a child of that directory, and all is good with the world.

If the textfilters package were located somewhere else, away from your main program, you could add its path to the PYTHONPATH environment variable to ensure it was found. As previously mentioned, you donâ€™t directly import modules using absolute paths, but the PYTHONPATH environment variable is one indirect way of specifying additional search paths. Requiring all users of your module(s) to have an appropriate setting for PYTHONPATH is a bit heavy-handed³, but can be useful during development.

It needs to be said that deliberately naming your own modules to clash with built-in modules is a bad idea, because just relying on the search behaviour to ensure the right module is imported is flying close to the wind, to say the least. However, things are never that simple: you cannot know what future versions of Python will call new modules, and you cannot know what other 3rd party libraries your users have installed. This is one reason why itâ€™s a good idea to partition your code with packages, and be explicit about the names you import.

What have we learned?

A Python package is a module that has sub-modules. Standard Python packages contain a special file called __init__.py.
The package name forms part of the namespace, and needs to be unique in a program.
Python looks for modules in a few standard places, defined in a list called sys.path.
You can modify the search path easily by defining (or changing) an environment variable called PYTHONPATH.
You should partition your code with packages to minimize the risk of your names clashing with other modules.

Relativity

One of the reasons for packaging up code is to make it easy to share. At the moment, we have a package â€“ textfilters, and the test code for it lives outside the package. If we share the package, we should also share the tests, and having them inside the package means we can more easily share it just by copying the whole directory.

Look back at Listing 6, and note the import statement directly names the package. While this is fine (the tests should pass), it seems redundant to have to explicitly name it. Since this test module is now part of the same package as the csv module, whatâ€™s wrong with just import csv?

The problem with that is we will get caught out (again) by the Python search path for modules; import csv will just import the built-in csv module. Is there an alternative to explicitly having to give a full, absolute, name to modules imported within a package?

We previously learned that you canâ€™t give an absolute path to the import statement, but you can request a relative path when importing within a module, i.e. one file in a package needs to import another module within the same package. Listing 7 shows the needed invocation of import.

from . import csv 

def test_read_csv_accepts_empty_input():
  result = csv.input( [] )
  assert result == []

Listing 7

The test file is now part of the textfilters package, and so uses . instead of the package name to indicate â€˜within my package directoryâ€™.

As weâ€™ve already noted, Python modules have a special value called __name__. Weâ€™ve looked at how this is set to __main__ when a module is run as a script. When a module is imported, this value takes on the namespace name, which is essentially the path to the module relative to the applicationâ€™s directory, but separated by . instead of \ or /.

Python modules have another special value which is used to resolve relative imports within it. This is the __package__ value, which contains the name of the package. If the module is not part of a package, then this value is the empty string. This is discussed in detail in PEP 366 [PEP366], but the important thing to note here is that, since the testing module is now part of a package, the relative import path shown in Listing 7 uses the value of __package__ to determine how to find the csv module to be imported.

The package so far is a basic outline, with a simple API to translate one data format to another. We can extend this idea with facilities to transform the data as it passes through, perhaps to rename fields, or select only those fields we need. We could clearly just make a new package for these general utilities, but the intention is that theyâ€™re used in conjunction with the functions weâ€™ve already created, so letâ€™s instead make the new package a child of the existing one.

  root/
   |__ textfilters/
          |__ __init__.py
          |__ csv.py
          |__ ...
          |__ transformers/
              |__ __init__.py
              |__ change.py
          |__ test_filters.py

Relative imports also work for sub-packages. This is easily demonstrated with more tests, this time for the transformers/change.py module, as shown in Listing 8.

# textfilters / transformers / change.py
def change_keys(data, fn):
  return {
    fn(k): data[k]
    for k in data
  }
# textfilters / test_change.py
from .transformers.change
import change_keys
def test_change_keys_transforms_input():
  d = {
    1: 1
  }
  res = change_keys(d, lambda k: 'a')
  assert res == {
    'a': 1
  }

Listing 8

The test module imports the code under test from change.py using a relative import path prefixing the package and module names.

In this case, itâ€™s exactly as if weâ€™d written:

  from textfilters.transformers.change import
  change_keys.

By now, weâ€™re collecting a few test modules in the base of the package directory, and we might want to think about more tidying up to gather all the tests together away from the actual package code. We can do this by creating a new sub-package called tests and moving all the test code into it. This must be a sub-package, and so requires an __init__.py of its own, and the relative imports need to change as shown in Listing 9.

# textfilters / tests / test_change.py
from ..transformers.change
  import change_keys
def test_change_keys_transforms_input():
  d = { 1: 1 }
  res = change_keys(d, lambda k: 'a')
  assert res == { 'a': 1 }

Listing 9

The relative module imports now have an extra dot to indicate the parent package location.

This may look a bit like relative paths on a file-system, where ../ is the parent directory, itâ€™s not quite the same. Packages can be arbitrarily nested, and to indicate the grand-parent directory, on Linux youâ€™d say ../../, whereas in Python you just add another dot: ...

What have we learned?

Modules in a package can use relative imports to access code within the same module.
Packages have a special value __package__ which is used to resolve relative imports.
Packages can have sub-packages, and these can be deeply nested â€“ if you really want!

All together now

Weâ€™ve demonstrated how to write test code in the package to exercise our little library, so now for completeness, itâ€™s time to demonstrate a little running program⁴. Listing 10 shows how it might be used.

from textfilters
import csv, json
from textfilters.transformers.change
import change_keys
import sys
if __name__ == '__main__':
  def key_toupper(k):
    return k.upper()
  data = csv.input(sys.stdin)
  result = [change_keys(row, key_toupper) for row
    in data
  ]
  print(json.output(result, sort_keys =
    True, indent = 2))

Listing 10

This short program essentially does what the code in Listing 1 did, with the added bonus of taking the column-names and changing them to upper case. Importing the required functions is the job of the first 2 lines, and for now, the textfilters package needs to be on Pythonâ€™s search path. The simplest way to do that is to have it as a sub-directory of the main application folder.

Have we achieved what we set out to achieve? We had three goals in mind at the start:

to be able to test independent code independently
to be able to split our programs into manageable chunks
to be able to share those chunks with others.

We have created a package, with its own set of tests. Those tests not only test each part of the package code independently of the others, but also â€“ and importantly â€“ independently of the code that uses it in the main application. This makes sharing the code with others much easier: the package is a small, self-contained parcel of functionality.

The import statements are a little verbose, due to the use of sub-packages, and the need to make the package directory a direct child of the main application is a little unwieldy.

In the next installment, we will explore how to improve on both of those things so that your fellow users will really like using your library.

References and resources

[NamespacePkg] Python Namespace Packages: https://packaging.python.org/guides/packaging-namespace-packages/

[PEP366] â€˜PEP 366, Main module explicit relative importsâ€™ https://www.python.org/dev/peps/pep-0366/

[pytest] â€˜A Python testing frameworkâ€™ https://docs.pytest.org/en/latest/

It doesnâ€™t always make sense to do this conversion due to the possibility of nesting in the JSON input, of course, but just for the exercise.
This is a little simplistic, because there is no intrinsic requirement for modules or packages to exist on a file-system, but it suffices for now.
You can also modify sys.path in code, since itâ€™s just a list of paths, but your users will probably not thank you for it
Ok, not actually completeness, because some of the functionality that this uses was left as exercises. You did do the exercises?

Steve Love is an independent developer constantly searching for new ways to be more productive without endangering his inherent laziness.

Notes:

More fields may be available via dynamicdata ..