Journal Articles
Browse in : |
All
> Journals
> Overload
> o155
(6)
All > Topics > Design (236) Any of these categories - All of these categories |
Note: when you create a new publication type, the articles module will automatically use the templates user-display-[publicationtype].xt and user-summary-[publicationtype].xt. If those templates do not exist when you try to preview or display a new article, you'll get this warning :-) Please place your own templates in themes/yourtheme/modules/articles . The templates will get the extension .xt there.
Title: The Path of Least Resistance
Author: Bob Schmidt
Date: 04 February 2020 18:47:24 +00:00 or Tue, 04 February 2020 18:47:24 +00:00
Summary: Python’s modules and imports can be overwhelming. Steve Love attempts to de-mystify the process.
Body:
There is more to Python than scripting. As its popularity grows, people naturally want to do more with Python than just knock out simple scripts, and are increasingly wanting to write whole applications in it. It’s lucky, then, that Python provides great facilities to support exactly that need. It can, however, be a little daunting to switch from using Python to run a simple script to working on a full-scale application. This article is intended to help with that, and to show how a little thought and planning on the structure of a program can make your life easier, and help you to avoid some of the mistakes that will certainly make your life harder.
We will examine modules and packages, how to use them, and how to break your programs into smaller chunks by writing your own. We will look in some detail at how Python locates the modules you import, and some of the difficulties this presents. Throughout, the examples will be based on the tests, since this provides a great way to see how modules and packages work. As part of that, we’ll get to explore how to test modules independently, and how to structure packages to make using them straightforward.
A Python program
We begin with that simple, single file script, because it introduces some of the ideas that we’ll build on later. Listing 1 is a simple tool to take CSV data and turn it into JSON.
import csv import json import sys parser = csv.DictReader( sys.stdin ) data = list( parser ) print( json.dumps( data, sort_keys=True, indent=2 ) ) |
Listing 1 |
This program takes its input from stdin
, and parses it as comma separated values. The DictReader
parser turns each row of the CSV data into a dict
.
Next, the entire input is read by creating a list of the dictionaries.
Lastly, the list is transformed into JSON, and the result (pretty) printed to stdout
.
The code itself is not really what this article is about. What’s more interesting are the import
statements at the top of the program. All 3 of these modules are part of the Python Standard Library. The import
statement tells the Python interpreter to find a module (or a package, but we’ll get to that) and make its contents available for use.
The fact that these are standard library modules means they’re always available if you have Python installed. The implication here is that you can share this program with anyone, and they’ll be able to use it successfully if they have Python installed (and it’s the right version, and you’ve told them how to use it).
That’s all well and good for such a simple program, but sooner or later (sooner, hopefully!) someone will ask “Where are the tests?â€, followed quickly by “And what about error handling?†Error handling is left as an exercise, but testing allows us to explore more features of Python’s support for modules.
Modularity 0.0
The basic Python feature for splitting a program into smaller parts is the function. It may seem overkill for our tiny script with only 3 lines of working code, but one of the side-effects (bumph-tish!) of creating a function is that with a little care, you can make a unit-test to exercise it independently. ‘Independently’ has multiple levels of meaning here: independent of other functions in your program, independent of the file system or other external components like databases, independent of user input – and, by the way – screen output. Your unit-tests should work as if the computer is disconnected from the outside world – no keyboard, no mouse, no screen, no network, no disk.
Why is this important? Partly because disk and network access is slow and unreliable, and you want your tests to run quickly, and partly because your tests might be run by some automated system like a Continuous Integration system that might not have access to the same things you do on your workstation.
Our program so far doesn’t lend itself to being easily and automatically tested, so we’ll start with factoring out the code that parses CSV data into a list of dictionaries. Listing 2 shows an example.
import csv import json import sys def read_csv(input): parser = csv.DictReader(input) return list(parser) data = read_csv(sys.stdin) print(json.dumps(data, sort_keys = True, indent = 2)) |
Listing 2 |
The function still uses a DictReader
, but instead of directly using sys.stdin
, it passes the argument it receives from the calling code.
The calling code now passes sys.stdin
to the function and captures the result. The printing to screen remains the same.
A simple first test for this might be to check that the function doesn’t misbehave if the input is empty. Although Python has a built-in unit-testing facility, there are lightweight alternatives, and the examples in this article all use pytest. Pytest will automatically discover test functions with names prefixed with test, in files with names prefixed with test_. For this example, assume the program is in a file called csv2json.py, and the test file is test_csv2json.py, containing Listing 3.
from csv2json import read_csv def test_read_csv_accepts_empty_input(): result = read_csv(None) assert result == [] |
Listing 3 |
The first line imports the read_csv
function from our csv2json script. The import name csv2json
is just the name of the file, without the .py extension.
To start with, we write a test that we expect to fail, just to ensure we’re actually exercising what we think we’re exercising. csv.DictReader
works with iterable objects, but passing None
should certainly cause an error.
Once again, it’s the import that’s interesting, because it shows that any old Python script is also a module that can be imported. There is nothing special about code in Listing 2 to make it a module. A Python module is a namespace, so names within it must be unique, but can be the same as names from other namespaces without a clash. The syntax for importing shown here indicates that we only want one identifier from the csv2json module. Alternately, we could use
import csv2json
and then explicitly qualify the use of the function with the namespace:
result = csv2json.read_csv( None ).
We have a function, and a test with which to exercise it. Running that test couldn’t be easier. From a command prompt/shell, in the directory location of the test script, run pytest
[pytest].
But wait! What’s this? (Figure 1 shows the output from pytest.)
==================== ERRORS ==================== ____ ERROR collecting test_csv2json.py ____ test_csv2json.py:1: in <module> from csv2json import read_csv csv2json.py:9: in <module> data = read_csv( sys.stdin ) csv2json.py:7: in read_csv return list( parser ) …​ "pytest: reading from stdin while output is captured! Consider using `-s`." E OSError: pytest: reading from stdin while output is captured! Consider using `-s`. |
Figure 1 |
It looks like the test failed, but not in the way we expect. It demonstrates another aspect of Python’s import behaviour: importing a module runs the module. That’s obviously not what we intended. What we need is some way to indicate the difference between running a python program, and importing its contents to a different program. Sure enough, Python provides a simple way to do this.
Each Python module in a program has a unique name, which you can access via the global __name__
value. Additionally, if you invoke a program, then its name is always __main__
. Listing 4 shows this in action.
import csv import json import sys def read_csv(input): parser = csv.DictReader(input) return list(parser) if __name__ == '__main__': data = read_csv(sys.stdin) print(json.dumps(data, sort_keys = True, indent = 2)) |
Listing 4 |
Importing a module runs all the top-level code – which includes the definition of functions. The code within a function (or class) is syntax checked, but not invoked
When the script is run, the test for __name__
will fail if it’s being run as a result of an import of the module.
The function is invoked explicitly if the script is being run rather than imported.
Running the test now produces the output shown in Figure 2.
=================== FAILURES =================== ____ test_read_csv_accepts_empty_input() ____ def test_read_csv_accepts_empty_input(): > result = read_csv( None ) test_csv2json.py:4: _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ csv2json.py:6: in read_csv parser = csv.DictReader( input ) _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ self = <csv.DictReader object at 0x000001E882CD94A8>, f = None, fieldnames = None, restkey = None, restval = None, dialect = 'excel' args = (), kwds = {} def __init__(self, f, fieldnames=None, restkey=None, restval=None, dialect="excel", *args, **kwds): self._fieldnames = fieldnames # list of keys for the dict self.restkey = restkey # key to catch long rows self.restval = restval # default value for short rows > self.reader = reader(f, dialect, *args, **kwds) E TypeError: argument 1 must be an iterator |
Figure 2 |
This time, the test has failed in the way we expected: DictReader
is expecting an iterator. We can now alter the test to something we expect to pass, as shown here:
def test_read_csv_accepts_empty_input(): result = read_csv( [] ) assert result == []
And sure enough, it does.
==== test session starts ==== platform win32 -- Python 3.7.2, pytest-5.3.2, py-1.8.0, pluggy-0.13.1 rootdir: collected 1 item test_csv2json.py . [100%] ==== 1 passed in 0.04s ====
The next steps would be to add more tests, and some error handling, and factoring out the JSON export with its own tests, too. These are left as exercises.
What have we learned?
- Simple Python modules are just files with Python code in them
- Importing a module runs all the top-level code in that module
- Python modules all have a unique name within a program, which is accessed via the value of
__name__
- The entry point of the program is a module called
__main__
- Unit tests are a great way of exercising your code, but they’re also great for exercising the structure of the code.
Namespaces | |
|
Packages
Having satisfied the requirement for tests, we may wish to embellish our little library to be a bit more general. One obvious thing might be to extend the functionality to handle other data formats, and perhaps to be able to convert in any direction. It wouldn’t necessarily be unreasonable to just add a new method to the module called json2csv
1. However, if we want to add new text formats, the combinations become unwieldy, and would introduce an unnecessary amount of duplication.
The basis of the library is to take some input and parse it into a Python data structure, which we can then turn into an export format. Having gone to the trouble of separating the inputs and outputs, we can extend the idea and use a different module for each text format. Python provides another facility to bundle several modules together into a package.
As we’ve already explored, a Python module can be just a file containing Python code. A Python package is a module that gathers sub-modules (and possibly sub-packages) together in a directory2, along with a special file named __init__.py. For the time being, this can be just an empty file, but we will revisit this file in a later article.
We’ll begin by just adding our existing module to a package called textfilters
. Our source tree now should look something like this:
root/ |__ textfilters/ |__ __init__.py |__ csv2json.py |__ test_csv2json.py
The import
statement in test_csv2json.py now needs to change as shown here:
from textfilters.csv2json import read_csv def test_read_csv_accepts_empty_input(): result = read_csv( [] ) assert result == []
The import
statement line at the top says ‘import the read_csv
definition from the csv2json
module in the textfilters
package’.
Note that the portion of the import line after the import
keyword defines the namespace. In this example, the namespace consists only of the function name, and needs no further qualification when used.
So far, so good. Whichever method of importing the function you use, running the test works, and (hopefully!) your tests still pass.
The next step is to split the functions into their separate modules. So far, we have two text formats to deal with: CSV and JSON. It makes some sense, therefore, to handle all the CSV functionality in a module called csv
, and all the JSON in a module called json
. Our original CSV function – read_csv
– now has a name with some redundancy, duplicating as it does the name of the new module. I also decided that ‘read’ and ‘write’ weren’t really accurate names, implying some kind of file-system access, and decided upon input
and output
. Listing 5 shows the content of the new csv
module called csv.py.
import csv from io import StringIO def input(data): parser = csv.DictReader(data) return list(parser) def output(data): if len(data) > 0: with StringIO() as buffer: writer = csv.DictWriter(buffer, data[ 0].keys()) writer.writeheader() for row in data: writer.writerow(row) return buffer.getvalue() return "" |
Listing 5 |
As previously, it’s not really the implementation that’s interesting about this, it’s that first import
statement. Remember, this is a file called csv.py, and so its module name is csv
. It should be clear from the code that the intention is to import the built-in csv
module, and that is indeed what will probably happen, due to a number of factors. It’s time to talk about how Python imports modules.
The search for modules
Importing a module (and remember, packages are modules too) requires the Python interpreter to locate the code for that module, and make its code available to the program by binding the code to a name. You can’t usually give an absolute path directly, but Python has a number of standard places in which it attempts to locate the module being imported.
Python keeps a cache of modules that have already been imported. Where an imported module is a sub-module, the parent module (and its parents) are also cached. This cache is checked first to see if a requested module has already been imported.
Next, if the module was not found in the cache, the paths in the system-defined sys.path
list are checked in order. This sounds simple enough, but this is probably where the consequences of giving our own module the same name as a built-in one will become apparent. The contents of sys.path
are initialized on startup with the following:
- The directory containing the script being invoked, or an empty string to indicate the current working directory in the case where Python is invoked with no script (i.e. interactively).
- The contents of the environment variable
PYTHONPATH
. You can alter this to change how modules are located when they’re imported. - System defined search paths for built-in modules.
- The root of the
site
module (more on this in a later article).
Let’s consider the directory layout containing our package:
root/ |__ textfilters/ |__ __init__.py |__ csv.py |__ ... |__ test_filters.py
If you invoke a python script, or run the Python interpreter, in the root/textfilters/ directory, the statement import csv
will find the csv
module in that directory first. Note that the current working directory is not used if a Python script is invoked. Correspondingly, invoking a Python script, or running the Python interpreter in any other location would result in import csv
importing the built-in csv
module.
Back to our csv.py module, the statement import csv
would indeed be recursive if you were to invoke the script directly, or any other script in the same directory that imported it. However, the usual purpose of a package is for it to be imported into a program, and the reason for making a directory for a package is to keep the package’s contents separate from other code.
To demonstrate this, let’s have a look at how the test script looks now it’s using the textfilters
package in Listing 6.
from textfilters import csv def test_read_csv_accepts_empty_input(): result = csv.input([]) assert result == [] |
Listing 6 |
The import
statement is explicitly requesting the csv
module from the textfilters
package.
As long as the textfilters
package is found, then this script will use the csv
module within it, and will never import the built-in module of the same name. In this instance, the invoked Python script is test_filters.py, and the search path will have its directory at the head of sys.path
. The textfilters
package is found as a child of that directory, and all is good with the world.
If the textfilters
package were located somewhere else, away from your main program, you could add its path to the PYTHONPATH
environment variable to ensure it was found. As previously mentioned, you don’t directly import modules using absolute paths, but the PYTHONPATH
environment variable is one indirect way of specifying additional search paths. Requiring all users of your module(s) to have an appropriate setting for PYTHONPATH
is a bit heavy-handed3, but can be useful during development.
It needs to be said that deliberately naming your own modules to clash with built-in modules is a bad idea, because just relying on the search behaviour to ensure the right module is imported is flying close to the wind, to say the least. However, things are never that simple: you cannot know what future versions of Python will call new modules, and you cannot know what other 3rd party libraries your users have installed. This is one reason why it’s a good idea to partition your code with packages, and be explicit about the names you import.
What have we learned?
- A Python package is a module that has sub-modules. Standard Python packages contain a special file called __init__.py.
- The package name forms part of the namespace, and needs to be unique in a program.
- Python looks for modules in a few standard places, defined in a list called sys.path.
- You can modify the search path easily by defining (or changing) an environment variable called
PYTHONPATH
. - You should partition your code with packages to minimize the risk of your names clashing with other modules.
Relativity
One of the reasons for packaging up code is to make it easy to share. At the moment, we have a package – textfilters
, and the test code for it lives outside the package. If we share the package, we should also share the tests, and having them inside the package means we can more easily share it just by copying the whole directory.
Look back at Listing 6, and note the import
statement directly names the package. While this is fine (the tests should pass), it seems redundant to have to explicitly name it. Since this test module is now part of the same package as the csv
module, what’s wrong with just import csv
?
The problem with that is we will get caught out (again) by the Python search path for modules; import csv
will just import the built-in csv
module. Is there an alternative to explicitly having to give a full, absolute, name to modules imported within a package?
We previously learned that you can’t give an absolute path to the import
statement, but you can request a relative path when importing within a module, i.e. one file in a package needs to import another module within the same package. Listing 7 shows the needed invocation of import.
from . import csv def test_read_csv_accepts_empty_input(): result = csv.input( [] ) assert result == [] |
Listing 7 |
The test file is now part of the textfilters
package, and so uses .
instead of the package name to indicate ‘within my package directory’.
As we’ve already noted, Python modules have a special value called __name__
. We’ve looked at how this is set to __main__
when a module is run as a script. When a module is imported, this value takes on the namespace name, which is essentially the path to the module relative to the application’s directory, but separated by .
instead of \
or /
.
Python modules have another special value which is used to resolve relative imports within it. This is the __package__
value, which contains the name of the package. If the module is not part of a package, then this value is the empty string. This is discussed in detail in PEP 366 [PEP366], but the important thing to note here is that, since the testing module is now part of a package, the relative import path shown in Listing 7 uses the value of __package__
to determine how to find the csv module to be imported.
The package so far is a basic outline, with a simple API to translate one data format to another. We can extend this idea with facilities to transform the data as it passes through, perhaps to rename fields, or select only those fields we need. We could clearly just make a new package for these general utilities, but the intention is that they’re used in conjunction with the functions we’ve already created, so let’s instead make the new package a child of the existing one.
root/ |__ textfilters/ |__ __init__.py |__ csv.py |__ ... |__ transformers/ |__ __init__.py |__ change.py |__ test_filters.py
Relative imports also work for sub-packages. This is easily demonstrated with more tests, this time for the transformers/change.py module, as shown in Listing 8.
# textfilters / transformers / change.py def change_keys(data, fn): return { fn(k): data[k] for k in data } # textfilters / test_change.py from .transformers.change import change_keys def test_change_keys_transforms_input(): d = { 1: 1 } res = change_keys(d, lambda k: 'a') assert res == { 'a': 1 } |
Listing 8 |
The test module imports the code under test from change.py using a relative import path prefixing the package and module names.
In this case, it’s exactly as if we’d written:
from textfilters.transformers.change import change_keys.
By now, we’re collecting a few test modules in the base of the package directory, and we might want to think about more tidying up to gather all the tests together away from the actual package code. We can do this by creating a new sub-package called tests
and moving all the test code into it. This must be a sub-package, and so requires an __init__.py of its own, and the relative imports need to change as shown in Listing 9.
# textfilters / tests / test_change.py from ..transformers.change import change_keys def test_change_keys_transforms_input(): d = { 1: 1 } res = change_keys(d, lambda k: 'a') assert res == { 'a': 1 } |
Listing 9 |
The relative module imports now have an extra dot to indicate the parent package location.
This may look a bit like relative paths on a file-system, where ../ is the parent directory, it’s not quite the same. Packages can be arbitrarily nested, and to indicate the grand-parent directory, on Linux you’d say ../../, whereas in Python you just add another dot: ...
What have we learned?
- Modules in a package can use relative imports to access code within the same module.
- Packages have a special value
__package__
which is used to resolve relative imports. - Packages can have sub-packages, and these can be deeply nested – if you really want!
All together now
We’ve demonstrated how to write test code in the package to exercise our little library, so now for completeness, it’s time to demonstrate a little running program4. Listing 10 shows how it might be used.
from textfilters import csv, json from textfilters.transformers.change import change_keys import sys if __name__ == '__main__': def key_toupper(k): return k.upper() data = csv.input(sys.stdin) result = [change_keys(row, key_toupper) for row in data ] print(json.output(result, sort_keys = True, indent = 2)) |
Listing 10 |
This short program essentially does what the code in Listing 1 did, with the added bonus of taking the column-names and changing them to upper case. Importing the required functions is the job of the first 2 lines, and for now, the textfilters
package needs to be on Python’s search path. The simplest way to do that is to have it as a sub-directory of the main application folder.
Have we achieved what we set out to achieve? We had three goals in mind at the start:
- to be able to test independent code independently
- to be able to split our programs into manageable chunks
- to be able to share those chunks with others.
We have created a package, with its own set of tests. Those tests not only test each part of the package code independently of the others, but also – and importantly – independently of the code that uses it in the main application. This makes sharing the code with others much easier: the package is a small, self-contained parcel of functionality.
The import statements are a little verbose, due to the use of sub-packages, and the need to make the package directory a direct child of the main application is a little unwieldy.
In the next installment, we will explore how to improve on both of those things so that your fellow users will really like using your library.
References and resources
[NamespacePkg] Python Namespace Packages: https://packaging.python.org/guides/packaging-namespace-packages/
[PEP366] ‘PEP 366, Main module explicit relative imports’ https://www.python.org/dev/peps/pep-0366/
[pytest] ‘A Python testing framework’ https://docs.pytest.org/en/latest/
- It doesn’t always make sense to do this conversion due to the possibility of nesting in the JSON input, of course, but just for the exercise.
- This is a little simplistic, because there is no intrinsic requirement for modules or packages to exist on a file-system, but it suffices for now.
- You can also modify sys.path in code, since it’s just a list of paths, but your users will probably not thank you for it
- Ok, not actually completeness, because some of the functionality that this uses was left as exercises. You did do the exercises?
is an independent developer constantly searching for new ways to be more productive without endangering his inherent laziness.
Notes:
More fields may be available via dynamicdata ..