hourly 1 http://backend.userland.com/rss

Programming Topics + CVu Journal Vol 16, #2 - Apr 2004

Browse in :

All > Topics > Programming
All > Journals > CVu > 162
Any of these categories - All of these categories

Note: when you create a new publication type, the articles module will automatically use the templates user-display-[publicationtype].xt and user-summary-[publicationtype].xt. If those templates do not exist when you try to preview or display a new article, you'll get this warning :-) Please place your own templates in themes/yourtheme/modules/articles . The templates will get the extension .xt there.

Title: A Python Script to Relocate Source Trees

Author:

Date: 01 April 2004 22:53:48 +01:00 or Thu, 01 April 2004 22:53:48 +01:00

Summary:

Files form the raw ingredients of a software system - source files, build files, configuration files, resource files, scripts etc. These files are organised into directories.

Body:

As the system develops, this directory structure must develop with it: maybe an extra level of hierarchy needs adding to accommodate a new revision of an operating system; maybe third party libraries need gathering into a single place; maybe we are porting to a platform which imposes some restriction on file names; or maybe the name originally chosen for a directory has simply become misleading.

This article describes the development of a simple Python script to facilitate relocating a source tree. It is as much - if not more - about getting started with Python as it is about solving the particular problem used as an example.

Statement of the Problem

Let's suppose that we have a directory structure we wish to modify. This existing structure has been reviewed and the decision has been taken to remap source directories as follows:

png         -> graphics/thirdparty/png
jpeg        -> graphics/thirdparty/jpeg
bitmap      -> graphics/common/bitmap
UserIF      -> ui
UserIF/Wgts -> ui/widgets
os          -> platform/os
os/hpux     -> platform/os/hpux10

By "re-mapped", I mean that the directory and its contents should be recursively moved to the new location. So, for example:

UserIF/Wgts/buttons/switchbutton.cpp
  -> ui/widgets/buttons/switchbutton.cpp

Although it's straightforward create a new top-level directory and copy existing directories to their new locations, the problem we then face is that our source files will no longer build because the files they include have moved. In fact, some of the build files themselves need adjusting, since they too reference moving targets.

An Outline Solution

In outline, our script will implement a two-pass algorithm:

1st Pass:: Traverse all files in the current source tree, working out where they will move to. The output of this pass is a container which maps existing files to their new locations.
2nd Pass:: For each file found in the first pass, perform the actual relocation, updating any internal references to file paths.

The actual processing for a file in the 2nd pass depends on the type of the file. We can simply copy a bitmap, for example, to its new home, but when relocating a C/C++ source file we'll need to be more careful. This is why I've chosen a two-pass solution: when updating internal file references, I prefer look-up to recalculation.

First Pass - Iterating Over Files

We want to map existing files to their new locations. In Python, the builtin mapping type is called a dictionary. The output of pass one will be a dictionary, which we initialise to be empty.

files_map = {}

There are two standard modules which support file and directory operations:

os - "Miscellaneous operating system interfaces"
os.path - "Common pathname manipulations"

Both of these will be of use to our script. In fact, both provide a mechanism for traversing a directory tree:

os.path.walk: which calls back a supplied function at each subdirectory, passing that function the subdirectory name and a list of the files it contains.
os.walk: which generates a 3-tuple (dirpath, dirnames, filenames) for each subdirectory in the tree.

The second option, os.walk, only exists in Python 2.3 (2.3 strengthens the language's support for generators). I prefer it since it makes the script more direct.

import os
# Initialise a dictionary to map current file
# path to new file path.
files_map = {}

# Fill the dictionary by remapping all files
# beneath the current working directory.
for (subdir, dirs, files) in os.walk('.'):
  print "Mapping files in subdir [%s]" % subdir
  files_map.update(
             mapFiles(subdir, files)
             )

Note the general absence of visible symbols to delimit blocks and expressions - a colon marks the end of the for condition, and that's about it. Expressions are terminated by a newline, unless the newline is escaped with a backslash or the expression is waiting for a closing bracket to complete it. Thus the print statement terminates at the newline, but the dictionary update statement spreads over three lines. Note also that statements can be grouped into a block by placing them at the same indentation level: the body of the for loop is a block of two statements.

To a C/C++ programmer these syntactical rules may seem unusual, dangerous even - attaching meaning to whitespace!? - but I would argue that they actually encourage clean and well laid out scripts.

Incidentally, the default behaviour of the print statement is to add a newline after printing. Appending a trailing comma would print a space instead of this newline.

Pass One: Mapping Files and Directories

If we attempt to run the script as it stands, we'll see an exception thrown:

<...snip...>
NameError: name 'mapFiles' is not defined

which is as we'd expect. We need to define the function:

def mapFiles(dirname, files):
  """Return a dictionary mapping files to their new locations."""
  new_dir = mapDirectory(dirname)
  print "mapDirectory [%s] -> [%s]" % \
                          (dirname, new_dir)
  fm = {}
  for f in files:
    fm[os.path.join(dirname, f)] = \
       os.path.join(new_dir, f)
  return fm

The Python interpreter needs to know about this function before it can use it, so we'll place it before the path traversal loop.

The first statement of the function body is the function's (optional) documentation string, or docstring. The Python documentation explains why it's worth getting the habit of using docstrings and the conventions for their use.

The function fills a dictionary mapping files to their new location. It uses os.path.join from the os.path module to construct a file path. The backslash is there to escape a newline, allowing the dictionary itemsetter to continue onto a second line.

The final component of the first pass is the function mapDirectory, which maps an existing directory to its new location.

The Python Interpreter

If we were to start an interactive Python shell and load the mapFiles function, we could then query it and its attributes:

>>> mapFiles
<function mapFiles at 0x01106630>
>>> dir(mapFiles)
['__call__', '__class__', '__delattr__',
'__dict__', '__doc__', '__get__',
'__getattribute__', '__hash__', '__init__',
'__module__', '__name__', '__new__',
'__reduce__', '__reduce_ex__', '__repr__',
'__setattr__', '__str__', 'func_closure',
'func_code', 'func_defaults', 'func_dict',
'func_doc', 'func_globals', 'func_name']

>>> mapFiles.func_doc
'Return a dictionary mapping files in dir to
their new locations.'

I like to switch between editor and interpreter when developing scripts (on Windows, the PythonWin IDE makes this easy to do, or, alternatively, Python's -i option), since it helps me understand both how my script works and how Python works. Here we can see the name "mapFiles" refers to a function object which has a list of attributes.

The Python interpreter also allows functions to be exercised immediately:

>>> mapFiles('png',('pngRead.h','pngWrite.h'))

def mapDirectory(dname):
  """Return the new location of the input directory."""
  # The following dictionary maps existing
  # directories to their new locations.
  dirmap = {
    'png'         : 'graphics/thirdparty/png',
    'jpeg'        : 'graphics/thirdparty/jpeg',
    'bitmap'      : 'graphics/common/bitmap',
    'UserIF'      : 'ui',
    'UserIF/Wgts' : 'ui/widgets',
    'os'          : 'platform/os',
    'os/hpux'     : 'platform/os/hpux10'
  }
  # Successively reduce the directory path
  # until it matches one of the keys in the
  # dictionary.
  mapped_dir = p = dname
  while p and not p in dirmap:
    p = os.path.dirname(p)
  if p:
    mapped_dir = os.path.join(dirmap[p],
                              dname[len(p) + 1:])
 return mapped_dir

The directory rearrangement described earlier in this article has been represented as a dictionary. The input directory is reduced until we match a key in this dictionary. As soon as we find such a match, we construct our return value from the value at this key and the un-matched tail of the input directory; or, if no such match is found, the input value is returned unmodified.

The expression dname[len(p) + 1:] is a slice operation applied to a string. Bearing in mind that p is the first len(p) characters in dir, this expression returns what's left of dname, omitting the slash which separates the head from the tail of this path.

For example, when mapping the directory 'os/hpux/include' we would expect to exit the while loop when p == 'os/hpux', and return the result of joining the path 'platform/os/hpux10' to 'include'.

mapDirectory('os/hpux/include')
  -> os.path.join('platform/os/hpux10',
                   'include')
  -> 'platform/os/hpux10/include'

Pass One: Testing

Let's test these expectations by adding the following lines to our script:

assert(mapDirectory('os/hpux/include')
    == 'platform/os/hpux10/include')
assert(mapDirectory('os/win32')
    == 'platform/os/win32')
assert(mapDirectory('unittests')
    == 'unittests')

These tests pass on unix platforms, but if you run them on Windows the first two tests raise AssertionError exceptions (although the final one passes). For now, I'll leave you to work out why - but promise a more platform independent solution in the final version of the script.

Second Pass

Recall that:

2nd Pass:: For each file found in the first pass, perform the actual relocation, updating any internal references to file paths.

and that the output of this phase is a dictionary, files_map. The main loop for the 2nd pass is:

# Create the new root directory for relocated
# files
new_root = '../relocate'
os.makedirs(new_root)

# Now actually perform the relocation
for srcdst in files_map.items():
  relocate(file, new_root, files_map)

The function os.makedirs recursively creates a directory path. It throws an exception if the directory path already exists. We will not catch this exception since we want to ensure the files are being relocated to a completely new directory.

Pass Two: Relocating by Copying

import shutil
def relocate(srcdst, dst_root, files_map):
  """Relocate a file, correcting internal file references."""
  dst_file = os.path.join(dst_root, srcdst[1])
  dst_dir = os.path.dirname(dst_file)
  if not os.path.isdir(dst_dir):
    os.makedirs(dst_dir)
  if isSourceFile(dst_file):
    relocateSource(srcdst, dst_file,
                   files_map)
  else:
    shutil.copyfile(srcdst[0], dst_file)

There aren't many new features to comment on here. The first parameter to the function, srcdst is a (key, value) item from the files_map dictionary, so srcdst[0] is the path to the original file, and srcdst[1] is the path to the relocated file, relative to the new root directory.

We create the destination directory unless it already exists. Then, if our file is a source file, we call relocateSourceFile; otherwise, we simply copy the file across.

I admit this isn't the most object-oriented of functions. The meaning of the literal 0 and 1 isn't transparent, and switching on type often indicates unfamiliarity with polymorphism. It isn't Python that's to blame here, nor a lack of familiarity with polymorphism: rather a lack of familiarity on my part with Python's support for polymorphism, and a reluctance to add such sophistication to a simple script.

Pass Two: Identifying Source Files

import re

def isSourceFile(file):
  """Return True if the input file is a C/C++ source file."""
  src_re = re.compile(r'\.(c|h)(pp)?$',re.IGNORECASE)
  return src_re.search(file) is not None

We identify source files using a regular expression pattern:

r'\.(c|h)(pp)?$'

Here, the "r" stands for raw string, which means that backslashes are not handled in any special way by Python - the string literal is passed directly on to the regular expression module.

Regular expression patterns in Python are as powerful, concise and downright confusing to the uninitiated as they are elswhere. I would say that subsequent use of regular expression matches is a little more friendly.

In this case, the regex reads: "match a '.' followed by either a 'c' or an 'h' followed by one or no 'pp's, followed by the end of the string". The re.IGNORECASE flag tells the regex compiler to ignore case.

So, we expect:

assert(isSourceFile('a.cpp'))
assert(isSourceFile('a.C'))
assert(not isSourceFile('a.cc'))
assert(not isSourceFile('a.cppp'))
assert(isSourceFile('a.cc.h'))

This time, the assertions hold.

Pass Two: Relocating Source Files

def relocateSource(srcdst, dstfile, files_map):
"""Relocate a source file, correcting included file paths to included."""
  fin = file(srcdst[0], 'r')
  fout = file(dstfile, 'w')
  for line in fin
    fout.write(processSourceLine(line,
               srcdst, files_map))
  fin.close()
  fout.close()

The function relocateSource() simply reads the input file line by line. Each line is converted and written to the output file.

Pass Two: Processing a Line of a Source File

def processSourceLine(line, srcdst, files_map):
  """Process a line from a source file, correcting included file paths."""
  include_re = re.compile(
        r'^\s*#\s*include\s*"'
        r'(?P<inc_file>\S+)'
        '"')
  match = include_re.match(line)
  if match:
    mapped_inc_file = mapIncludeFile(
        match.group('inc_file'),
        srcdst,
        files_map)
    line = line[:match.start('inc_file')] + \
           mapped_inc_file + \
           line[match.end('inc_file'):]
  return line

The function processSourceLine has a rather more complicated regex at its core. Essentially we want to spot lines similar to:

#include "UserIF/Wgts/Menu.hpp"

and extract the double-quoted file path. The complication is that there may be any amount of whitespace at several points on the line - hence the appearances of \s*, which reads "zero or more whitespace characters".

The three raw strings which comprise the regex will be concatenated before the regex is compiled - in the same way that adjacent string literals in C/C++ get joined together in an early phase of compilation. I have split the string in this way to emphasise its meaning.

The bizarre (?P<inc_file>\S+) syntax creates a named group: essentially, it allows us to identify the sub-group of a match object using "inc_file".

So, the function looks for lines of the form:

#include "inc_file"

then calls mapIncludeFile(inc_file...) to find what should now be included, and returns:

#include "mapped_inc_file"

Incidentally, I am assuming here that the angle brackets are reserved for inclusion of standard library files - or at least not the files we are moving. That is, we don't try and alter lines such as:

#include <vector>

Pass Two: Mapping Include Files

import sys

def mapIncludeFile(inc, srcdst, files_map):
  """Determine the remapped include file path."""
  # First, obtain a path to the include file
  # relative to the original source root
  if os.path.dirname(inc):
    pass # Assumption 1) - "inc" is our
         # relative path
  else:
    # Assumption 2) The file must be located
    # in the same directory as the source file
    # which includes it.
    inc = os.path.join(
               os.path.dirname(srcdst[0]), inc)
  # Look up the new home for the file
  try:
    mapped_inc = files_map[inc]
    if (os.path.dirname(mapped_inc) ==
        os.path.dirname(srcdst[1])):
      mapped_inc = os.path.basename(mapped_inc)
  except KeyError:
    print 'Failed to locate [%s] (included ' \
          'by [%s]) ' \
          'relative to source root.' % (
          include, srcdst[0])
    sys.exit(1)
  return mapped_inc

The function mapIncludeFile is actually quite simple, though only because of an assumption I have made about the way include paths are used in this source tree. The assumption is:

All #include directives give a path name relative to the root of the source tree, except when the included file is present in the same directory as the source file - in which case the file can be included directly by its basename. Furthermore, there are no source files at the top-level of the source tree (there are only directories at this level).

For source trees with more complex include paths, and correspondingly more subtle #include directives, this function will need fairly heavyweight adaptation. (Alternatively, run another script to simplify your include paths first.)

If this assumption holds, we can easily determine the original path to the included files, then use our files_map dictionary to look up the new path. If the assumption doesn't hold, then the dictionary look up will fail, raising a KeyError exception. The exception is caught, a diagnostic printed, then the script exits with status 1.

We could test whether "mapped_inc" is a key in our dictionary before attempting to use it; and if it were absent, we could simply print an error and continue. However, we choose to view such an absence as exceptional since it undermines the assumptions made by the script. We do not wish to risk moving thousands of files without being sure of what we're doing.

We could test whether "mapped_inc" is a key in our dictionary before attempting to use it; and if it were absent, we could simply print an error and continue. However, we choose to view such an absence as exceptional since it undermines the assumptions made by the script. We do not wish to risk moving thousands of files without being sure of what we're doing.

Finishing Touches

The final script appears at the end of this article.

The Windows problem I mentioned earlier is caused by the platform specific directory path separator. Unix uses forward slashes, Windows backslashes. My chosen solution is to work with "normalised" paths internally until we actually write out the include files, when we make sure forward slashes are used as separators.

As a gesture towards user-friendliness I have added an indication of progress and an output log. However, this script remains very much for software developers who understand what it's doing. I have chosen "sys.stdout.write" in preference to the "print" statement used during the script's development, since it gives greater control over output format.

I have not done anything special with Makefiles, project files, Jamfiles - or whatever else you use with your build system. There will be an order of magnitude fewer of these to deal with (unless you have a very strange build system), but the same techniques apply.

The script as it stands has several weaknesses. It is, of course, suited to doing a very specific job: solving the exact problem laid out at the start of this article, right down to the specified directory mapping. Whilst it would be overkill to provide a GUI allowing users to enter this mapping, this input data could usefully be separated from the body of the script. Similarly, as already mentioned, the script makes some big assumptions about way include paths are used in this particular system. Finally, there are no unit tests - the only testing has been a rather ad hoc probing of functions during the script's development.

Concluding Thoughts

This sort of bulk re-arrangement of files is well suited to a scripted solution for reasons of:

Reliability:: The script can be shown to work by unit tests and by system tests on small data sets. Then it can be left to do its job.
Efficiency:: Editing dozens - perhaps hundreds - of files by hand is error prone and tedious. What's worse, unless some moratorium on checkins has been imposed during the restructure, the new structure may be out of date before it is ready. A script can process megabytes of source in minutes. Alternatively, it will happy to run at night after even the most nocturnal of programmers has logged out.
Recordability:: The script becomes part of the source tree (perhaps in the tools/scripts directory), in which place it can record accurately and repeatably the tasks it performs.
Reusability:: A well-written script can be amended and enhanced to solve future source re-organisations. And even if it can't be re-used, a knowledge of scripting can.

Final Script

import os
import re
import shutil
import sys

def mapDirectory(dirmap, dname):
  """Return new location of the input directory."""
  # Successively reduce the directory path until it
  # matches one of the keys in the directory map.
  mapped_dir = p = dname
  while p and not p in dirmap:
    p = os.path.dirname(p)
  if p:
    mapped_dir = os.path.join(dirmap[p], dname[len(p) + 1:])
  return mapped_dir

def mapFiles(logfp, dirmap, dname, files):
  """Return a dictionary mapping files in dir to their new locations."""
  dname = os.path.normpath(dname)
  new_dir = mapDirectory(dirmap, dname)
  logfp.write("mapDirectory [%s] -> [%s]\n" % (dname, new_dir))
  fm = {}
  for f in files:
    src = os.path.join(dname, f)
    dst = os.path.join(new_dir, f)
    logfp.write("\t[%s] -> [%s]\n" % (src, dst))
    fm[src] = dst
  return fm

def isSourceFile(file):
  """Return True if the input file is a C/C++ source file."""
  src_re = re.compile(r'\.(c|h)(pp)?$', re.IGNORECASE)
  return src_re.search(file) is not None

def swapSlashes(str):
  """Return the input string with backslashes swapped to forward slashes."""
  back_to_fwd_re = re.compile(r'\\')
  return back_to_fwd_re.sub('/', str)

def mapIncludeFile(logfp, inc, srcdst, files_map):
  """Determine the remapped include file path."""
  # First, obtain a path to the include file
  # relative to the original source root
  if os.path.dirname(inc):
    pass # Assumption 1) - "inc" is our relative path
  else:
    # Assumption 2) The file must be located in the
    # same directory as the source file which
    # includes it.
    inc = os.path.join(os.path.dirname(srcdst[0]), inc)
    inc = os.path.normpath(inc)
  # Look up the new home for the file
  try:
    mapped_inc = files_map[inc]
  except KeyError:
    err_msg= ('\nFatal error: Failed to locate [%s] '
              '(included by [%s]) '
              'relative to source root.' %
              (inc, srcdst[0]))
    logfp.write(err_msg)
    sys.stderr.write(err_msg)
    sys.exit(1)
  if (os.path.dirname(mapped_inc) == os.path.dirname(srcdst[1])):
    mapped_inc = os.path.basename(mapped_inc)
  return mapped_inc

def processSourceLine(logfp, line, srcdst, files_map):
  """Process a line from a source file, correcting included file paths."""
  include_re = re.compile(r'^\s*#\s*include\s*"'
                          r'(?P<inc_file>\S+)'
                          '"')
  match = include_re.match(line)
  if match:
    logfp.write(' [%s] -> ' % line.rstrip())
    mapped_inc_file = mapIncludeFile(logfp,
                                     match.group('inc_file'),
                                     srcdst,
                                     files_map
                                    )
    line = line[:match.start('inc_file')] + mapped_inc_file + line[match.end('inc_file'):]
    line = swapSlashes(line)
    logfp.write('[%s]\n' % line.rstrip())
  return line

def relocateSource(logfp, srcdst, dstfile, files_map):
  """Relocate a source file, correcting paths to included files."""
  infp = open(srcdst[0], 'r')
  outfp = open(dstfile, 'w')
  logfp.write('Relocating source file [%s] -> [%s]\n' % (srcdst[0], dstfile))
  for line in infp:
    outfp.write(processSourceLine(logfp, line, srcdst, files_map))
  infp.close()
  outfp.close()

def relocate(logfp, srcdst, dst_root, files_map):
  """Relocate a file, correcting internal file references."""
  dst_file = os.path.join(dst_root, srcdst[1])
  dst_dir = os.path.dirname(dst_file)
  if not os.path.isdir(dst_dir):
    os.makedirs(dst_dir)
  if isSourceFile(dst_file):
    relocateSource(logfp, srcdst, dst_file, files_map)
  else:
    logfp.write('Copying [%s] -> [%s]\n' % (srcdst[0], dst_file))
    shutil.copyfile(srcdst[0], dst_file)

def percent(num, denom):
  """Return num / denom expressed as an integer percentage."""
  return int(num * 100 / denom)

def printSettings(fp, dmap, src, dst):
  """Output script settings to the input file."""
  fp.write('Relocating source tree from [%s] ' 
           'to [%s]\n' %
           (src, dst))
  fp.write('Relocating directories:\n')
  for d in dirmap:
    fp.write(' [%s] -> [%s]\n' % (d, dmap[d]))
  fp.write('\n')

# Main processing starts here...
# First, set up script data:
# - a dictionary mapping existing dirs to
# their new locations,
# - the source and destination roots for
# the source tree,
np = os.path.normpath
dirmap = {
  np('png') : np('graphics/thirdparty/png'),
  np('jpeg') : np('graphics/thirdparty/jpeg'),
  np('bitmap') : np('graphics/common/bitmap'),
  np('UserIF') : np('ui'),
  np('UserIF/Wgts') : np('ui/widgets'),
  np('os') : np('platform/os'),
  np('os/hpux') : np('platform/os/hpux10')
}

from_root = '.'
to_root = '../relocate'

# Further initialisation.
logfp = open('relocate.log', 'w')
printSettings(logfp, dirmap, from_root, to_root)
printSettings(sys.stdout, dirmap, from_root, to_root)

# Initialise a dictionary to map current file path
# to new file path.
files_map = {}

# Fill the dictionary by remapping all files beneath
# the current working directory.
sys.stdout.write('Preprocessing files. '
                 'Please wait.\n')

for (subdir, dirs, files) in os.walk(from_root):
  files_map.update(
    mapFiles(logfp, dirmap, subdir, files)
  )

# Create the new root directory for relocated files
os.makedirs(to_root)

# Now actually perform the relocation
count = len(files_map)
item = 0
sys.stdout.write('Relocating [%d] files.\n'
                 'Logfile at [%s]\n'
                 'Progress [%02d%%]' %
                 (count, logfp.name, percent(item, count))
                 )

for srcdst in files_map.items():
  item += 1
  sys.stdout.write('\b\b\b\b%02d%%]' % percent(item, count))
  relocate(logfp, srcdst, to_root, files_map)

logfp.close()
sys.stdout.write('\nRelocation completed '
                 'successfully.\n')

I used the PythonWin IDE (available with the ActivePython distribution) while developing this script, and can recommend it to Windows users. The context sensitive help is great, the debugger is good, and the key-bindings - incredibly - include my favourites from both GNU emacs and Microsoft Visual Studio. ActivePython: http://www.activestate.com/Products/ActivePython/
Thanks to Dan Tallis for reviewing an earlier draft of this article, and for inspiring me to learn Python.

[python] http://www.python.org - the official website for the Python language.

Notes:

More fields may be available via dynamicdata ..