Journal Articles

CVu Journal Vol 16, #3 - Jun 2004 + Programming Topics

Title: Functional Programming in Python: An Introduction by Example

Author: Administrator

Date: Thu, 03 June 2004 13:16:05 +01:00

Summary: 

Body: 

In a previous article [Guest] I described the development of a simple Python script to relocate source files, demonstrating how Python's built-in objects and modules can make light work of such tasks. The final script bore a superficial resemblance to a C program: the code was structured into blocks, with a main routine calling subroutines to get the job done; objects were scoped; and control flow was handled by familiar constructs - loops, if statements and return statements.

Python also allows the creation of more C++-like scripts, complete with classes, encapsulation, inheritance, polymorphism, exceptions and so on. In this article, however, I want to show how Python copes cleanly in a situation where C++ struggles, namely functional programming.

This article, then, builds on the previous article by introducing some of Python's functional programming capabilities by means of example. It also shows some more advanced uses of Python's regular expression module.

Statement of the Problem

Let's suppose we want to convert some articles from plain text format into web pages. Since these articles are about programming, they embed fragments of C/C++ source code, and since the presentation of source code matters, we want these fragments to have their syntax highlighted according to the following rules:

builtin types   -> blue
keywords        -> bold, blue
comments        -> italics, red
string literals -> green

I should confess at this point - as will soon become obvious - that I'm an html novice, and am aiming for the very simplest of markup to achieve the scheme presented above: this article is about Python, not web page development. Nonetheless, the finished script could be adapted to produce output suitable for more sophisticated web pages.

Towards a Solution

Full tokenization of C/C++ is beyond the scope of this script. Even hooking in to some third-party parser would be overkill. Fortunately, we can get the job done using a simple pattern matching algorithm.

In outline this algorithm:

  • creates a pattern to match regions of interest in the code,

  • processes the code, marking up these matches.

To get us started, the following script picks out some C++ keywords associated with control flow and emphasises them using bold text:

<python>
import re
html_data = re.sub(r'(if|else|for|do|while)',
                   r'<b>\1</b>', src_data)
</python>

Notice here the \1, which backreferences the first (and in this case, only) group matched by the input pattern. The r before the string literal often appears in Python regular expressions. It stands for raw string and ensures that backslashes are not handled in any special way by Python - the string literals are passed directly on to the regular expression module.
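The backreference can be checked at the interactive prompt. A minimal sketch, using a made-up fragment as the source data:

```python
import re

# a hypothetical input fragment
src_data = 'if (x) y(); else z();'

# \1 substitutes whatever the first (and only) group matched
html_data = re.sub(r'(if|else|for|do|while)',
                   r'<b>\1</b>', src_data)
print(html_data)
```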

Patching up the Problems

The simple script above is broken. The code fragment:

<cpp>
  cutlery = fork + knife; // do we need spoons?
</cpp>

would be marked up:

<html>
cutlery = <b>for</b>k + kn<b>if</b>e;
               // <b>do</b> we need spoons?
</html>

There are three unwanted substitutions! Clearly, we should only match keywords which are complete words, and keywords cease to be keywords when they're in a comment. Both problems can be fixed by using a more complex pattern:

<python>
def markup(match):
  """Return an html-marked-up version of the
     input match object"""
  html = ''
  if (match.group(1)):
    html = bold(match.group(1))
  elif (match.group(2)):
    html = italics(match.group(2))
  return html

cpp_re = re.compile(
  r'(\bif\b|\belse\b|\bfor\b|\bdo\b|\bwhile\b)'
  r'|'
  r'(//.*?$|/\*.*?\*/)',
  re.DOTALL |
  re.MULTILINE)

html_data = cpp_re.sub(markup, src_data)
</python>

Some notes on the regex pattern:

  • \b matches (the empty string at) the beginning or end of a word,

  • the flag parameter to re.compile, re.DOTALL | re.MULTILINE, enables us to match C-style multi-line comments (refer to the Python documentation [python] for details),

  • asterisks - such as the ones which appear in C-style comments - are regex special characters, so they need to be escaped for a literal match,

  • .*? is a non-greedy match of any character. Non-greedy means that the regex parser doesn't consume all the text it can to match a part of a pattern before backtracking to match the next part of the pattern. So, for example, //.*?$ matches from the start of a C++-style comment up to the next end-of-line character, but //.*$ would match from this start point up to the final end-of-line in the data.
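The greedy/non-greedy distinction is easy to demonstrate. With both flags set, the non-greedy pattern finds each comment separately, while the greedy one swallows everything up to the end of the data:

```python
import re

text = '// first\nx = 1; // second\n'
flags = re.MULTILINE | re.DOTALL

# non-greedy: each match stops at the next end-of-line
assert re.findall(r'//.*?$', text, flags) == ['// first', '// second']

# greedy: with DOTALL, a single match runs on to the final end-of-line
assert re.findall(r'//.*$', text, flags) == \
    ['// first\nx = 1; // second\n']
```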

And some notes on our more advanced use of the re module:

  • parentheses have been used to capture matching text into groups,

  • the first argument to <code>re.sub</code> is now a function which accepts a MatchObject. This function applies markup according to the group which has been matched.

Finally, it's probably worth saying something about the triple-quoted string literal which appears as the first line of the function. This is the function's documentation string, or docstring. These strings are, by convention, triple-quoted, since they often span more than one line. Refer to the Python documentation, and in particular "PEP: 257 Docstring Conventions", for details.
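Docstrings are more than comments: they are attached to the function object itself and can be inspected at runtime - this is what help() displays:

```python
def markup(match):
  """Return an html-marked-up version of the
     input match object"""

# the docstring is available as an attribute of the function
assert markup.__doc__.startswith('Return an html-marked-up')
```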

Creating Patterns Using List Comprehensions

I'm not entirely happy with the revised script above. I don't like the repetition of the regex special character, \b, and I don't like the way the C++ keywords have been obscured by the presence of these characters. This situation is only going to get worse when we extend the pattern to include all the other C++ keywords.

We could build the pattern up using an explicit loop and the string join method.

<python>
keywords = ('if', 'else', 'for', 'do', 'while')
kw_patterns = []
for kw in keywords:
  kw_patterns.append(r'\b%s\b' % kw)
kw_pattern = '|'.join(kw_patterns)
</python>

Alternatively, we can use a list comprehension instead of the loop:

<python>
kw_pattern = '|'.join([r'\b%s\b' % kw
                        for kw in keywords])
</python>

This alternative is the Python idiom for building lists from sequences.
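A quick check confirms that the loop and the list comprehension build exactly the same pattern:

```python
keywords = ('if', 'else', 'for', 'do', 'while')

# explicit loop ...
kw_patterns = []
for kw in keywords:
  kw_patterns.append(r'\b%s\b' % kw)
loop_pattern = '|'.join(kw_patterns)

# ... and list comprehension
comp_pattern = '|'.join([r'\b%s\b' % kw for kw in keywords])

assert loop_pattern == comp_pattern == \
    r'\bif\b|\belse\b|\bfor\b|\bdo\b|\bwhile\b'
```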

There are downsides to using this idiom. What if a C/C++ programmer tries to understand/modify this script? What if I don't use Python for a while (day to day, I use C++) - will I remember what the syntax means? (Think of all those ultra-succinct Perl programs, which default pretty much everything to apparently conjure their results from thin air.)

In this case, the idiom is well worth adopting. List comprehensions are too useful to be ignored.

Lambda Functions

There's another thing I'm not happy about: the separation of the patterns being matched and the markup applied to the matches. Recalling our original statement of the problem:

builtin types   -> blue
keywords        -> bold, blue
comments        -> italics, red
string literals -> green

I'd like the script to somehow embody this mapping.

The right arrows suggest using a dictionary to map from lexical elements to markup functions. Rather than use a dictionary, we'll choose a sequence of 2-tuples. This is because the elements of a dictionary are not ordered (perhaps dictionary was not the best term for Python to adopt for such a collection!) and we want to control the order in which pattern matches are tested. For example, double will appear as both a builtin type and as a keyword, and we can indicate we want it to match as a builtin by placing builtins before keywords in our markup rules.
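The order-dependence can be seen with a tiny regex experiment (a sketch; the real patterns come later). With the builtins group placed first in the alternation, the overlapping word double matches as a builtin:

```python
import re

cpp_re = re.compile(r'(\bdouble\b|\bfloat\b)'  # builtins first
                    r'|'
                    r'(\bif\b|\bdouble\b)')    # then keywords

m = cpp_re.search('double x;')
assert m.lastindex == 1  # matched by the builtins group
```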

So, we want something like:

<python>
def italics(str):
  return '<i>' + str + '</i>'
def bold(str):
  return '<b>' + str + '</b>'
def colour(str, col):
  return ('<font color="' + col + '">'
          + str + '</font>')
def boldBlue(str):
  return bold(colour(str, 'blue'))
# ...similarly define blue, italicsRed etc ...
markup_rules = (
  (builtin_pattern, blue),
  (kw_pattern, boldBlue),
  (comment_pattern, italicsRed),
  (string_pattern, green)
)
</python>

Here, the html markup helper functions - italics, colour, bold - are good general purpose utility functions. The markup utilities - blue, boldBlue, italicsRed, green - do not merit existence as named functions: they are simply the result of composing our utilities and binding arguments to values. This is where Python's lambda functions can help:

<python>
markup_rules = (
  (comment_pattern,
   lambda s: colour(italics(s), 'red')),
  (builtin_pattern,
   lambda s: colour(s, 'blue')),
  (keyword_pattern,
   lambda s: colour(bold(s), 'blue')),
  (string_pattern,
   lambda s: colour(s, 'green'))
)
</python>

A lambda function is an anonymous function. Nonetheless, these anonymous functions can be passed around as parameters - we'll be passing them into our regex text substitution function, for example, and we'll use another lambda function to do this.
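Passing an anonymous function as a parameter looks like this - here one lambda is bound to a name for reuse, and another is passed straight into re.sub:

```python
import re

bold = lambda s: '<b>' + s + '</b>'

# an anonymous function passed directly as the replacement
html = re.sub(r'\bif\b',
              lambda m: bold(m.group(0)),
              'if (x) motif;')

# note that the 'if' buried inside 'motif' is left alone
assert html == '<b>if</b> (x) motif;'
```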

The Final Script

General Points

The final script appears at the end of this section. Most of its interesting points have already been discussed. Note, though:

  • The lambda function used for pattern substitution, which binds its second argument to our markup rules, enabling us to pass our markup rules into the markup function; this works around the fact that a function passed into <code>re.sub</code> must only accept a single match object argument.

  • The use of the built-in filter function, applied to the groups present in our match object. This function accepts a predicate function and a list, and returns a filtered list containing only those elements for which the predicate is true. If the predicate is None, the returned list has its false elements - zeroes, empty strings, Nones - removed.

  • The string literal match pattern, which introduces a negative lookbehind assertion, (?<!\\). The pattern itself ".*?(?<!\\)" will match an opening double quote followed by a non-greedy sequence of any characters followed by a closing double quote, provided that the closing double quote is not preceded by a backslash - is not escaped in C terminology. This is necessary in case we encounter a string literal of the form:

    <cpp>
      char const * what_he_said = "He said \"Hi!\"";
    </cpp>
    
  • The cookForHtml function. The term "cook", in this context, comes from [Freidl]. Python's cgi module provides the required function directly.
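The negative lookbehind can be exercised directly. In this sketch the escaped quotes inside the literal are skipped, and the match runs on to the real closing quote:

```python
import re

string_pattern = r'".*?(?<!\\)"'

src = r'char const * what_he_said = "He said \"Hi!\"";'
m = re.search(string_pattern, src)

# the escaped quotes do not terminate the match
assert m.group(0) == r'"He said \"Hi!\""'
```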

<python>
import re

cpp_keywords = (
  'and',
  'and_eq',
  'asm',
  'auto',
  'bitand',
  'bitor',
  'bool',
  'break',
  'case',
  'catch',
  'char',
  'class',
  'compl',
  'const',
  'const_cast',
  'continue',
  'default',
  'delete',
  'do',
  'double',
  'dynamic_cast',
  'else',
  'enum',
  'explicit',
  'export',
  'extern',
  'false',
  'float',
  'for',
  'friend',
  'goto',
  'if',
  'inline',
  'int',
  'long',
  'mutable',
  'namespace',
  'new',
  'not',
  'not_eq',
  'operator',
  'or',
  'or_eq',
  'private',
  'protected',
  'public',
  'register',
  'reinterpret_cast',
  'return',
  'short',
  'sizeof',
  'static',
  'static_cast',
  'struct',
  'switch',
  'template',
  'this',
  'throw',
  'true',
  'try',
  'typedef',
  'typeid',
  'typename',
  'union',
  'using',
  'virtual',
  'void',
  'volatile',
  'wchar_t',
  'while',
  'xor',
  'xor_eq'
)

cpp_builtins = (
  'bool',
  'signed',
  'unsigned',
  'char',
  'short',
  'long',
  'float',
  'double',
  'wchar_t'
)

def preformat(str):
  """Return a preformatted version of the
     string."""    
  return '<pre>' + str + '</pre>'

def italics(str):
  """Return an italicised version of the
     string."""
  return '<i>' + str + '</i>'

def bold(str):
  """Return a bold version of the
     string."""
  return '<b>' + str + '</b>'

def colour(str, colour):
  """Return a coloured version of the
     string."""
  return ('<font color="' + colour + '">'
          + str +
          '</font>')

def orPatterns(patterns):
  """Return a pattern which matches any one of
     the input patterns."""
  return '|'.join(patterns)

def cookForHtml(src):
  """Return the input data cooked for html."""
  import cgi
  return cgi.escape(src)

def markup(match, rules):
  """Markup the text matched by the input
     match object."""
  # Sanity check: the rules match the groups
  hits = filter(None, match.groups())
  assert(len(hits) == 1)
  assert(len(match.groups()) == len(rules))

  ix = match.lastindex # Last and only index,
                       # in fact.

  # Careful! - rules are indexed from 0 but
  # matchObject groups are indexed from 1.
  markup_fn = rules[ix - 1][1]
  return markup_fn(match.group(ix))

def cpp2html(cpp_src):
  """Return C++ source code marked up using
     html."""
  comment_pattern = orPatterns([
    r'//.*?$',   # C++ style comment
    r'/\*.*?\*/' # C style comment
  ])

  builtin_pattern = orPatterns([
    r'\b%s\b' % bt
    for bt in cpp_builtins
  ])

  keyword_pattern = orPatterns([
    r'\b%s\b' % kw
    for kw in cpp_keywords
  ])
        
  string_pattern = r'".*?(?<!\\)"'
    
  markup_rules = (
    (comment_pattern,
     lambda s: colour(italics(s), 'red')),
    # Need builtins before keywords - there's
    # an overlap, since some keywords are
    # also designated as builtins.
    (builtin_pattern,
     lambda s: colour(s, 'blue')),
    (keyword_pattern,
     lambda s: colour(bold(s), 'blue')),
    (string_pattern,
     lambda s: colour(s, 'green'))
  )

  # Create a regex group for each
  # pattern in the markup rules, and
  # combine these groups into a single
  # pattern.
  cpp_pattern = orPatterns(
    ['(%s)' % p for p, f in markup_rules]
  )

  cpp_re = re.compile(
    cpp_pattern,
    re.DOTALL |
    re.MULTILINE # C-style comments can
                 # span multiple lines
  )
    
  cpp_src = cookForHtml(cpp_src)
    
  cpp_src = cpp_re.sub(
    lambda m: markup(m, markup_rules),
    cpp_src)

  return preformat(cpp_src)


def python2html(srcdata):
  """Return Python source code marked up
     using html."""
  import keyword
  import token
  import tokenize
    
  lines = srcdata.split('\n')    
  def popLine():
    line = ''
    if len(lines) > 0:
      line = lines.pop(0) + '\n'
    return line

  marked_up = ''
  row, col = 0, 0 # Location of the end of
                  # the previous token
    
  for tok in tokenize.generate_tokens(popLine):
    # The tokenizer skips whitespace.
    # We must put it back.        
    srow, scol = tok[2]
    if (srow > row):
      col = 0
    if (scol >= col):
      marked_up += ' ' * (scol - col)
        
    tok_name = token.tok_name[tok[0]]
    str = cookForHtml(tok[1])

    if (tok_name == 'STRING'):
      marked_up += colour(str, 'green')
    elif (tok_name == 'COMMENT'):
      marked_up += colour(italics(str), 'red')
    elif (tok_name == 'NAME' and
          keyword.iskeyword(tok[1])):
      marked_up += colour(bold(str), 'blue')
    else:
      marked_up += str
                              
    row, col = tok[3]

  return preformat(marked_up)

</python>
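In miniature, the pattern/rule dispatch at the heart of cpp2html works like this - a cut-down, self-contained sketch with just two rules, not the script itself:

```python
import re

# one (pattern, markup function) pair per lexical element
rules = (
  (r'//.*?$',           lambda s: '<i>' + s + '</i>'),
  (r'\bif\b|\bwhile\b', lambda s: '<b>' + s + '</b>'),
)

# wrap each pattern in a group and combine into one alternation
cpp_re = re.compile('|'.join('(%s)' % p for p, f in rules),
                    re.MULTILINE)

def markup(m):
  # filter(None, ...) drops the unmatched (None) groups:
  # exactly one group matches per substitution
  hits = list(filter(None, m.groups()))
  assert len(hits) == 1
  # careful! - rules are indexed from 0, groups from 1
  return rules[m.lastindex - 1][1](m.group(m.lastindex))

html = cpp_re.sub(markup, 'if (x) y(); // done')
assert html == '<b>if</b> (x) y(); <i>// done</i>'
```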

Marking Up Python Code

I think it's clear that this script could be extended to handle other highlighting schemes or source data of other types. It's just a matter of defining the markup rules and patterns.

The thought of creating a pattern to match Python's various flavours of string literal is rather daunting. Fortunately, the language itself provides a tokenize module which does the job more easily and more accurately. The final script also provides a Python to html conversion function python2html based on the generate_tokens function available in this module.

This function embeds a sub-function, popLine(), which is used as the callable object required as a parameter to generate_tokens. This function can directly access the list of lines in the enclosing function's scope. An alternative would have been to create a functor object initialised with a reference to (or a copy of) the required list. In this case, I prefer to keep related code as close together as possible, and choose the subfunction.
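In current Python, generate_tokens accepts any readline-style callable, so an io.StringIO can stand in for a hand-rolled popLine. A minimal sketch of the tokenize-based approach:

```python
import io
import keyword
import token
import tokenize

src = 'def f(x):\n    return x  # comment\n'

# io.StringIO supplies the readline callable generate_tokens needs
toks = tokenize.generate_tokens(io.StringIO(src).readline)

# pick out the keywords, as python2html does before marking them up
kws = [t.string for t in toks
       if t.type == token.NAME and keyword.iskeyword(t.string)]
assert kws == ['def', 'return']
```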

Weaknesses

The script, although powerful, is far from perfect. It bundles data sets together with a number of unrelated utility functions to get the job done. Properly, it should be partitioned into modules: perhaps one for the C++ data tuples, one for the pattern creation utilities, one for the general purpose html markup utilities and one for our main functions. This partitioning is not hard to achieve, but will have to remain a subject for a future article.

To provide a more flexible markup tool, the functions python2html and cpp2html should provide a mechanism for clients to supply their own markup rules. The obvious solution here would be to allow these rules to be supplied as optional function parameters.

Finally, despite Python's support for unit testing, this script contains no unit tests.

Conclusions

Python's support for lambda functions and list comprehensions has allowed us to create a deceptively simple script.

The script by no means demonstrates all of Python's functional programming capabilities. A good reference book ([Beazley], for example) describes these in more depth.

Credits

Thanks again to Dan Tallis for his expert review of an earlier draft of this article.

References

[Guest] Thomas Guest, "A Python Script to Relocate Source Trees", C Vu 16.2.

[Freidl] Jeffrey E. F. Friedl, Mastering Regular Expressions. The definitive reference on regular expressions, though note that it focuses mainly on Perl's regex support.

[Beazley] David Beazley, Python Essential Reference.
