Title: Stream-Based Parsing in C++

Author:

Date: 02 August 2003 22:56:32 +01:00 or Sat, 02 August 2003 22:56:32 +01:00

Summary:

Body:

This paper shows how to implement general parsers as a family of streams. This allows for very readable, maintainable and flexible parsers. The method is illustrated with a parser for simple arithmetic expressions, but can easily be extended to a parser for a full-fledged programming language. Moreover, the same technique can be applied to the entire process from lexing to execution, since actions can be associated with each sub-parser.

Introduction

The parsing of input is a very important problem appearing in many different parts of software development - parsing user input in the form of command-line options, the parsing of arithmetic expressions in a calculator, parsing values in a user-defined configuration file or compiling some programming language.

This makes it important to have different approaches. What we will present here in this paper, is a method of parsing inspired by what is done in functional programming (FP). The paper is not about functional programming in C++Moreover, the presence of too many parentheses makes the code harder to read and is hence also more error prone.^[1], rather it is about how to implement a particularly elegant idea from FP in an object oriented context.

The entire process, from lexical analysis to actual execution of a program can be divided into a number of individual steps:

source -> lexer -> parser -> optimiser
       -> execution -> output

In FP, this could be represented by a family of functions:

output = execution ^o optimiser ^o parser
         ^o lexer(source)

The advantage of this approach is flexibility: it is easy to omit or modify individual steps, which is important for playing around with different approaches in language design or compiler construction.

One could attempt to implement the different sub-parsers as functions or (better) function objects in C++. The disadvantage of doing so, however, is the proliferation of parentheses this would entail. In C++ there is no operator for function (or functor) composition corresponding to ^o above (which may be called many different things in a functional language - e.g. a period or simply the letter 'o'). Nor can such an operator be defined in a natural way - none of the overloadable operators are well suited for becoming composition operators^[2].

Moreover, the presence of too many parentheses makes the code harder to read and is hence also more error prone.

One can consider a function as a stream, however. Instead of writing x=f(y) one could try to write x << f << y. Here, we do have a natural operator for composition, namely the stream operator <<. This suggests considering the entire process as a collection of streams:

output << execution << optimiser
       << parser << lexer << source

This won't work, however, since << is left-associative. Hence, either one needs to introduce parentheses once more, e.g., write x << (f << y) and so on, or one will need to use another, right-associative operator. The former case will automatically suffer from the abovementioned problems with using function objects.

Consequently, we will have to use a right-associative operator instead. Unfortunately, C++ does not allow one to declare an operator to have a user-defined associativity, so instead we will have to replace << by one of the right-associative operators defined by C++. There are very, very few of these. In fact, the only binary rightassociative operators are the assignment operators +=, *=,.... Thus, the simplest possible change is to use operator<<=. This also has the added advantage of resembling an arrow, more showing the direction of the flow of data.

Now, even though strictly speaking the sub-parsers won't be stream objects, we will still refer to them as such by analogy to ordinary streams (I/O stream, file streams, string streams) present in C++. The reason being that the defining feature of a stream is its ability to process input and to be pipelined, which is precisely what our generalised stream will do.

In this paper, we will concentrate on the parsing step, but in such a way that the remaining steps of the process could be implemented in a similar fashion. In fact, we will see how to make a simple modification to our parser and turn it into an expression calculator. For concreteness, we will consider a particular example, which is simple enough not to introduce unnecessary complications yet complex enough to be non-trivial. The chosen example is the parsing of arithmetic expressions like 1+(2-3)*4. We will only consider integers^[3]. Furthermore, we will not worry about the precedence levels of the standard arithmetic operators - this could be done by slightly modifying the grammar as shown in for instance, [Martin]. It will be shown later how to accommodate this with very few changes to our framework.

Hence, the basic ingredients in our language are:

<digit>    ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 |
               7 | 8 | 9
<operator> ::= + | - | * | /
<lpar>     ::= (
<rpar>     ::= )

To this we add the definition of a number as a sequence of one or more digits:

<num> ::= <digit>+

Now, a simple expression is either a number on its own or it is two numbers separated by an operator:

<simexp> ::= <num> | <num> <operator> <num>

Such an expression is the smallest string composed from the symbols which makes sense in this simple language. A general expression can then be built like this:

<exp> ::= <simexp> | <lpar> <exp> <rpar>
                   | <exp> <operator> <exp>

Actually, the simple expression is redundant and could be omitted by replacing the first option in the production for <exp>. In any case, the set of rules above constitute the entire grammar.

The basic idea in the FP-style of parsing is to split-up the parser into a family of sub-parsers each corresponding to a term in the grammar, with special combinators corresponding to ways of combining these sub-parsers (there will be one for combining them - often just standard function composition in an FP-language - and one for representing options in the BNF grammar), see [Hutton, Fokker, Peyton-Jones-]. For a way to implement part of this in another multi-paradigm language, Perl, see [Antonsen].

In the functional language Miranda, for instance, one could write a parser for this simple language like this (following [Hutton] and ignoring problems with left-recursive grammars for sake of illustration):

exp = (exp $then operator $xthen exp) $alt
      (literal '(' $then exp $xthen
      literal ')') $alt simexp

Here, exp, then, xthen, alt, literal and simexp are all sub-parsers (the prefix '$' turns any function into an infix operator in Miranda, and brackets around arguments are optional in Miranda as well as in Haskell and ML). The sub-parser operator simply parses operators, while the sub-parser literal parses the literal string given as the argument, and alt is a sub-parser representing alternatives as given by the symbol '|' in the BNF grammar. We won't be able to reach the same level of conciseness in this C++ implementation though, but we will try to get as close as possible.

Such parsers may not be as efficient as the ones generated by general high-quality parser-writing tools such as lex and yacc (and their various relatives such as flex and bison from GNU), but they have other advantages:

Readability:: By splitting the parser up into a number of subparsers each corresponding to a term in the BNF grammar, there will be a much closer relationship between the structure of the entire parser and the original BNF grammar.
Maintainability: Since the syntax is fairly straightforward and closely mirrors the corresponding BNF grammar, such parsers are easy to maintain.
Flexibility:: The same splitting-up also implies that productions can be added or omitted very easily. Hence, such parsers are extendable. Furthermore, different versions of the various subparsers can be tried out.

For simplicity, we will assume that the lexing has been done and resulted in an array of single characters. This is of course a rather trivial lexing (which, moreover, is easy to implement) - most lexers would return not a list of characters but a list of tokens. In order to show the power of the technique, however, it is advantageous to consider such trivial lexers. Such a lexer, for instance, could be implemented by simply reading from a file, one character at a time, returning a list of characters read when done.

This paper will be structured as follows: First, the abstract base class is defined. This is a very basic class skeleton, but will form the foundation of all the richer sub-parsers to be defined later. At the same time we define the basic parse tree class and other related datatypes. Secondly, we introduce some simple general utilities (a Boolean function for testing for digits and two list processing functions found in all functional languages). Next, we define our first generic sub-parsers for numbers and operators. The fourth step is to define general parser combinators, allowing us to combine subparsers to generate new types of sub-parsers.

All of these steps are completely general and form a basic parsing library. We then turn to using these general tools to actually parse integer arithmetic expressions. This turns out to be very easy and there is a very, very close relationship between the BNF grammar and the actual implementation of the sub-parsers as promised.

Finally, we discuss various ways to refine the framework.

The basic stream class

We will begin by defining a general parse stream or pstream. The various sub-parsers will then all be derived from this base class. The parser will need to keep track of a list which is passed on between consecutive parses. Even though, we will only consider lists of single characters here, it is worthwhile to work with a more general set-up, namely that of lists of strings.

Our pstreams will have a state, containing the parse tree constructed so far. In order to be able to pipe pstreams together, thereby building more complex parsers from simple sub-parsers, we will also need the pstream to keep track of the remaining tokens.

Hence, we define a new data type:

typedef std::pair<Ptree, List> Presult;

where Ptree is the class defining parse trees and where List is defined by:

typedef std::list<std::string> List;

which will be our basic data structure. Similarly, Ptree is a specialisation of a more general parse tree:

typedef ParseTree<std::string> Ptree;

The ParseTree class is a simple binary tree:

template <class T>
class ParseTree {
private:
  ParseTree *left; // left sub tree
  ParseTree *right; // right sub tree
  T root;
public:
  ... // constructors & destructor
  bool isEmpty() const { return root==T(); }
  bool isLeaf() const
               { return left==0 && right==0; }
  T getRoot() const { return root; }
  ParseTree* getLeft() const { return left; }
  ParseTree* getRight() const
               { return right; }
  void setLeft(ParseTree& lft) { left = &lft; }
  void setRight(ParseTree& rgh)
               { right = &rgh; }
  void setRoot(T rt) { root = rt; }
  void update(T); // used for operators
  void insert(T); // used for numbers
  void insert(ParseTree&);
               // used for parentheses
};

The member functions update and insert are there to allow parsers to manipulate the parse tree. The former adds a node as the root, copying the old tree to the left sub-tree, this is to be used when parsing binary operators. The latter inserts a node at the first empty sub-tree it finds. This is used for parsing either numbers or parentheses.

Using just recursion these methods could be implemented simply like this:

template<class T>
void ParseTree<T>::update(T val) {
  if(isEmpty())
    throw std::logic_error(
           "Syntax error: Missing operand");
  ParseTree* newLeftSubtree
           = new ParseTree(*this);
  root = val;
  left = newLeftSubtree;
  right = 0;
}

template<class T>
void ParseTree<T>::insert(T val) {
  if(isEmpty()) {
    setRoot(val);
    return;
  }
  if(getLeft() == 0) {
    setLeft(*new ParseTree(val));
    return;
  }
  else {
    if(getRight() == 0) {
      setRight(*new ParseTree(val));
      return;
    }
    getRight()->insert(val);
               // use right recursion
  }
}

template<class T>
void ParseTree<T>::insert(ParseTree<T> &pt) {
  if(isEmpty()) {
    setRoot(pt.getRoot());
    setLeft(*pt.getLeft());
    setRight(*pt.getRight());
    return;
  }
  if(getLeft() == 0) {
    setLeft(pt);
    return;
  }
  else {
    if(getRight() == 0) {
      setRight(pt);
      return;
    }
    getRight()->insert(pt);
               // use right recursion
  }
}

The if-statements in the above implementations are the C++ analogue of the argument pattern matching of functional languages^[4], they are needed to control the recursive steps and to ensure termination. The code above can be shortened a little bit and most of the return-statements can be omitted if on instead uses nested if-clauses. While this will make the code a few lines shorter, I think the present coding style has the advantage of clearly showing the reader that the function returns after having handled each special case.

Both methods first handle the case of an empty tree. For insert, the value passed simply becomes the new root, whereas for update an exception is thrown. Next comes the case of a leaf, i.e., a node without children. Here, update inserts the passed value as a new root moving the leaf to the left subtree, while insert merely inserts its value as the left subtree rooted at the original leaf node. If the left subtree is occupied, insert first tries the right one, if this is also non-empty it uses recursion to find the first empty subtree.

The abstract base class, pstream is now:

class pstream {
protected:
  Presult pres; // contents so far
public:
  ... // constructors & virtual destructor
  List const& getList() const
           { return pres.second; } // accessor method
  Presult const& getPres() const
           { return pres; } // accessor method
  virtual Presult operator<<=
           (const List &) = 0;
  virtual Presult operator<<=
           (const Presult&) = 0;
};

Notice that only the accessor methods, getList and getPres, are non-trivial and that the streaming operator, operator<<= is a pure virtual function.

Writing the parser

We will now turn to the question of actually writing the parser by splitting it up into a number of sub-parsers as stated in the introduction.

Utility functions

It is advantageous to define a number of utility functions representing the basic ingredients of the grammar. This can be done by defining a few boolean functions like this:

bool isDigit(std::string c) {
  return c == "0" || c == "1" || c == "2"
         || c == "3" || c == "4" || c == "5"
         || c == "6" || c == "7" || c == "8"
         || c == "9";
}

with a similar function isOperator for testing whether a character is an operator. In general, one should write one such Boolean utility function for each type of symbol in the language. Hence, we define two such functions, isDigit and isOperator. It is, of course, also possible to define a Boolean function testing for brackets but since we only have one type of these it is much easier to simply insert a test c == "(" or c == ")" directly in the code than define special functions isLPar and isRPar respectively.

Of course, a similar result could be obtained by using built-in functions such as isdigit but the advantage of writing them ourselves is the ability to illustrate the general principle.

We will also need a pair of basic list processing functions available in all FP languages. The first is for extracting the head of a list, i.e., its first element. This function is called first, car or hd in various functional languages (Common Lisp, old Lisp, and ML respectively). Here we will settle for the name used in Haskell:

string head(List ls) {
  return ls.front();
}

Similarly, for extracting the tail of a list, i.e., the remaining elements, we will write a wrapper around one of the STL list member functions:

List tail(List ls) {
  ls2.pop_front(); // remove head
  return ls2;
}

The tail function also goes under various names in the FPcommunity, e.g. rest, cdr or tl in Common Lisp, old Lisp and ML respectively. Once more, we have settled for the Haskell name. Note, by the way, that none of these actually change the list passed to them - we could have declared the arguments const, but we would then have to use const_cast<> all the time, whenever we want to call them, and this would be rather tedious.

These are the only two list utility functions we will be needing.

The sub-parser for numbers

The entire code for the sub-parser for numbers is very simple, illustrating the ease of the functional-style parsing. It is simply:

class pNum : public pstream {
public:
  pNum() {};
  pNum(const List & ls)
         { pres = make_pair(Ptree(),ls); };
  pNum(const Presult &pt) { pres = pt; };
  pNum() {};
  inline Presult operator<<=(const List &);
  inline Presult operator<<=(const Presult &);
};

inline Presult pNum::operator<<=(
                           const List &ls) {
  return operator<<=(make_pair(Ptree(),ls));
}

inline Presult pNum::operator<<=(
                           const Presult &pr) {
  List ls = pr.second;
  if(ls.empty())
    return pr; // nothing to parse
  std::string c = head(ls);
  if(!isDigit(c)) {
    return pr; // not a number, do nothing
  }
  else {
    std::string val(c);
    ls = tail(ls);
    while(isDigit(c=head(ls)) && !ls.empty()) {
      val += c; // construct number
      ls = tail(ls);
    }
    Ptree pt(pr.first);
    pt.insert(val);
    return make_pair(pt,ls);
  }
}

This sub-parser illustrates the basic idea of FP-style parsing using streams: To parse a particular expression-type, write a very simple parser (basically just a copy of the abstract base class) and implement the corresponding operator<<= for acting upon a reference to a Presult-object.

The sub-parser for operators

The sub-parser for operators is almost identical, with the only significant change taking place in operator<<=:

inline Presult pOp::operator<<=
                    (const Presult &pt) {
  List ls = pt.second;
  if(ls.empty())
    return pt; // nothing to parse
  std::string c = head(ls);
  if(!isOperator(c))
    return pt;
  Ptree ptt(pt.first);
  ptt.update(c);
  return make_pair(ptt,tail(ls));
}

The main difference between parsing numbers and operators is the way the parse tree is updated. Parsing numbers simply inserts the number at the first available empty place, whereas operators have to identify their operands first and hence use the update method instead. Numbers are typically leaves while operators are roots of subtrees.

Parser combinators

It is convenient to be able to handle combinations of sub-parsers as well, e.g. to indicate that a particular token must always come after another (sequencing) or to indicate a choice between two tokens (alternatives). In the BNF grammar, such combinators are represented by concatenation and by |, respectively.

Combinators are higher-order parsers, taking parsers as arguments and returning new parsers. In typical FP-implementations, in Haskell, say, these would often be represented by currying, i.e., by partial binding of arguments^[5]. In C++, however, it is more convenient to define new objects. As an illustration, we will define the following combinators pAlt, for handling alternatives, pThen, for handling sequencing, and finally pMore for handling one or more occurrences of a token. We will also define some syntactic sugar by overloading the operators ||, && and ++, respectively, to be alternative methods to use these combinators.

The combinators are just parsestreams, although higher-order, hence they are defined by deriving from the abstract base class and by implementing operator<<=. Since they have to be able to deal with all kinds of parsers, they are defined using templates. In a sense, one can think of templates as C++'s analogue of such higher order constructs. One might call them "higher order objects".

For instance, the combinator for alternatives can be written like this:

template <class T, class S>
class pAlt : public pstream {
private:
  T p1;
  S p2; // component pstreams
public:
  ... // constructors & destructor
      // & accessors & operator<<=
};

This class differs from the previous parsestreams in having two pstreams as internal data members (together with the appropriate accessor methods). These are just defined, for simplicity, as general classes of types T and S respectively, relying on compiletime polymorphism to ensure that these can be used as pstreams.

To actually be able to use this combinator, we must implement operator<<=. This can be done as follows:

template<class T, class S>
inline pAlt<T, S> operator||
                (const T& p1, const S& p2) {
  return pAlt<T, S>(p1,p2);
            // alternative way of creating pAlt objects
}
// definition of how pAlt works is given
// by operator<<=
template<class T, class S>
inline Presult pAlt<T, S>::operator<<=
                (const Presult& pr) {
  Presult lp1(p1 <<= pr), lp2(p2 <<= pr);
            // use component parsers
  List l1=lp1.second, l2=lp2.second;
  return (l1.size() <= l2.size())? lp1 : lp2;
            // return best parse
}

where we have also taken the opportunity to add some syntactic sugar by providing p1 || p2 as an alternative syntax for constructing a pAlt-object out of two sub-parsers.

The implementation of operator<<= is a bit naive, but will suffice for the present case. In general, one may need a more complicated criterion for "best parse" than simply returning the parse resulting in the smallest list of remaining tokens, but in our case it will suffice.

Similarly, the code for pThen, the sequencing combinator, is:

template<class T, class S>
class pThen : public pstream {
private:
  T p1;
  S p2; // component parsers
public:
  ... // usual stuff
};

template<class T, class S>
inline pThen<T, S> operator&&
                 (const T& p1, const S& p2) {
  return pThen<T,S>(p1,p2);
    // alternative syntax
}
template<class T, class S>
inline Presult pThen<T, S>::operator<<=
                 (const Presult& pr) {
  return p2 <<= p1 <<= pr;
}

Finally, the combinator for one or more occurrences of a token, which would be expressed by a regular expression in BNF, (token)+, can be written as:

template<class T>
class pMore : public pstream {
private:
  T p; // component parser
public:
  ... // usual stuff
};

template<class T>
inline Presult pMore<T>::operator<<=
                   (const Presult& pr) {
  Presult pr2(p <<= pr);
  List ls = pr.second;
  List ls2 = pr2.second;
  while(!ls2.empty() && (ls2.size()
                 < ls.size())) { // something was parsed
    pr2 = (p <<= pr2); // continue parsing
    ls = ls2;
    ls2 = pr2.second;
  } 
  return pr2;
}
template<class T>
inline pMore<T> operator++(const T pp) {
  return pMore<T>(pp);
}

Here, we have settled for a straightforward "procedural" implementation, one could also have used recursion (as one would almost certainly have done in FP) in the definition of operator<<=.

In a sense, these combinators together with a parser for literal substrings, pLit, which we haven't given but which is almost identical to pNum and pOp earlier, are all we need to parse anything. Parsing numbers, for instance, could symbolically be written as:

pNum = ++(pLit("0") || pLit("1") || pLit("2")
       || pLit("3") || pLit("4") || pLit("5")
       || pLit("6") || pLit("7") || pLit("8")
       || pLit("9"));

which is more or less what one would have written in a functional language such as Haskell or Miranda. C++, however, does not allow one to define new classes from old ones in this manner, hence one would have to put this definition of pNum into the implementation of operator<<=. In the next section we will see that this is precisely what we will do to parse general expressions.

There is one more reason, however, one cannot simply use the above trick to write pNum. Parsing numbers, as well as operators, one needs to perform certain actions (inserting or updating the parse tree). We will later see how to associate actions to parsers, but for now we will have to make do with custom-written parsers pNum and pOp explicitly taking care of the necessary parse tree manipulations.

The sub-parser for bracketed expressions

This is slightly more complicated. It turns out to be advantageous to consider bracketing as a kind of parser combinator. Hence, given a parse stream p we will define a new pstream type just like we did for pAlt and the other combinators. This, by the way, is one way in which our implementation differs from the normal functional approach.

The definition of the bracketing combinator, pBrack, is very similar to the previous combinators:

template<class T>
class pBrack : public pstream {
private:
  T p; // internal sub-parser
  Presult pr2;
public:
  ... // usual stuff
};

As usual, all of the work is done in the overloaded operator<<=:

template<class T>
inline Presult pBrack<T>::operator<<=
                   (const Presult &pr) {
  if( (pr.second).empty() )
    throw logic_error("Syntax error.
                      Closing bracket expected!");
  std::string c = head(pr.second);
  if(c == "(") {
    Presult prr( p <<= make_pair
                   (pr.first, tail(pr.second)) );
    // apply internal parser
    (pr2.first).insert(prr.first);
    // update internal parse tree
    return operator<<=
                  (make_pair(pr.first, prr.second));
  }
  if(c == ")") {
    Ptree pt(pr.first);
    pt.insert(pr2.first);
    // insert parsed subtree
    return make_pair(pt,tail(pr.second));
    //skip closing bracket & terminate recursion
  }
  return pr; // do nothing
}

The only subtlety is in maintaining an internal, temporary parse tree. Upon returning, this internal parse tree will hold the parse tree for the entire bracketed expression, which can then be inserted into the full parse tree. This pstream mimics the way human beings often read parenthesised expressions - one sees an opening bracket and immediately one begins to parse the internal expression until one sees a matching end bracket. If nested brackets are found, one resorts to recursion.

Putting it all together

We now have all the ingredients needed for parsing integer arithmetic expressions. One could put the previous classes and functions into general header files to be used for all parsers, and then proceed to write parsers for the specific language at hand, which is what we will turn to now.

Now, the grammar is recursive and while C++ is certainly capable of handling recursion (in fact, we have used it frequently in this paper), the language is not really capable of handling recursive definitions such as the grammar in its present state. Hence, we will have to introduce an extra layer of indirection. Rewrite the grammar as:

<aexp> ::= <num> | <num> <op> <num>
<sexp> ::= <aexp> | <lpar> <aexp> <rpar>
<exp>  ::= <sexp> | <lpar> <sexp> <rpar>
           | <sexp> <op> <sexp>

The idea is now simply to write three parsers pAExp, pSExp and pExp. These classes are completely trivial, with all the work being done in operator<<= as usual.

The first sub-parser is:

inline Presult pAExp::operator<<=
                     (const Presult &pr) {
  pOp po;
  pNum pn;
  return pn || (pn && po && pn) <<= pr;
}

Notice the simple expression in the last line. It is a simple, faithful reflection of the definition of the corresponding term in the BNF grammar.

The remaining sub-parser, pSExp, and final parser, pExp, are very simple and only their non-trivial operator<<= function will be given.

For pSExp the code is simply:

inline Presult pSExp::operator<<=
                    (const Presult &pr) {
  pAExp pa;
  return pa || pBrack<pAExp>(pa) <<= pr;
}

Which, once more, is a faithful representation of the corresponding definition in the BNF grammar.

Finally, the full parser, parsing general integer arithmetic expressions is simply:

inline Presult pExp::operator<<=
                   (const Presult &pr) {
  pSExp ps;
  pOp po;
  return ps || pBrack<pSExp>(ps)
            || (ps && po && ps) <<= pr;
}

Hence, with the generic sub-parser tools defined, writing a functional style parser using pstreams is an easy task, which makes it well suited for language experiments and for rapid prototyping.

With these definitions in place, to parse a list of characters, ls, using this parser, one simply writes:

pExp pe;
result = pe <<= ls;

where result is of type Presult. For a complete parse, this would contain a parse tree as first component and an empty list as second.

Refinements

The previous sections have shown one way to implement functional-style parsing in C++ using a generalisation of streams together with operator overloading. Incidentally, the heavy usage of operator overloading shows one advantage of C++ over a language like Java that does not support the overloading of operators. Most of the above could also be carried out in Java, but one would then have to use functions instead of operators resulting in harder to read code. Java also lacks C++'s features for generic programming (the templates), although a library providing some of this support is available. Some of the same effects could be mimicked in Java, however, by judicious use of the universal base class Object and frequent casting. Such an approach will not be as easy to read and will, moreover, be more error prone than the one presented here. Although a direct translation of the above C++ code into Java would not be satisfactory, one could instead use Java Beans - in some sense, these are able to model a behaviour similar to that of functions in a functional language.

On the other hand, C++ is not perfect in this respect either, since it lacks the possibility of user defined operators as well as a method to specify the associativity of an operator and whether it is to be applied as infix, postfix or prefix. Such possibilities are often available in FP languages, e.g. in ML and Haskell, and they are also likely to be available in future versions of existing languages such as Perl6.

The necessity of writing the execution from right to left lies in the standard OO-convention that y<<=f is really short for y.operator<<=(f) and hence treats the left hand operand differently from the right hand one. This is a consequence of OOP's dispatching rules. If one wanted to make left-to-right parsing work instead, one would have to define, inside all classes, methods for handling the different parsestreams. For instance, one would have to define List::operator>>=(T &p), where the typename T can be any valid parsestream subclass. This could of course be done with templates much the same way as was done for the combinators, but it would also necessitate the writing of a wrapper class for the List-object in order to be able to extend it this way. A much cleaner solution would be to use multi-methods. However, unlike Lisp's CLOS, these are not directly available in C++. There are ways around this, of course, C++ being after all a very flexible language, [Alexandrescu].

A problem not addressed at all in this paper is the problem of optimisation. Clearly, the parsestreams as developed here are optimised neither for speed of execution nor for space, but rather for speed of definition. By this phrase, it is to be understood that the method is intended for rapid prototyping and for experimenting with language features. Although the program isn't slow, it would certainly be advantageous to have the parsestreams run faster in real-life applications. One simple way of doing this is to pass a tree-iterator around keeping track of where the last insert/update of the parsetree occurred, for instance by pointing to the first empty subtree. This would speed up the insert and update operations.

Handling ambiguity and precedence

For simplicity, we did not consider operator precedence in the example above, i.e., 1+2*3 will be parsed as (1+2)*3 and not 1+(2*3). Of course, this could always be enforced by appropriate use of parentheses, but it would clearly be advantageous to conform closer to the likely expectations of the end-user.

It is well-known that operator precedence can be handled by slightly modifying the grammar (see e.g. [Martin]):

<expr>     ::= <term> + <term> | <term> - <term>
               | <term>
<term>     ::= <factor> * <factor>
               | <factor> / <factor> | <factor>
<factor>   ::= <number> | ( <expr> )

Thus, at the cost of adding new terms to the grammar, one can handle operator precedence in the expected way. It is trivial to change our family of sub-parsers to accommodate this, in particular since the recursive definitions have been removed in the same step.

Our parser stream framework worked well for the simple case of basic arithmetic expressions, but a general grammar is likely to be ambiguous with the parser effectively having to make certain choices at various stages of the parsing. Clearly, it would be of great interest to be able to handle this case too.

The standard way this is dealt with in FP is to replace the Presult data type with another one, [Hutton, Fokker, Peyton-Jones-].

In the face of ambiguity, what we have to deal with is not a single, unique parse, but rather a family of possible parses. Hence, the proper way of dealing with ambiguity is to introduce a new data type, let's call it LPresult (for List of Parse results). This is defined as:

typedef std::list< Presult > LPresult;

All the parsers must then return this data type instead. This, of course, necessitates some modifications. Each sub-parser must now act on all the elements of the list of possible parses. Consequently, the individual parse trees will have different sizes, in general with the largest one corresponding to an empty list of remaining tokens, and thus to a complete parse.

With these quite simple changes, our parse stream framework will be able to deal with ambiguous grammars as well.

Adding actions

The FP-style of parsing discussed in this paper opens up further modifications. In the arithmetic expression example, we saw how one of the types (the operators) had to perform some non-trivial operations on the parse tree.

This can be generalised. Instead of just allowing the simple manipulations involved in parsing binary operators and parentheses, one could allow more general actions to be associated with steps in a parse.

Various examples could be:

a pretty printer, printing out the code in a nicely formatted manner as the parse goes along;
cross-translation, translating each element of a parse into some code in another language, in the same way as yacc and similar tools do;
execution of the result as it goes along, in the case of arithmetic expressions this would be one very natural action to add, effectively turning the parser into a calculator.

To associate an action with a step in a parse, i.e., with a particular sub-parser, we need two things. First of all, we need a function to perform. This is done by using a function object, as this is the best way of passing around function definitions in C++. Secondly, we need an operator associating an action with a given sub-parser. This latter step will be done by overloading the >>= operator. Hence, given a sub-parser p and a functor f, we will define p>>=f to be the sub-parser p with the actions given by f added to it^[6]. This is consistent with interpreting the sub-parsers as streams: the output of one pstream is sent to the associated function object for further processing.

The definition of the corresponding sub-parser type, pUse, is just like the definitions of all the other combinators. Explicitly^[7]:

template<class T, class S>
class pUse : public pstream {
private:
  T p; // component parser
  S f; // function to apply
public:
  ... // usual stuff
};
template<class T, class S>
inline Presult pUse<T,S>::operator<<=(const
                 Presult& pr) {
  Presult pr2(p <<= pr); // apply parser
  return f(pr2); // apply function - MUST return Presult
}
template<class T, class S>
inline pUse<T,S> operator>>=
                 (const T &pp, const S &ff) {
  return pUse<T,S>(pp,ff);
                     // alternative syntax
}

To use this, one must then define a function object. For instance:

class printer {
public:
  Presult operator() (Presult &pr) const {
    std::cout << "Parse tree: ";
    (pr.first).print();
    std::cout << std::endl << " and ";
    prList(pr.second);
    return pr;
  }
};

assuming that a print method has been added to the ParseTree class and that the function prList prints a list in some appropriate format.

Let ls be a list of characters, one can then write:

pUse<pExp,printer> pu(pe,prn);
pu <<= ls;

where pe, prn are instances of pExp and printer respectively. This would then print the parse tree together with the list of unparsed characters. In fact, precisely such a construction was used during the test phase of developing the programs presented here.

Alternatively, one could write:

(pe >>= prn) <<= ls;

leaving the creation of the proper classes to C++. This latter syntax is probably as close as one will be able to get to the FPsyntax in C++.

Similarly, one could define a function object, executor, which executes the code found in the parse tree.

Finally, once pUse has been defined, one could redefine pNum and pOp in terms of it and pLit. The former would have to extract the inserted individual digits, combine them to form a number and re-insert this at the proper place in the parse tree, while the latter would have to extract the operator and call update.

Conclusion

We were able to extend the functional-style parsing using subparsers to C++ by defining a generalised class of streams, called pstreams. With this, we could define combinators allowing us to build new sub-parsers by combining old ones. With these general tools out of the way, it turned out to be very easy, once recursion had been handled properly, to implement the BNF grammar for integer arithmetic expressions. Moreover, similarly to the situation in FP, there was a very intimate relationship, in fact more or less an isomorphism, between the actual BNF production rule and the corresponding sub-parser, making it very easy to write such sub-parsers - it would even be feasible to write a general sub-parser generator to do so automatically.

It was also shown how to handle operator precedence and, perhaps even more importantly, ambiguous grammars. None of this involved major changes to the framework, thus the proposed framework scales well to more complicated grammars such as those of typical programming languages.

Moreover, we showed how one could associate actions to individual sub-parsers thereby dramatically extending the possibilities of stream-based parsing. The associated actions could be used to create a pretty printer, or even to translate or execute the code.

References

[Martin] J. C. Martin, Introduction to languages and the theory of computation, 2nd ed, McGraw-Hill (1997).

[Hutton] G. Hutton, "Higher-order functions for parsing", J. Funct. Prog. 2 (3), p323 (1992).

[Fokker] J. Fokker, "Functional parsers", Lect. Notes of the Baastad Spring School on Funct. Prog. (1995).

[Peyton-Jones-] S. L. Peyton-Jones, D. Lester, Implementing functional languages. A tutorial, Prentice-Hall (1992).

[Antonsen] F. Antonsen, "Functional programming in Perl", to appear in The Perl Review.

[Alexandrescu] A. Alexandrescu Modern C++ Design : Generic Programming and Design Patterns Applied, Addison-Wesley, 2001.

[Josuttis] N. M. Josuttis, The C++ Standard Library. A tutorial and reference, Addison-Wesley, 1999.

^[1] Although, C++ being a multi-paradigm language this would be a worthwhile topic.

^[2] That being said, it ought to be mentioned too that the Standard Template Library (STL) has introduced many FP features, among these a limited support for partial binding and the ability to write function composition operators, [Josuttis], although this still cannot be done by overloading a natural operator.

^[3] Floating point numbers could be accommodated with small changes, but this will introduce an unnecessary level of complexity

^[4] This "pattern matching" is really an advanced dispatch mechanism allowing the user to "overload" functions not just on basis of different types of arguments but also on different values.

^[5] C++ can also do this in a limited way using bind1st and bind2nd on function objects. These are available in the <functional> header file and provide one more example of FP-concepts being introduced into C++. See e.g. [Josuttis]

^[6] In monadic parsing in Haskell one often uses the sequencing operator >>= for precisely this purpose, hence we will use that here too.

^[7] In most FP languages one would say that f must be of type Presult -> Presult.

Notes:

More fields may be available via dynamicdata ..