Title: Combining the STL with SAX and XPath for Effective XML Parsing

Author:

Date: 03 October 2003 13:16:00 +01:00 or Fri, 03 October 2003 13:16:00 +01:00

Summary:

Body:

There are two main methods in common usage when parsing an XML document: The Document Object Model, and the Simple API for XML (SAX). Parsers that support the first method read the whole document into a data structure in memory, then provide access to it using the W3C's DOM API. This requires that the whole document fits into memory, and takes a little time while the parsing is done. Furthermore the user then has to navigate the DOM tree to gain access to the data in the document.

The second method is event driven, in that the parser calls user-supplied event handlers as it encounters occurrences of various parts of the XML document, such as Elements, Text and so on.

This article describes an efficient way to parse an XML document, using standard C++ library containers in conjunction with a SAX parser, resulting in fast de-serialisation of data from an XML file directly to data structures held in memory.

XML data may represent a variety of different kinds of data, plain character strings, integers, floating-point numbers and so on. A look at the W3C's XML-Schema recommendation shows the number of data types that have been anticipated and provided by this standard. We need a way to read from an XML text element into any one of a number of C++ data types. Ideally this should also be extensible for user-defined types.

We can achieve this by the use of a polymorphic Element class, which has the ability to convert and store textual XML data in any data structure the user wishes:

class Element {
public:
  virtual void put(const std::string& text)=0 const;
  virtual ~Element() {}
};

Clearly this is a base class, as evidenced by the pure virtual method " put", and the virtual destructor which ensures proper behaviour if we delete objects of classes derived from this one via a base type pointer.

For each type of data we wish to be able to parse from the XML we need a new, derived, class with an appropriately-typed data member pointing at a variable suitable to hold the data item. There are two ways we could do this.

Explicitly deriving concrete classes

We can derive a specific class for each data type we need to parse. It must include a suitably overridden put method that can convert the character data from the XML document into the specific data type we need.

For example, a class for type long would look like this:

class LongElement : public Element {
  long* ptr_;
public:
  LongElement(long* ptr) : ptr_(ptr) {}
  virtual void put(const std::string& text) const {
    if (ptr_) {
      *ptr_=atol(text.c_str());
    }
  }
};

We would need to define a different derived class for each data type on which we want to operate.

Using a class template

I said there were two ways to do this, and we do have a choice of how to implement this: Inheritance polymorphism, or parametric polymorphism. The latter is more commonly known as "templates" in C++. Given that we are only changing the type of data operated on, why don't we choose to implement this as a class template?

The crucial factor is the design of the put method. Since each derived class handles a different type of data we need a way to code this function in such a way that we don't need to specialise the template for each type - which would negate the advantage of using a template. The ideal way would be to use a conversion function, which itself is a template. Luckily we have such functions in the standard iostream library, which makes its business the conversion of varied data types to and from character streams, such as we might find in an XML document.

The template version of the derived class looks like this:

template<typename T>
class ElementData : public Element {
  T& ref_;
public:
  ElementData(T& item) : ref_(item) {}
  virtual void put(const std::string& s) const {
    std::istringstream stream(s);
    stream >> ref_;
  }
};

You can see that, like the non-templated version, this class overrides the put method to put the XML data it is passed into the data storage it was given in its constructor. However this class can cope with any type for which a stream extraction operator has been defined.

Now, we have a number of classes (i.e., different instantiations of the template) that can hold a reference to a data item (a program variable, in other words). How are we going to manage objects of these types and how will they fit into the SAX parsing methodology?

Remembering that the SAX parser will call our handler object for the start and end of each element, and each piece of character data in the XML document, we need to arrange that the put method of the appropriate object be called at the appropriate time with each piece of data.

The best way to do this is to use a look-up table that will direct us to the relevant ElementData object for each piece of XML text. For this we will use the look-up table data structure supplied with the C++ standard library, the STL map class template.

Recall that the std::map takes two main template arguments, and these are the types of the key into the map, and the type of the item to be stored. In this case these are std::string and pointer to Element respectively. Let's declare a typedef to make our lives easier:

typedef std::map<std::string, const Element*> ElementMap_t;

So what do we use for the key of the map? Since we are mapping from each XML element to its data item, the key must identify the XML element in question. This means we should use its name as the key.

Let's put all this together and see how it can be used to parse the following simple XML document:

<?xml version='1.0'>
<Person>
  <FirstName>Elvis</FirstName>
  <LastName>Presley</LastName>
  <DateOfBirth>
    <Year>1935</Year>
    <Month>1</Month>
    <Day>8</Day>
  </DateOfBirth>
</Person>

We want to store each individual data item in a separate variable, each with its own ElementData<> object with the template instantiated for the appropriate type:

std::string FirstName;
std::string LastName;
struct Date {
  int year,month,day;
};
Date dob;
ElementMap_t element_map;

element_map.insert(std::make_pair
  ("FirstName",
  ElementData<std::string>(FirstName));
element_map.insert(std::make_pair
  ("LastName",
  ElementData<std::string>(LastName));
element_map.insert(std::make_pair(
  "Year",
  ElementData<int>(dob.year));
element_map.insert(std::make_pair(
  "Month",
  ElementData<int>(dob.month));
element_map.insert(std::make_pair(
  "Day",
  ElementData<int>(dob.day));

These can be wrapped in a set of overloaded functions to make it easier, or may even be automatically called by some program that gets the information from a metadata repository of some kind.

Then we need a SAX parser with a handler that looks up each XML element in the map, and calls the put method on the object it finds there, passing the character text from that element.

XPath

The element names we have used in the element map above are hardly descriptive. What if our XML document has more than one date, a start and an end of a period for example? We need a more specific way to identify the element in the document. This is what XPath was designed to do.

XPath allows us to specify an element in an XML document using a hierarchical directory path-like notation. The root of the document is represented by a slash, and each element name is appended, separated by more slashes.

Some examples from the document above:

/Person
/Person/FirstName
/Person/DateOfBirth/Month

XPath is much more expressive than this, but we are going to use this simple form of the notation to identify individual elements of our XML document.

The SAX Handler

There are several XML parsers around that include SAX capabilities. Here we will use the Xerces C++ parser from the Apache project as an example, although the technique could just as easily be applied to any other SAX parser. The handler class here derives from the Xerces HandlerBase class.

The SAX handler is the piece that does all the work. It has a number of methods that are called by the parser as the XML document is processed. In this case we are concerned with the beginning and end of XML elements, and with character text. We use the beginning and end element notifications to keep a record of where we are in the XML document, constructing an XPath string as we go along. This path to the current element is stored on a stack. Any character data for the active element is accumulated until we come to the end of the element. When we do reach the end of an element we pop the top item off the stack, so that the previous element's path becomes the active one.

Here is the declaration of the class:

class MySaxHandler : public HandlerBase {
  const ElementMap_t& element_map_
  std::stack<std::string> current_path_;
  std::ostringstream current_text_;

public:
  MySaxHandler(const ElementMap_t& map);
  void startElement(const XMLCh *name,
                    const AttributeList atts);
  void endElement(const XMLCh *name);
  void characters(const XMLCh *text);
};

The constructor simply initialises the object's member variable with a reference to the element map.

MySaxHandler::MySaxHandler
  (const ElementMap_t& map)
  : element_map_(map) {}

Handler Methods

First, startElement registers the start of a new XML element and adds it to the XPath name on our stack:

void MySaxHandler::startElement
                   (const XMLCh* const name,
                    AttributeList& atts) {
  std::ostringstream this_path;
  if (!current_path_.empty()) {
    this_path<<current_path_.top();
  }
  this_path << '/';
  write_xml(this_path, name);
  current_path_.push(this_path.str());
}

The reason for using a stringstream rather than a simple string is so that we can take advantage of the insertion function we will see shortly; this will handle conversion of XMLCh unicode characters to our local encoding. However, for reasons we shall soon see, this needs to be an explicit write_xml function rather than an overloaded operator<<.

Next, characters is called by the SAX parser for all textual element content. We simply maintain a stringstream and insert the new characters into it whenever we get some.

void MySaxHandler::characters(
                   const XMLCh*const text,
                   const unsigned int length) {
  write_xml(current_text, text);
}

Finally, at the end of each element, endElement finds the element in question by looking up its XPath name in the map and then calls put to write the characters saved so far to the stored variable reference:

void MySaxHandler::endElement(
                   const XMLCh* const name) {
  if (!current_path_.empty()) {
    ElementMap_t::const_iterator i
             = element_map_.find(
                   current_path_.top());

    if (i != element_map_.end())
      i->second->put(current_text_.str());

    current_path_.pop();
    current_text_.str("");
  }
}

XML Character encoding

The XML standard allows you to represent the characters that make up an XML document in any encoding you like. There are a number of rules used by XML parsers to determine the correct encoding, including the "encoding=" attribute on the "<?xml ?>" declaration at the beginning of the document.

The Xerces SAX parser represents characters using an XMLCh data type, and passes us strings by pointers to this character type.

These XML characters are represented in a Unicode encoding whereas to store them in standard strings we need them in our local encoding. Xerces provides a static XMLString::transcode() function to do this conversion. The conversion could be automated by building it into the insertion operator for the type XMLCh.

However, XMLCh is a typedef from short which makes it difficult - you can't overload based on a typedef because typedef does not create new type, but simply an alias. Therefore the standard inserter for short will be used by the compiler instead. To get around this problem there are a couple of alternatives: Use a different function to insert in the stream (rather than operator<<) or explicitly translate the encoding before inserting in the stream.

Here is the function write_xml used earlier to transcode and insert XML characters into a stream:

void write_xml(std::ostream& target,
               const XMLCh* s) {
  char *p = XMLString::transcode(s);
  target << p;
  delete [] p;
}

To avoid the call to delete[] you could replace the char* with a smart pointer capable of holding and deleting an array (unlike std::auto_ptr). If target<<p could throw, for example, this would be necessary to make the function exception-safe.

Using the SAX processor

With the handler class in place and now using XPath style element names, we can rewrite the parsing code. First, add a helper function to make it easier to add an element to the map:

template<typename T>
void AddElement(ElementMap_t& map,
                T* ptr,
                const std::string& path) {
  map.insert(std::make_pair
             (path, new ElementData<T>(ptr)));
}

The final code looks like this:

char filename[]="file.xml"
ElementMap_t element_map;

AddElement(element_map, &FirstName,
  "/Person/FirstName");
AddElement(element_map, &LastName,
  "/Person/LastName");
AddElement(element_map, &year,
  "/Person/DateOfBirth/Year");
AddElement(element_map, &month,
  "/Person/DateOfBirth/Month");
AddElement(element_map, &day,
  "/Person/DateOfBirth/Day");

MySaxHandler handler(element_map);
parser.setDocumentHandler(&handler);
parser.parse(filename);

Further refinement

An obvious enhancement is to enable multiple occurrences of the same element name in the XML document to store data in corresponding multiple variables, something which is not catered for by the code I have presented here.

Another way in which this design could be extended would be to support multiple XML document types. Since the element map objects contain full XPath names for each element, the same map could be used, and it would continue to uniquely identify each element as it is discovered in any known XML document type.