    <rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" xmlns:admin="http://webns.net/mvcb/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:content="http://purl.org/rss/1.0/modules/content/">
     <channel>
        <title>ACCU  :: Combining the STL with SAX and XPath for Effective XML
Parsing</title>
        <link>https://members.accu.org/index.php/articles/1239</link>
        <description>Professionalism in Programming</description>
        <dc:language>en-us</dc:language> 
        <dc:creator>Administrator</dc:creator> 
        <admin:generatorAgent rdf:resource="http://www.xaraya.org" /> 
        <admin:errorReportsTo rdf:resource="mailto:webeditor@accu.org" />
       <sy:updatePeriod>hourly</sy:updatePeriod>
       <sy:updateFrequency>1</sy:updateFrequency>
       <docs>http://backend.userland.com/rss</docs>




<div class="xar-mod-head"><span class="xar-mod-title">Programming Topics + CVu Journal Vol 15, #5 - Oct 2003</span></div>





<div class="xar-norm xar-standard-box-padding">
   <h1><strong>Title:</strong>&nbsp;Combining the STL with SAX and XPath for Effective XML
Parsing</h1>
<p><strong>Author:</strong>&nbsp;</p>
<p>
<strong>Date:</strong> Fri, 03 October 2003 13:16:00 +01:00</p>
<p><strong>Summary:</strong>&nbsp;</p>
<p><strong>Body:</strong>&nbsp;<div class="sect1" lang="en">
<div class="titlepage">
<h2><a name="d0e18" id="d0e18"></a></h2>
</div>
<div class="sidebar">
<p>This article appeared (slightly edited) in the January 2003
issue of The C/C++ Users Journal. It is reproduced here by kind
permission.</p>
</div>
<p>There are two main methods in common use for parsing an XML
document: the Document Object Model (DOM) and the Simple API for XML
(SAX). Parsers that support the first method read the whole
document into a data structure in memory, then provide access to it
using the W3C's DOM API. This requires that the whole document fits
into memory, and costs some time up front while the parsing is done.
Furthermore, the user then has to navigate the DOM tree to gain
access to the data in the document.</p>
<p>The second method is event driven, in that the parser calls
user-supplied event handlers as it encounters occurrences of
various parts of the XML document, such as Elements, Text and so
on.</p>
<p>This article describes an efficient way to parse an XML
document, using standard C++ library containers in conjunction with
a SAX parser, resulting in fast de-serialisation of data from an
XML file directly to data structures held in memory.</p>
<p>XML data may represent many different kinds of data:
plain character strings, integers, floating-point numbers and so
on. A look at the W3C's XML Schema recommendation shows the number
of data types that have been anticipated and provided for by this
standard. We need a way to read from an XML text element into any
one of a number of C++ data types. Ideally this should also be
extensible for user-defined types.</p>
<p>We can achieve this by the use of a polymorphic <tt class=
"classname">Element</tt> class, which has the ability to convert
and store textual XML data in any data structure the user
wishes:</p>
<pre class="programlisting">
class Element {
public:
  virtual void put(const std::string&amp; text) const = 0;
  virtual ~Element() {}
};
</pre>
<p>Clearly this is a base class, as evidenced by the pure virtual
method &quot;put&quot;, and the virtual destructor, which ensures proper
behaviour if we delete objects of classes derived from this one via
a base type pointer.</p>
<p>For each type of data we wish to be able to parse from the XML
we need a new, derived, class with an appropriately-typed data
member pointing at a variable suitable to hold the data item. There
are two ways we could do this.</p>
</div>
<div class="sect1" lang="en">
<div class="titlepage">
<h2><a name="d0e42" id="d0e42"></a>Explicitly
deriving concrete classes</h2>
</div>
<p>We can derive a specific class for each data type we need to
parse. It must include a suitably overridden put method that can
convert the character data from the XML document into the specific
data type we need.</p>
<p>For example, a class for type <span class="type">long</span>
would look like this:</p>
<pre class="programlisting">
class LongElement : public Element {
  long* ptr_;
public:
  LongElement(long* ptr) : ptr_(ptr) {}
  virtual void put(const std::string&amp; text) const {
    if (ptr_) {
      *ptr_=atol(text.c_str());
    }
  }
};
</pre>
<p>We would need to define a different derived class for each data
type on which we want to operate.</p>
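<p>One weakness of <tt class="function">atol</tt> is that it returns 0
for text that is not a number, so a malformed element silently
overwrites the variable. The following is a minimal sketch of a
stricter variant, assuming we would rather leave the target untouched
when the text cannot be converted. The class name
<tt class="classname">CheckedLongElement</tt> is hypothetical and is
not part of the original design:</p>
<pre class="programlisting">
#include &lt;cerrno&gt;
#include &lt;cstdlib&gt;
#include &lt;string&gt;

class Element {
public:
  virtual void put(const std::string&amp; text) const = 0;
  virtual ~Element() {}
};

// Hypothetical variant of LongElement: the target is only written
// when the text starts with a valid long that is within range.
class CheckedLongElement : public Element {
  long* ptr_;
public:
  explicit CheckedLongElement(long* ptr) : ptr_(ptr) {}
  virtual void put(const std::string&amp; text) const {
    if (!ptr_) return;
    char* end = 0;
    errno = 0;
    const long value = std::strtol(text.c_str(), &amp;end, 10);
    // end == text.c_str() means no digits were consumed.
    if (end != text.c_str() &amp;&amp; errno != ERANGE) {
      *ptr_ = value;
    }
  }
};
</pre>
<p>Note that this sketch still accepts trailing characters after the
number; whether to reject them is a design decision left open
here.</p>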
</div>
<div class="sect1" lang="en">
<div class="titlepage">
<h2><a name="d0e56" id="d0e56"></a>Using a class
template</h2>
</div>
<p>I said there were two ways to do this. The choice is between
inheritance polymorphism and parametric polymorphism; the latter is
more commonly known as &quot;templates&quot; in C++. Given that we are only
changing the type of data operated on, why don't we implement this
as a class template?</p>
<p>The crucial factor is the design of the put method. Since each
derived class handles a different type of data we need a way to
code this function in such a way that we don't need to specialise
the template for each type - which would negate the advantage of
using a template. The ideal way would be to use a conversion
function, which itself is a template. Luckily we have such
functions in the standard <tt class="literal">iostream</tt>
library, which makes its business the conversion of varied data
types to and from character streams, such as we might find in an
XML document.</p>
<p>The template version of the derived class looks like this:</p>
<pre class="programlisting">
template&lt;typename T&gt;
class ElementData : public Element {
  T&amp; ref_;
public:
  ElementData(T&amp; item) : ref_(item) {}
  virtual void put(const std::string&amp; s) const {
    std::istringstream stream(s);
    stream &gt;&gt; ref_;
  }
};
</pre>
<p>You can see that, like the non-templated version, this class
overrides the put method to put the XML data it is passed into the
data storage it was given in its constructor. However this class
can cope with any type for which a stream extraction operator has
been defined.</p>
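<p>To illustrate the extensibility, here is a small sketch that feeds
a user-defined type through the same template. The
<tt class="classname">Point</tt> type and its &quot;x,y&quot; text format are
assumptions made up for this example. The sketch also adds one
explicit specialisation for <tt class="classname">std::string</tt>,
because stream extraction into a string stops at the first whitespace
character; that specialisation is a suggested refinement, not
something the original code contains:</p>
<pre class="programlisting">
#include &lt;istream&gt;
#include &lt;sstream&gt;
#include &lt;string&gt;

class Element {
public:
  virtual void put(const std::string&amp; text) const = 0;
  virtual ~Element() {}
};

template&lt;typename T&gt;
class ElementData : public Element {
  T&amp; ref_;
public:
  explicit ElementData(T&amp; item) : ref_(item) {}
  virtual void put(const std::string&amp; s) const {
    std::istringstream stream(s);
    stream &gt;&gt; ref_;
  }
};

// Suggested refinement: copy string content whole, so that text
// containing spaces is not cut off at the first space.
template&lt;&gt;
void ElementData&lt;std::string&gt;::put(const std::string&amp; s) const {
  ref_ = s;
}

// Hypothetical user-defined type, written in the XML as "x,y".
struct Point {
  int x, y;
};

// The extraction operator is all ElementData needs from the type.
std::istream&amp; operator&gt;&gt;(std::istream&amp; in, Point&amp; p) {
  char comma = 0;
  in &gt;&gt; p.x &gt;&gt; comma &gt;&gt; p.y;
  if (comma != ',') in.setstate(std::ios_base::failbit);
  return in;
}
</pre>
<p>With these definitions,
<tt class="literal">ElementData&lt;Point&gt;</tt> can be used exactly like
<tt class="literal">ElementData&lt;int&gt;</tt>, with no change to the
template itself.</p>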
<p>Now, we have a number of classes (i.e., different instantiations
of the template) that can hold a reference to a data item (a
program variable, in other words). How are we going to manage
objects of these types and how will they fit into the SAX parsing
methodology?</p>
<p>Remembering that the SAX parser will call our handler object for
the start and end of each element, and each piece of character data
in the XML document, we need to arrange that the put method of the
appropriate object be called at the appropriate time with each
piece of data.</p>
<p>The best way to do this is to use a look-up table that will
direct us to the relevant ElementData object for each piece of XML
text. For this we will use the look-up table data structure
supplied with the C++ standard library, the STL map class
template.</p>
<p>Recall that the <tt class="classname">std::map</tt> takes two
main template arguments, and these are the types of the key into
the map, and the type of the item to be stored. In this case these
are <tt class="classname">std::string</tt> and pointer to
<tt class="classname">Element</tt> respectively. Let's declare a
typedef to make our lives easier:</p>
<pre class="programlisting">
typedef std::map&lt;std::string, const Element*&gt; ElementMap_t;
</pre>
<p>So what do we use for the key of the map? Since we are mapping
from each XML element to its data item, the key must identify the
XML element in question. This means we should use its name as the
key.</p>
<p>Let's put all this together and see how it can be used to parse
the following simple XML document:</p>
<pre class="programlisting">
&lt;?xml version='1.0'?&gt;
&lt;Person&gt;
  &lt;FirstName&gt;Elvis&lt;/FirstName&gt;
  &lt;LastName&gt;Presley&lt;/LastName&gt;
  &lt;DateOfBirth&gt;
    &lt;Year&gt;1935&lt;/Year&gt;
    &lt;Month&gt;1&lt;/Month&gt;
    &lt;Day&gt;8&lt;/Day&gt;
  &lt;/DateOfBirth&gt;
&lt;/Person&gt;
</pre>
<p>We want to store each individual data item in a separate
variable, each with its own <tt class=
"literal">ElementData&lt;&gt;</tt> object with the template
instantiated for the appropriate type:</p>
<pre class="programlisting">
std::string FirstName;
std::string LastName;
struct Date {
  int year,month,day;
};
Date dob;
ElementMap_t element_map;

// The map stores pointers, so each ElementData is allocated with new.
element_map.insert(std::make_pair(
  std::string(&quot;FirstName&quot;),
  new ElementData&lt;std::string&gt;(FirstName)));
element_map.insert(std::make_pair(
  std::string(&quot;LastName&quot;),
  new ElementData&lt;std::string&gt;(LastName)));
element_map.insert(std::make_pair(
  std::string(&quot;Year&quot;),
  new ElementData&lt;int&gt;(dob.year)));
element_map.insert(std::make_pair(
  std::string(&quot;Month&quot;),
  new ElementData&lt;int&gt;(dob.month)));
element_map.insert(std::make_pair(
  std::string(&quot;Day&quot;),
  new ElementData&lt;int&gt;(dob.day)));
</pre>
<p>These can be wrapped in a set of overloaded functions to make it
easier, or may even be automatically called by some program that
gets the information from a metadata repository of some kind.</p>
<p>Then we need a SAX parser with a handler that looks up each XML
element in the map, and calls the put method on the object it finds
there, passing the character text from that element.</p>
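<p>The look-up step itself is independent of any parser. A minimal
sketch, assuming a free function called
<tt class="function">dispatch</tt> (a name invented for this
illustration), might be:</p>
<pre class="programlisting">
#include &lt;map&gt;
#include &lt;sstream&gt;
#include &lt;string&gt;

class Element {
public:
  virtual void put(const std::string&amp; text) const = 0;
  virtual ~Element() {}
};

template&lt;typename T&gt;
class ElementData : public Element {
  T&amp; ref_;
public:
  explicit ElementData(T&amp; item) : ref_(item) {}
  virtual void put(const std::string&amp; s) const {
    std::istringstream stream(s);
    stream &gt;&gt; ref_;
  }
};

typedef std::map&lt;std::string, const Element*&gt; ElementMap_t;

// Hypothetical helper: pass the text to the entry registered for
// this element name. Returns false if the name is not in the map.
bool dispatch(const ElementMap_t&amp; map,
              const std::string&amp; name,
              const std::string&amp; text) {
  const ElementMap_t::const_iterator i = map.find(name);
  if (i == map.end()) return false;
  i-&gt;second-&gt;put(text);
  return true;
}
</pre>
<p>Elements that are not in the map are simply ignored, which lets the
caller pick out only the parts of the document it cares about.</p>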
</div>
<div class="sect1" lang="en">
<div class="titlepage">
<h2><a name="d0e108" id="d0e108"></a>XPath</h2>
</div>
<p>The element names we have used in the element map above are
not specific enough. What if our XML document has more than one
date, the start and end of a period for example? A bare name such as
Year cannot tell them apart. We need a more
specific way to identify the element in the document. This is what
XPath was designed to do.</p>
<p>XPath allows us to specify an element in an XML document using a
hierarchical directory path-like notation. The root of the document
is represented by a slash, and each element name is appended,
separated by more slashes.</p>
<p>Some examples from the document above:</p>
<pre class="programlisting">
/Person
/Person/FirstName
/Person/DateOfBirth/Month
</pre>
<p>XPath is much more expressive than this, but we are going to use
this simple form of the notation to identify individual elements of
our XML document.</p>
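<p>As a small illustration of this simple form, the path of an
element can be built by joining the names of all currently open
elements. The helper name <tt class="function">make_path</tt> is an
assumption for this sketch:</p>
<pre class="programlisting">
#include &lt;string&gt;
#include &lt;vector&gt;

// Hypothetical helper: join the names of the open elements,
// outermost first, into the simple absolute path form.
// Returns an empty string when no element is open.
std::string make_path(const std::vector&lt;std::string&gt;&amp; open_elements) {
  std::string path;
  for (std::vector&lt;std::string&gt;::size_type i = 0;
       i &lt; open_elements.size(); ++i) {
    path += '/';
    path += open_elements[i];
  }
  return path;
}
</pre>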
</div>
<div class="sect1" lang="en">
<div class="titlepage">
<h2><a name="d0e121" id="d0e121"></a>The SAX
Handler</h2>
</div>
<p>There are several XML parsers around that include SAX
capabilities. Here we will use the Xerces C++ parser from the
Apache project as an example, although the technique could just as
easily be applied to any other SAX parser. The handler class here
derives from the Xerces <tt class="classname">HandlerBase</tt>
class.</p>
<p>The SAX handler is the piece that does all the work. It has a
number of methods that are called by the parser as the XML document
is processed. In this case we are concerned with the beginning and
end of XML elements, and with character text. We use the beginning
and end element notifications to keep a record of where we are in
the XML document, constructing an XPath string as we go along. This
path to the current element is stored on a stack. Any character
data for the active element is accumulated until we come to the end
of the element. When we do reach the end of an element we pop the
top item off the stack, so that the previous element's path becomes
the active one.</p>
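<p>The bookkeeping described above can be tried out without a parser.
The following sketch is a stand-in for the real handler: it takes
<tt class="classname">std::string</tt> arguments instead of
<span class="type">XMLCh</span> pointers, and the class name
<tt class="classname">PathHandler</tt> is hypothetical. The calling
code plays the role of the SAX parser:</p>
<pre class="programlisting">
#include &lt;map&gt;
#include &lt;sstream&gt;
#include &lt;stack&gt;
#include &lt;string&gt;

class Element {
public:
  virtual void put(const std::string&amp; text) const = 0;
  virtual ~Element() {}
};

template&lt;typename T&gt;
class ElementData : public Element {
  T&amp; ref_;
public:
  explicit ElementData(T&amp; item) : ref_(item) {}
  virtual void put(const std::string&amp; s) const {
    std::istringstream stream(s);
    stream &gt;&gt; ref_;
  }
};

typedef std::map&lt;std::string, const Element*&gt; ElementMap_t;

// Hypothetical parser-free stand-in for the SAX handler.
class PathHandler {
  const ElementMap_t&amp; element_map_;
  std::stack&lt;std::string&gt; current_path_;
  std::string current_text_;
public:
  explicit PathHandler(const ElementMap_t&amp; map) : element_map_(map) {}

  // Push the path of the new element: parent path + '/' + name.
  void startElement(const std::string&amp; name) {
    std::string path;
    if (!current_path_.empty()) path = current_path_.top();
    path += '/';
    path += name;
    current_path_.push(path);
  }

  // Text may arrive in several pieces, so accumulate it.
  void characters(const std::string&amp; text) { current_text_ += text; }

  // Deliver the accumulated text, then return to the parent path.
  void endElement() {
    if (current_path_.empty()) return;
    const ElementMap_t::const_iterator i =
        element_map_.find(current_path_.top());
    if (i != element_map_.end()) i-&gt;second-&gt;put(current_text_);
    current_path_.pop();
    current_text_.clear();
  }
};
</pre>
<p>In this sketch, whitespace that precedes a child element is still
in the text buffer when the child's content arrives. Stream extraction
skips leading whitespace, so the stored values are unaffected.</p>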
<p>Here is the declaration of the class:</p>
<pre class="programlisting">
class MySaxHandler : public HandlerBase {
  const ElementMap_t&amp; element_map_;
  std::stack&lt;std::string&gt; current_path_;
  std::ostringstream current_text_;

public:
  MySaxHandler(const ElementMap_t&amp; map);
  void startElement(const XMLCh* const name,
                    AttributeList&amp; atts);
  void endElement(const XMLCh* const name);
  void characters(const XMLCh* const text,
                  const unsigned int length);
};
</pre>
<p>The constructor simply initialises the object's member variable
with a reference to the element map.</p>
<pre class="programlisting">
MySaxHandler::MySaxHandler
  (const ElementMap_t&amp; map)
  : element_map_(map) {}
</pre></div>
<div class="sect1" lang="en">
<div class="titlepage">
<h2><a name="d0e139" id="d0e139"></a>Handler
Methods</h2>
</div>
<p>First, <tt class="methodname">startElement</tt> registers the
start of a new XML element and adds it to the XPath name on our
stack:</p>
<pre class="programlisting">
void MySaxHandler::startElement
                   (const XMLCh* const name,
                    AttributeList&amp; atts) {
  std::ostringstream this_path;
  if (!current_path_.empty()) {
    this_path&lt;&lt;current_path_.top();
  }
  this_path &lt;&lt; '/';
  write_xml(this_path, name);
  current_path_.push(this_path.str());
}
</pre>
<p>The reason for using a <tt class="classname">stringstream</tt>
rather than a simple <tt class="classname">string</tt> is so that
we can take advantage of the insertion function we will see
shortly; this will handle conversion of <i class=
"parameter"><tt>XMLCh</tt></i> Unicode characters to our local
encoding. However, for reasons we shall soon see, this needs to be
an explicit <tt class="function">write_xml</tt> function rather
than an overloaded <tt class="function">operator&lt;&lt;</tt>.</p>
<p>Next, <tt class="methodname">characters</tt> is called by the
SAX parser for all textual element content. We simply maintain a
<tt class="classname">stringstream</tt> and insert the new
characters into it whenever we get some.</p>
<pre class="programlisting">
void MySaxHandler::characters(
                   const XMLCh*const text,
                   const unsigned int length) {
  write_xml(current_text_, text);
}
</pre>
<p>Finally, at the end of each element, <tt class=
"methodname">endElement</tt> finds the element in question by
looking up its XPath name in the map and then calls put to write
the characters saved so far to the stored variable reference:</p>
<pre class="programlisting">
void MySaxHandler::endElement(
                   const XMLCh* const name) {
  if (!current_path_.empty()) {
    ElementMap_t::const_iterator i
             = element_map_.find(
                   current_path_.top());

    if (i != element_map_.end())
      i-&gt;second-&gt;put(current_text_.str());

    current_path_.pop();
    current_text_.str(&quot;&quot;);
  }
}
</pre></div>
<div class="sect1" lang="en">
<div class="titlepage">
<h2><a name="d0e183" id="d0e183"></a>XML Character
encoding</h2>
</div>
<p>The XML standard allows you to represent the characters that
make up an XML document in any encoding you like. There are a
number of rules used by XML parsers to determine the correct
encoding, including the &quot;<tt class="literal">encoding=</tt>&quot;
attribute on the &quot;<tt class="literal">&lt;?xml ?&gt;</tt>&quot;
declaration at the beginning of the document.</p>
<p>The Xerces SAX parser represents characters using an
<span class="type">XMLCh</span> data type, and passes us strings by
pointers to this character type.</p>
<p>These XML characters are represented in a Unicode encoding
whereas to store them in standard strings we need them in our local
encoding. Xerces provides a static <tt class=
"function">XMLString::transcode()</tt> function to do this
conversion. The conversion could be automated by building it into
the insertion operator for the type <span class=
"type">XMLCh</span>.</p>
<p>However, <span class="type">XMLCh</span> is a typedef for short,
which makes this difficult: you can't overload based on a typedef
because a typedef does not create a new type, only an alias.
Therefore the standard inserter for short will be used by the
compiler instead. There are a couple of ways around this problem:
use a different function to insert into the stream
(rather than <tt class="methodname">operator&lt;&lt;</tt>), or
explicitly translate the encoding before inserting into the
stream.</p>
<p>Here is the function <tt class="function">write_xml</tt> used
earlier to transcode and insert XML characters into a stream:</p>
<pre class="programlisting">
void write_xml(std::ostream&amp; target,
               const XMLCh* s) {
  char *p = XMLString::transcode(s);
  target &lt;&lt; p;
  delete [] p;
}
</pre>
<p>To avoid the call to <tt class="literal">delete[]</tt> you could
replace the <span class="type">char*</span> with a smart pointer
capable of holding and deleting an array (unlike <tt class=
"classname">std::auto_ptr</tt>). If <tt class=
"literal">target&lt;&lt;p</tt> could throw, for example, this would
be necessary to make the function exception-safe.</p>
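<p>A minimal sketch of such an array-owning guard is shown below. It
is similar in spirit to <tt class="classname">boost::scoped_array</tt>.
The names <tt class="classname">ArrayGuard</tt>,
<tt class="function">copy_as_array</tt> and
<tt class="function">write_guarded</tt> are hypothetical, and
<tt class="function">copy_as_array</tt> merely stands in for a function
that returns a <tt class="literal">new[]</tt>-allocated buffer, so that
the example does not depend on Xerces:</p>
<pre class="programlisting">
#include &lt;cstring&gt;
#include &lt;ostream&gt;
#include &lt;sstream&gt;
#include &lt;string&gt;

// Hypothetical minimal owner for an array allocated with new[].
template&lt;typename T&gt;
class ArrayGuard {
  T* ptr_;
  ArrayGuard(const ArrayGuard&amp;);             // not copyable
  ArrayGuard&amp; operator=(const ArrayGuard&amp;);  // not assignable
public:
  explicit ArrayGuard(T* ptr) : ptr_(ptr) {}
  ~ArrayGuard() { delete [] ptr_; }
  T* get() const { return ptr_; }
};

// Stand-in for a transcoding call: returns a new[]-allocated,
// null-terminated copy that the caller must release.
char* copy_as_array(const std::string&amp; s) {
  char* p = new char[s.size() + 1];
  std::memcpy(p, s.c_str(), s.size() + 1);
  return p;
}

void write_guarded(std::ostream&amp; target, const std::string&amp; s) {
  const ArrayGuard&lt;char&gt; guard(copy_as_array(s));
  target &lt;&lt; guard.get();  // the array is freed even if this throws
}
</pre>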
</div>
<div class="sect1" lang="en">
<div class="titlepage">
<h2><a name="d0e236" id="d0e236"></a>Using the SAX
processor</h2>
</div>
<p>With the handler class in place and now using XPath style
element names, we can rewrite the parsing code. First, add a helper
function to make it easier to add an element to the map:</p>
<pre class="programlisting">
template&lt;typename T&gt;
void AddElement(ElementMap_t&amp; map,
                T* ptr,
                const std::string&amp; path) {
  // ElementData stores a reference, so dereference the pointer here.
  map.insert(std::make_pair
             (path, new ElementData&lt;T&gt;(*ptr)));
}
</pre>
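<p>Because <tt class="function">AddElement</tt> allocates each entry
with <tt class="literal">new</tt>, something has to delete the entries
once parsing is finished. One possible sketch, assuming the map is the
sole owner of its elements, is a clean-up helper; the name
<tt class="function">ClearElements</tt> is hypothetical:</p>
<pre class="programlisting">
#include &lt;map&gt;
#include &lt;string&gt;

class Element {
public:
  virtual void put(const std::string&amp; text) const = 0;
  virtual ~Element() {}
};

typedef std::map&lt;std::string, const Element*&gt; ElementMap_t;

// Hypothetical helper: delete every Element held by the map, then
// empty it. The virtual destructor makes deletion through the
// base class pointer safe.
void ClearElements(ElementMap_t&amp; map) {
  for (ElementMap_t::iterator i = map.begin(); i != map.end(); ++i) {
    delete i-&gt;second;
  }
  map.clear();
}
</pre>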
<p>The final code looks like this:</p>
<pre class="programlisting">
char filename[]=&quot;file.xml&quot;;
ElementMap_t element_map;

AddElement(element_map, &amp;FirstName,
  &quot;/Person/FirstName&quot;);
AddElement(element_map, &amp;LastName,
  &quot;/Person/LastName&quot;);
AddElement(element_map, &amp;dob.year,
  &quot;/Person/DateOfBirth/Year&quot;);
AddElement(element_map, &amp;dob.month,
  &quot;/Person/DateOfBirth/Month&quot;);
AddElement(element_map, &amp;dob.day,
  &quot;/Person/DateOfBirth/Day&quot;);

// Xerces must already be initialised with
// XMLPlatformUtils::Initialize() at this point.
SAXParser parser;
MySaxHandler handler(element_map);
parser.setDocumentHandler(&amp;handler);
parser.parse(filename);
</pre></div>
<div class="sect1" lang="en">
<div class="titlepage">
<h2><a name="d0e247" id="d0e247"></a>Further
refinement</h2>
</div>
<p>An obvious enhancement is to enable multiple occurrences of the
same element name in the XML document to store data in
corresponding multiple variables, something which is not catered
for by the code I have presented here.</p>
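<p>One possible direction for this enhancement, sketched here under
the assumption that repeated elements should be collected in document
order, is an element class that appends to a vector instead of
overwriting a single variable. The class name
<tt class="classname">ElementList</tt> is hypothetical:</p>
<pre class="programlisting">
#include &lt;sstream&gt;
#include &lt;string&gt;
#include &lt;vector&gt;

class Element {
public:
  virtual void put(const std::string&amp; text) const = 0;
  virtual ~Element() {}
};

// Hypothetical element type: every successfully converted value is
// appended to the vector, so repeated elements are all kept.
template&lt;typename T&gt;
class ElementList : public Element {
  std::vector&lt;T&gt;&amp; items_;
public:
  explicit ElementList(std::vector&lt;T&gt;&amp; items) : items_(items) {}
  virtual void put(const std::string&amp; s) const {
    std::istringstream stream(s);
    T value = T();
    // Text that cannot be converted is skipped.
    if (stream &gt;&gt; value) items_.push_back(value);
  }
};
</pre>
<p>Since it derives from <tt class="classname">Element</tt>, such an
object can be stored in the same element map as the single-value
entries.</p>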
<p>Another way in which this design could be extended would be to
support multiple XML document types. Since the element map objects
contain full XPath names for each element, the same map could be
used, and it would continue to uniquely identify each element as it
is discovered in any known XML document type.</p>
</div>
<div class="bibliography">
<div class="titlepage">
<h2><a name="d0e254" id="d0e254"></a>Further
Reading</h2>
</div>
<div class="bibliomixed">
<p class="bibliomixed">XML: <span class="bibliomisc"><a href=
"http://www.w3c.org/xml" target=
"_top">http://www.w3c.org/xml</a></span></p>
</div>
<div class="bibliomixed">
<p class="bibliomixed">XPath: <span class="bibliomisc"><a href=
"http://www.w3c.org/TR/xpath" target=
"_top">http://www.w3c.org/TR/xpath</a></span></p>
</div>
<div class="bibliomixed">
<p class="bibliomixed">Xerces XML parser: <span class=
"bibliomisc"><a href="http://xml.apache.org/xerces-c" target=
"_top">http://xml.apache.org/xerces-c</a></span></p>
</div>
<div class="bibliomixed">
<p class="bibliomixed">Tim Pushman, The SAX Parser, <span class=
"citetitle"><i class="citetitle">C Vu</i></span>, December 2002</p>
</div>
</div>
</p>
<p><strong>Notes:</strong>&nbsp;</p>
</div>
</channel>
</rss>
