    <rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" xmlns:admin="http://webns.net/mvcb/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:content="http://purl.org/rss/1.0/modules/content/">
     <channel>
        <title>ACCU  :: XML Parsing with the Document Object Model</title>
        <link>https://members.accu.org/index.php/journals/1203</link>
        <description>Professionalism in Programming</description>
        <dc:language>en-us</dc:language> 
        <dc:creator>Administrator</dc:creator> 
        <admin:generatorAgent rdf:resource="http://www.xaraya.org" /> 
        <admin:errorReportsTo rdf:resource="mailto:webeditor@accu.org" />
       <sy:updatePeriod>hourly</sy:updatePeriod>
       <sy:updateFrequency>1</sy:updateFrequency>
       <docs>http://backend.userland.com/rss</docs>


        <h2>Journal Articles</h2>


<div class="xar-mod-head"><span class="xar-mod-title">CVu Journal Vol 14, #5 - Oct 2002 + Programming Topics</span></div>





<div class="xar-norm xar-standard-box-padding">
   <h1><strong>Title:</strong>&nbsp;XML Parsing with the Document Object Model</h1>
<p><strong>Author:</strong>&nbsp;</p>
<p>
<strong>Date:</strong> Thu, 03 October 2002 13:15:55 +01:00</p>
<p><strong>Summary:</strong>&nbsp;</p>
<p><strong>Body:</strong>&nbsp;<div class="sect1" lang="en">
<div class="titlepage">
<h2><a name="d0e18" id="d0e18"></a>What is the
DOM?</h2>
</div>
<p>Following Tim Pushman's article on parsing XML using SAX, I will
describe here the principles and details of the Document Object
Model, which defines a standard way to model an XML or HTML document.
It describes data structures with standard names and behaviours,
and standard functions to access the data. Most XML parsers support
the DOM and will parse any well-formed XML document into a DOM
structure for you.</p>
<p>But first a bit of history... Long ago when web pages were
mainly static text with a few images here and there, the writers of
the two main browsers (Netscape and Internet Explorer of course)
came up with what they called Dynamic HTML, or DHTML for short.
DHTML allowed web pages to access their own content, and to change
it according to users' actions. This allowed people to write web
pages with images that changed when the user moved the mouse over
them, and other fancy effects that they thought would attract more
people to their web sites.</p>
<p>Both browsers achieved this by adding a scripting capability
to the dialect of HTML that they understood. The two scripting
languages were similar but differed in many ways. They were called
JavaScript and JScript. Neither is remotely related to Java and the
two have now been unified and standardised into ECMAScript <a href=
"#ecma-script">[ecma-script]</a>.</p>
<p>In order that a script embedded in an HTML document could have
something to work on, a model was needed, through which the script
could access the various elements on the web page. The model that
was created modelled the actual HTML document itself, so became
known as the Document Object Model, or DOM for short. The DOM is
now a W3C standard <a href="#dom-std">[dom-std]</a> and comes in
two slightly different alternatives, for XML or HTML respectively.
Although, as we have seen, the DOM had its origins in HTML, the
HTML version can be thought of as a slightly specialised version of
the XML DOM, and since this is a series of articles on XML we will
concentrate on that one here. You can read more about the HTML DOM
in the official W3C recommendation <a href=
"#html-dom">[html-dom]</a>.</p>
<p>The model described by the DOM can be thought of as a tree-like
structure. The tree is made up of a number of different object-like
nodes. A node can be any one of a number of different node types,
which are effectively data types derived from the basic Node type.
In fact the standard does not specify that nodes in the DOM have to
be implemented as objects at all, merely that they behave like
objects. This allows DOM implementations in non-object-oriented
languages like C.</p>
<p>In order to get an XML document into a DOM tree you need an XML
parser that supports the DOM (most do). You can then parse existing
XML documents and create a DOM structure, or create one from
scratch. In this article we shall see how to do this, concentrating
mostly on the Apache Xerces C++ parser, although the principles
apply generally.</p>
<p>The root node of the tree is the document itself. Since this is
a tree structure there can only be one root, and that ties in with
the concept of an XML document, which can have only one root-level
element (the &quot;document element&quot;). The other main types of node in
the tree are Elements, Attributes, and text. In keeping with
well-formed XML, the document node contains only one element, while
elements can contain other elements, attributes, or text:</p>
<div class="c2"><img src=
"resources/xml%20parser%20element%20structure.png" align=
"middle"></div>
<p>This is a good place to explain that, although the DOM defines
all these types of node, you don't have to use them. There are
actually two interfaces to a document via the DOM: through generic
&quot;node&quot; objects (which the W3C recommendation describes as &quot;the
primary datatype for the Document Object Model&quot;) or through more
concrete derived types such as &quot;Element&quot; objects, &quot;Attribute&quot; objects, and
so on. This means that all the different node types in the diagram
could be labelled &quot;node&quot;, and accessed through the DOM Node
interface. Each DOM Node has an attribute (called <tt class="literal">nodeType</tt>) that
indicates what kind of node it actually is.</p>
<p>Let's look at an example. Take the following piece of XML (yes,
it's the familiar hypothetical phone book!):</p>
<pre class="programlisting">
&lt;?xml version='1.0'?&gt; 
&lt;PhoneList&gt; 
  &lt;Entry type=&quot;external&quot;&gt; 
    &lt;name&gt;John&lt;/name&gt; 
    &lt;number&gt;123456&lt;/number&gt; 
  &lt;/Entry&gt; 
  &lt;Entry type=&quot;external&quot;&gt; 
    &lt;name&gt;Jane&lt;/name&gt; 
    &lt;number&gt;7654321&lt;/number&gt; 
  &lt;/Entry&gt; 
  &lt;Entry type=&quot;internal&quot;&gt; 
    &lt;name&gt;Fred&lt;/name&gt; 
    &lt;number&gt;100&lt;/number&gt; 
  &lt;/Entry&gt; 
&lt;/PhoneList&gt;
</pre>
<p>The DOM structure of this document would look something like
this:</p>
<div class="c2"><img src=
"resources/xml%20parser%20dom%20structure.png" align=
"middle"></div>
<p>Once parsed into a DOM tree (we will see how to do this in a
minute) you can access the XML document using the DOM Document
interface. This has a documentElement attribute to access the
single document element (<tt class="literal">PhoneList</tt> in this
case). As an example of the methods provided by the DOM, you could
find its descendant elements with a specific name using the <tt class=
"literal">getElementsByTagName</tt> method.</p>
</div>
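<p>To make the tree idea concrete, here is a minimal sketch of the work a call like <tt class="literal">getElementsByTagName</tt> has to do: walk the sub-tree in document order and collect the elements whose name matches. The <tt class="literal">Node</tt> struct and the helper functions are hypothetical stand-ins invented for this illustration, not the Xerces classes, and a real implementation may well be organised differently.</p>

```cpp
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified node type for illustration; not the Xerces API.
enum class NodeType { Element, Text };

struct Node {
    NodeType type;
    std::string name;   // tag name for elements, "#text" for text nodes
    std::string value;  // character data for text nodes
    std::vector<std::unique_ptr<Node>> children;
};

std::unique_ptr<Node> make_element(const std::string& name) {
    std::unique_ptr<Node> n(new Node);
    n->type = NodeType::Element;
    n->name = name;
    return n;
}

std::unique_ptr<Node> make_text(const std::string& data) {
    std::unique_ptr<Node> n(new Node);
    n->type = NodeType::Text;
    n->name = "#text";
    n->value = data;
    return n;
}

// Collect every descendant element of `root` whose tag name matches,
// in document order (a pre-order walk of the tree).
void collect_by_tag_name(const Node& root, const std::string& tag,
                         std::vector<const Node*>& found) {
    for (const auto& child : root.children) {
        if (child->type == NodeType::Element && child->name == tag) {
            found.push_back(child.get());
        }
        collect_by_tag_name(*child, tag, found);
    }
}

// Build a cut-down version of the phone list used in this article
// (names only, no whitespace text nodes).
std::unique_ptr<Node> build_phone_list() {
    auto list = make_element("PhoneList");
    const char* names[] = {"John", "Jane", "Fred"};
    for (const char* person : names) {
        auto entry = make_element("Entry");
        auto name = make_element("name");
        name->children.push_back(make_text(person));
        entry->children.push_back(std::move(name));
        list->children.push_back(std::move(entry));
    }
    return list;
}
```

<p>Calling <tt class="literal">collect_by_tag_name</tt> on the tree from <tt class="literal">build_phone_list()</tt> with the name "name" finds the three &lt;name&gt; elements even though none of them is a direct child of the root, which is the difference between searching descendants and searching children.</p>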
<div class="sect1" lang="en">
<div class="titlepage">
<h2><a name="d0e62" id="d0e62"></a>Parsing</h2>
</div>
<p>Different XML parsers and processors have different ways to
initiate the parse process that builds the DOM tree. The examples
shown here all use the Xerces parser <a href="#xerces">[xerces]</a>
from Apache. This parser is closely related to the IBM XML4C parser
(IBM donated an early version to Apache, and their subsequent
versions are based on Xerces).</p>
<p>To do the parse we need a <tt class="literal">DOMParser</tt> object. We then call its
<tt class="literal">parse()</tt> method, giving it the name of an XML
file. The program below shows a minimal Xerces DOM program. (Xerces
has two sides to its personality and also supports the Simple API
for XML, SAX, but this article ignores that side.) All
error checking and exception catching have been omitted for the
usual reasons of space and clarity (but see later for details):</p>
<pre class="programlisting">
#include &lt;util/PlatformUtils.hpp&gt; 
#include &lt;parsers/DOMParser.hpp&gt; 
#include &lt;string&gt; 
#include &lt;iostream&gt; 
int main() { 
  // Initialise the XML processor 
  XMLPlatformUtils::Initialize(); 
  std::cout&lt;&lt;&quot;Enter the name of an XML file:&quot;; 
  std::string filename; 
  std::cin&gt;&gt;filename; 
  DOMParser parser; 
  parser.parse(filename.c_str()); 
}
</pre></div>
<div class="sect1" lang="en">
<div class="titlepage">
<h2><a name="d0e77" id="d0e77"></a>Reading
Elements and Text</h2>
</div>
<p>The following code fragment reads all the <tt class=
"literal">&lt;Entry&gt;</tt> elements from the previous XML
document:</p>
<pre class="programlisting">
DOM_Document phonelist=parser.getDocument(); 

// Get the &lt;PhoneList&gt; element 
DOM_Element root = phonelist.getDocumentElement(); 

// Get all &lt;Entry&gt; elements into a DOM &quot;nodelist&quot; structure 
DOM_NodeList entries = root.getElementsByTagName(&quot;Entry&quot;); 

// Now iterate through the nodelist, processing each element found there 
for (unsigned long node_index=0; node_index&lt; entries.getLength(); node_index++) { 
  //Deal with entry 
}
</pre>
<p>You will notice that the <tt class=
"literal">getElementsByTagName</tt> method provides us with a
<tt class="literal">NodeList</tt> rather than an <tt class=
"literal">ElementList</tt>. This is because the DOM defines a list
of nodes (which are all elements in this case) but not a list of
elements as such. We have to deal with a base node object rather
than a concrete Element object. More about that in a moment, but
first let's see how we extract the name and number from those
entries using the DOM <tt class="literal">firstChild</tt> and
<tt class="literal">nextSibling</tt> attributes:</p>
<pre class="programlisting">
for (unsigned long node_index=0; node_index&lt; entries.getLength(); node_index++) { 
  // Get each &lt;Entry&gt; element from the list 
  DOM_Node node=entries.item(node_index); 
  // Get the &lt;name&gt; node - the next sibling of &lt;Entry&gt;'s first child node 
  DOM_Node name_node = node.getFirstChild().getNextSibling(); 
  // Get the text node - a child of the &lt;name&gt; node 
  DOM_Node name_text = name_node.getFirstChild(); 
  // Finally get the actual text 
  DOMString name=name_text.getNodeValue(); 

  // Repeat for &lt;number&gt; 
  DOM_Node number_node = name_node.getNextSibling().getNextSibling(); 
  DOM_Node number_text = number_node.getFirstChild(); 
  DOMString number=number_text.getNodeValue(); 

  // Process name and number, remembering they are Unicode strings encoded in UTF-16, 
  // whatever the XML declaration of this particular document says 
}
</pre>
<p>The above code shows that, although <tt class=
"literal">firstChild</tt> and <tt class="literal">nextSibling</tt>
are attributes (not methods) of a DOM <tt class=
"literal">node</tt>, in this particular DOM parser (Xerces) they
are accessed via member functions <tt class=
"literal">getFirstChild()</tt> and <tt class=
"literal">getNextSibling()</tt>. This is because the DOM specifies
its interfaces in IDL (the interface definition language used by CORBA,
which is similar to the one used by COM) and does not specify the
technology to be used to implement these interfaces. Xerces chooses to
represent attributes with
<tt class="literal">get</tt>/<tt class="literal">set</tt> functions,
as COM does, hence the names used. This is
implementation-defined though, and another parser may well expose
attributes of DOM objects as simple member variables.</p>
<p>As mentioned earlier the DOM <tt class="literal">Node</tt>
interface is an alternative to the individual node-type interfaces.
Inasmuch as each type of node &quot;is a&quot; DOM Node, we have an
inheritance relationship, easily implemented using a base
<tt class="literal">Node</tt> class and derived <tt class=
"literal">Document</tt> class, <tt class="literal">Element</tt>
class, and so on. However, Xerces chooses not to implement it like
that. For whatever reason (probably related to the fact that Xerces
was implemented in Java before C++) you can only convert a Xerces
<tt class="literal">DOM_Node</tt> to a <tt class=
"literal">DOM_Element</tt> by using a <tt class=
"literal">static_cast</tt>:</p>
<pre class="programlisting">
DOM_Node some_node; 
DOM_Element elem; 
/*...assign some_node...*/ 
elem = dynamic_cast&lt;DOM_Element&amp;&gt;(some_node);  // WRONG! 
if(some_node.getNodeType() == DOM_Node::ELEMENT_NODE) 
  elem = static_cast&lt;DOM_Element&amp;&gt;(some_node); // OK!
</pre>
<p>This means you must check first whether the <tt class=
"literal">DOM_Node</tt> really does represent an element, and not a
Text node for example. This must be done using the <tt class=
"literal">DOM_Node::getNodeType</tt> member function. Beware of
&quot;down-casting&quot; a Xerces DOM node to the wrong kind of derived type:
you won't get a <tt class="literal">bad_cast</tt> exception, and you will almost
certainly get a crash if you try to call member functions on the
wrong node type. Note that this is a detail of the Xerces
implementation and other implementations may well use &quot;real&quot; C++
hierarchies in which this restriction does not apply.</p>
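<p>One way to keep the check and the cast together is a small helper that returns a null pointer when the node is not an element. The sketch below uses a hypothetical class hierarchy written for this illustration (it is not the Xerces <tt class="literal">DOM_Node</tt> family, which uses handle classes), and <tt class="literal">as_element</tt> is an invented name. The numeric node type values are the ones the DOM recommendation assigns to elements and text.</p>

```cpp
#include <string>

// Hypothetical node hierarchy for illustration; not the Xerces classes.
enum NodeType { ELEMENT_NODE = 1, TEXT_NODE = 3 };

class Node {
public:
    virtual ~Node() {}
    virtual NodeType getNodeType() const = 0;
};

class Element : public Node {
public:
    explicit Element(const std::string& tag) : tag_(tag) {}
    NodeType getNodeType() const override { return ELEMENT_NODE; }
    const std::string& getTagName() const { return tag_; }
private:
    std::string tag_;
};

class Text : public Node {
public:
    explicit Text(const std::string& data) : data_(data) {}
    NodeType getNodeType() const override { return TEXT_NODE; }
    const std::string& getData() const { return data_; }
private:
    std::string data_;
};

// Return the node as an Element, or a null pointer if it is some other
// kind of node. The type check comes first, so the static_cast is only
// reached when it is known to be valid.
const Element* as_element(const Node& node) {
    if (node.getNodeType() != ELEMENT_NODE) {
        return nullptr;
    }
    return static_cast<const Element*>(&node);
}
```

<p>Callers then test the returned pointer instead of remembering to test the node type before every cast.</p>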
<p>You can also see in this case that within an &lt;Entry&gt; node,
the &lt;name&gt; node is not the first child, but the sibling of
the first child. Similarly the &lt;number&gt; node is not
&lt;name&gt;'s next sibling, but the sibling of that sibling node.
This is because the first child of &lt;Entry&gt; in our XML
document is a text node holding the end-of-line character(s) and
indentation, as is the next sibling node of &lt;name&gt;. Whitespace counts! Some
parsers can be set to ignore these nodes but you need to be aware
that they exist, and allow for them if necessary.</p>
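<p>A common way to allow for these nodes is to step along the sibling chain until an element turns up, rather than counting siblings by hand. The following is a minimal sketch of that idea using a hypothetical <tt class="literal">Node</tt> struct with raw sibling and child pointers. The struct and the function names are assumptions made for this illustration and are not part of Xerces or the DOM.</p>

```cpp
#include <string>

// Hypothetical, simplified node for illustration; not the Xerces API.
struct Node {
    enum Type { ELEMENT, TEXT };
    Type type;
    std::string name;          // tag name, or "#text" for text nodes
    const Node* next_sibling;  // null at the end of the sibling chain
    const Node* first_child;   // null if there are no children
};

// Starting at `n`, follow the sibling chain until an element is found.
// Returns a null pointer if the chain ends first.
const Node* skip_to_element(const Node* n) {
    while (n != nullptr && n->type != Node::ELEMENT) {
        n = n->next_sibling;
    }
    return n;
}

// First child that is an element, ignoring any text nodes before it.
const Node* first_child_element(const Node& parent) {
    return skip_to_element(parent.first_child);
}

// Next sibling that is an element, ignoring any text nodes in between.
const Node* next_sibling_element(const Node& node) {
    return skip_to_element(node.next_sibling);
}
```

<p>With helpers like these the code no longer depends on exactly how many whitespace text nodes the parser produced between two elements.</p>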
</div>
<div class="sect1" lang="en">
<div class="titlepage">
<h2><a name="d0e164" id="d0e164"></a>The DOMString
Interface</h2>
</div>
<p>The document object model defines a string type to hold
character data, which in Xerces is called <tt class=
"literal">DOMString</tt>, as you can see above. The <tt class=
"literal">DOMString</tt> class can <span class=
"emphasis"><em>not</em></span> be directly output using normal
<tt class="literal">ostream</tt> inserter operators since it holds
Unicode characters encoded in UTF-16. As the comment at the end of
the code extract above says, this applies whatever encoding your
XML document uses, so you need to convert it to a suitable
representation if you need to output it. In the case of Xerces a
member function called <tt class="literal">transcode</tt> is
provided that returns a pointer to an allocated buffer
containing the string's local representation. This means you can
implement an <tt class="literal">ostream</tt> inserter like
this:</p>
<pre class="programlisting">
std::ostream&amp; operator&lt;&lt;(std::ostream&amp; target, const DOMString&amp; s) { 
  char *p = s.transcode(); 
  target &lt;&lt; p; 
  delete [] p; 
  return target; 
}
</pre>
<p>This is not ideal, since you are responsible for deleting the
memory allocated by the <tt class="literal">transcode</tt> function, and problems can occur
if (for instance) it was allocated from a different heap, as may
be the case in Windows debug environments. It suffices for our
illustration, though. See <a href="#xml-memory">[xml-memory]</a> for more
details.</p>
<p>Other implementations will use different ways of transforming
between different character encodings, which you will have to use
since DOMString is mandated to use UTF-16 internally. Note the IBM
version of Xerces, XML4C, has more Unicode support and may be worth
looking into for those who need it.</p>
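<p>To show what such a transformation involves, here is a sketch that converts UTF-16 code units to UTF-8 by hand. It uses the standard C++11 type <tt class="literal">std::u16string</tt> as a stand-in for a DOM string, targets UTF-8 rather than the local code page, and replaces unpaired surrogates with U+FFFD. All three are choices made for this illustration, and the function name <tt class="literal">utf16_to_utf8</tt> is invented. It is not how Xerces implements <tt class="literal">transcode</tt>.</p>

```cpp
#include <cstddef>
#include <string>

// Convert UTF-16 code units to a UTF-8 byte string.
// Unpaired surrogates become U+FFFD (a policy chosen for this sketch).
std::string utf16_to_utf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {
            // High surrogate: combine with a following low surrogate.
            if (i + 1 < in.size() && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
                ++i;
            } else {
                cp = 0xFFFD;
            }
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            cp = 0xFFFD;  // low surrogate with no high surrogate before it
        }

        // Encode the code point as one to four UTF-8 bytes.
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

<p>Plain ASCII text such as the names in the phone list passes through byte for byte. Only characters outside that range grow to two or more bytes.</p>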
<p>You will get <tt class="literal">DOMString</tt> objects from
any methods that allow you to access text content. We have already
seen <tt class="literal">getNodeValue()</tt>; it and
<tt class="literal">getNodeName()</tt> are general methods of the <tt class=
"literal">DOM_Node</tt> interface whose
return values depend on the node type, giving for example the text
content of a text node or the name of an element. There are also
methods on specific node types that return text in the form of
<tt class="literal">DOMString</tt> objects:</p>
<pre class="programlisting">
DOM_Element::getTagName() 
DOM_Element::getAttribute(name) 
DOM_Attr::getName() 
DOM_Attr::getValue() 
DOM_CharacterData::getData()
</pre>
<p>And so on. The <tt class="literal">DOM_CharacterData</tt> class
is actually a base class of two other types of node, the Comment
node and the Text node. Actual DOM structures will never contain
<tt class="literal">CharacterData</tt> nodes. You can get to the
text data of both of the derived node types using the base class
call to <tt class="literal">getData</tt>.</p>
<p>Note also that it is not guaranteed that a single block of text
is contained in a single DOM Text object. When parsing, some
implementations will break text into a number of Text objects split
at line breaks, entity references or other places.</p>
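<p>In practice this means gathering the text of an element by looping over all its Text children rather than reading only the first one. A minimal sketch of that loop follows. The <tt class="literal">Node</tt> struct and the <tt class="literal">text_content</tt> function are hypothetical and written for this illustration only.</p>

```cpp
#include <string>
#include <vector>

// Hypothetical, simplified child node for illustration; not the Xerces API.
struct Node {
    enum Type { ELEMENT, TEXT, COMMENT };
    Type type;
    std::string data;  // character data for TEXT and COMMENT nodes
};

// Concatenate the data of every Text node in a list of child nodes,
// ignoring comments and child elements.
std::string text_content(const std::vector<Node>& children) {
    std::string result;
    for (const Node& child : children) {
        if (child.type == Node::TEXT) {
            result += child.data;
        }
    }
    return result;
}
```

<p>If a parser splits the content "Smith &amp;amp; Sons" into three Text nodes around the entity reference, the loop still returns the whole string.</p>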
</div>
<div class="sect1" lang="en">
<div class="titlepage">
<h2><a name="d0e228" id=
"d0e228"></a>Attributes</h2>
</div>
<p>There are two ways to get at attributes on XML Elements using
the DOM. First, using the node interface, we can access the
attribute objects attached to an element. Each &lt;Entry&gt;
element in the example above has an attribute called <tt class=
"literal">type</tt>. In the DOM, an attribute is <span class=
"emphasis"><em>not</em></span> a child node of the element it
applies to, and its <tt class="literal">parentNode</tt> is null (DOM level 2
lets you get from an attribute back to its element through the
<tt class="literal">ownerElement</tt> attribute instead). Suppose we
are only interested in those entries whose <tt class=
"literal">type</tt> attribute has a value of &quot;internal&quot;. We can
examine the <tt class="literal">type</tt> attribute on each
iteration and only process those elements that have the required
value. Rather than inserting attributes as
child nodes of elements, the DOM makes available for each element
a list of its attributes. This list is of DOM type <tt class=
"literal">NamedNodeMap</tt> and works like a C++ <tt class=
"literal">std::map</tt> for accessing nodes by name:</p>
<pre class="programlisting">
// Find elements with the attribute type=&quot;internal&quot; in the nodelist entries 
for (unsigned long node_index=0; node_index&lt;entries.getLength(); node_index++) { 
  DOM_Node node=entries.item(node_index); 
  DOM_NamedNodeMap attributes = node.getAttributes(); 
  // Are there any attributes? 
  if (attributes!=0) { 
    DOM_Node attr = attributes.getNamedItem(&quot;type&quot;); 
    // Is there a &quot;type&quot; attribute of the right value? 
    if ((attr!=0) &amp;&amp; (attr.getNodeValue().equals(&quot;internal&quot;))) { 
      // Phew, got there! 
      // Process this &lt;Entry&gt; node...
    } 
  } 
}
</pre>
<p>This method is useful if we don't know what attributes there may
be, or if we want to get the attribute as a node object for some
reason. There is a quicker way of accessing a named attribute
directly:</p>
<pre class="programlisting">
DOMString type_attribute = element.getAttribute(&quot;type&quot;); 
if (type_attribute.equals(&quot;internal&quot;)) { 
  //process 
}
</pre></div>
<div class="sect1" lang="en">
<div class="titlepage">
<h2><a name="d0e260" id="d0e260"></a>Creating XML
using the DOM</h2>
</div>
<p>We have seen how to parse XML using the DOM; now let's see how we
can use the Document Object Model to build a representation of our
own XML document.</p>
<p>You can only add XML to an existing document or document
fragment, so first we need to create a new document. The Document
Object Model does not actually define how implementations create an
initial document object (which is of course a type derived from
<tt class="literal">DOM_Node</tt>). In Xerces this is done using a
method of an instance of the <tt class=
"literal">DOM_DOMImplementation</tt> class:</p>
<pre class="programlisting">
DOM_DOMImplementation impl; 
DOM_Document doc = impl.createDocument(0,                   // Namespace URI if required 
                                       &quot;PhoneList&quot;,         // Root element name 
                                       DOM_DocumentType()); // Default document type object
</pre>
<p>The constructor of <tt class="literal">DOM_Document</tt> does
not create a valid XML document node, only a &quot;shell&quot; which must
be filled in by assigning the result of the <tt class=
"literal">createDocument()</tt> function as shown above. Until then, the only
thing you can do to the object is assign to it. Once this has been
done, the document object will contain a single child, the root
element node of the document.</p>
</div>
<div class="sect1" lang="en">
<div class="titlepage">
<h2><a name="d0e283" id="d0e283"></a>Adding
Elements</h2>
</div>
<p>To add elements to any existing element, including the root
document element, we need to use the <tt class=
"literal">createElement()</tt> and <tt class=
"literal">createTextNode()</tt> members of <tt class=
"literal">DOM_Document</tt> to create new nodes. Although the DOM leaves open how a
new document is created, once we have a document it does define these
methods as the way to create new nodes of various types. Then we
append the new nodes as children:</p>
<pre class="programlisting">
// Create an &lt;Entry&gt; 
DOM_Element Entry = doc.createElement(&quot;Entry&quot;); 
// Create a &lt;name&gt; 
DOM_Element Name = doc.createElement(&quot;name&quot;); 
// Create a &lt;number&gt; 
DOM_Element Number = doc.createElement(&quot;number&quot;); 
// Add the &lt;name&gt; to the &lt;Entry&gt; 
Entry.appendChild(Name); 
// Append the &lt;number&gt; 
Entry.appendChild(Number); 
// Get the document element from the document
DOM_Element DocumentElement = doc.getDocumentElement(); 
// Add the composite &lt;Entry&gt; element to it 
DocumentElement.appendChild(Entry); 
// Create a text node holding the string &quot;Bond&quot; 
DOM_Text NameValue = doc.createTextNode(&quot;Bond&quot;); 
// Append the text to the &lt;name&gt; element 
Name.appendChild(NameValue); 
// Do the same for &lt;number&gt; 
DOM_Text NumberValue = doc.createTextNode(&quot;007&quot;); 
Number.appendChild(NumberValue);
</pre></div>
<div class="sect1" lang="en">
<div class="titlepage">
<h2><a name="d0e299" id="d0e299"></a>Adding
Attributes</h2>
</div>
<p>To add the &quot;type&quot; attribute to the &lt;Entry&gt; element we
could create an attribute object, but it is easier to just call the
setAttribute() method on the element:</p>
<pre class="programlisting">
Entry.setAttribute(&quot;type&quot;, &quot;secret&quot;);
</pre>
<p>This member of <tt class="literal">DOM_Element</tt> makes it
easy to set attributes but at the expense of a little flexibility.
The text passed to the function must be correctly encoded with no
entity references. You can alternatively create an Attribute node
in a similar way to how the Element nodes were created, add Text
and any Entity Reference nodes to it, and call the <tt class=
"literal">setAttributeNode()</tt> method of the <tt class=
"literal">Element</tt> interface:</p>
<pre class="programlisting">
DOM_Attr Attr=doc.createAttribute(&quot;type&quot;); 
Attr.setValue(&quot;secret&quot;); 
// Create any Entity references in the attribute text and add them too... 
// ...now add the attribute to the element 
Entry.setAttributeNode(Attr);
</pre>
<p>If you want to create an attribute node, and the text doesn't
contain any entity references, you can use the <tt class=
"literal">setValue(&quot;text&quot;)</tt> method of the <tt class=
"literal">Attr</tt> interface instead of adding text nodes to
it.</p>
</div>
<div class="sect1" lang="en">
<div class="titlepage">
<h2><a name="d0e327" id="d0e327"></a>Outputting
XML</h2>
</div>
<p>Now we have built this tree in memory, we need to output it. How
you do so depends on the parser. Xerces does not provide a standard
ostream inserter for a <tt class="literal">DOMString</tt>, so we
will have to provide one ourselves. This is most likely because, as
discussed above, the character data in a <tt class=
"literal">DOMString</tt> object is stored in UTF-16 encoding, which
is normally not what we output to plain text files.</p>
<p>This means that to output the contents of a DOM Node to an
<tt class="literal">ostream</tt> we need to do a couple of things:</p>
<p>1. Establish what kind of node it is (document, element, text
and so on) and act accordingly</p>
<p>2. Convert text to the appropriate encoding used by the current
locale.</p>
<p>The first can easily be done using a <tt class=
"literal">switch</tt> statement and involves outputting appropriate
text for the different node types. Some nodes (e.g. document,
element) contain other nodes so they would use recursive calls to
the output function until the whole sub-tree had been dealt
with:</p>
<pre class="programlisting">
std::ostream&amp; operator&lt;&lt;(std::ostream&amp; s, DOM_Node&amp; node) { 
  switch (node.getNodeType()) { 
    case DOM_Node::TEXT_NODE: 
      s &lt;&lt; node.getNodeValue(); 
      break; 
    case DOM_Node::ELEMENT_NODE: 
      s &lt;&lt; '&lt;' &lt;&lt; node.getNodeName(); 
      /*..deal with attributes..*/ 
      s &lt;&lt; '&gt;'; 
      /*..deal with child nodes recursively..*/ 
      s &lt;&lt; &quot;&lt;/&quot; &lt;&lt; node.getNodeName() &lt;&lt; '&gt;'; 
      break; 
    /*.. and so on ..*/ 
  }
  return s; 
}
</pre>
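<p>The two steps elided in the comments above (writing the attributes and recursing into the children) can be sketched on a small stand-alone tree. The <tt class="literal">Node</tt> struct, <tt class="literal">write_node</tt> and <tt class="literal">to_xml</tt> below are hypothetical names invented for this illustration and do not use the Xerces classes. The sketch also assumes that names and text contain no special characters, so it does no escaping.</p>

```cpp
#include <map>
#include <memory>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical, simplified node for illustration; not the Xerces API.
struct Node {
    enum Type { ELEMENT, TEXT };
    Type type;
    std::string name;                               // tag name (elements)
    std::string value;                              // character data (text)
    std::map<std::string, std::string> attributes;  // elements only
    std::vector<std::shared_ptr<Node>> children;    // elements only
};

// Write a node and its whole sub-tree. No escaping is done: the sketch
// assumes the data holds no special characters.
void write_node(std::ostream& s, const Node& node) {
    switch (node.type) {
    case Node::TEXT:
        s << node.value;
        break;
    case Node::ELEMENT:
        s << '<' << node.name;
        // Attributes go inside the start tag as name="value" pairs.
        for (const auto& attr : node.attributes) {
            s << ' ' << attr.first << "=\"" << attr.second << '"';
        }
        s << '>';
        // Each child is written by a recursive call.
        for (const auto& child : node.children) {
            write_node(s, *child);
        }
        s << "</" << node.name << '>';
        break;
    }
}

// Convenience wrapper that returns the serialised sub-tree as a string.
std::string to_xml(const Node& node) {
    std::ostringstream out;
    write_node(out, node);
    return out.str();
}
```

<p>The recursion ends naturally at text nodes and at elements with no children, so no explicit depth tracking is needed.</p>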
<p>A full example can be seen in the Xerces sample program
&quot;DOMPrint.cpp&quot;.</p>
<p>This deals with the node type, but an additional <tt class=
"literal">operator&lt;&lt;</tt> is needed to actually output the
contents of <tt class="literal">DOMString</tt> variables, and will be called by the
function above. It needs to expand the special characters &amp;,
&lt;, &gt;, ', and &quot; into predefined entities and send the contents
of the modified string to the stream. I will leave that as an
exercise!</p>
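<p>As a starting point for that exercise, the expansion itself might look like the sketch below. It works on a <tt class="literal">std::string</tt> that has already been transcoded, which is an assumption made for this illustration, and <tt class="literal">escape_xml</tt> is an invented name rather than a Xerces function.</p>

```cpp
#include <string>

// Replace the five characters that have predefined XML entities.
std::string escape_xml(const std::string& text) {
    std::string out;
    out.reserve(text.size());
    for (char c : text) {
        switch (c) {
        case '&':  out += "&amp;";  break;
        case '<':  out += "&lt;";   break;
        case '>':  out += "&gt;";   break;
        case '\'': out += "&apos;"; break;
        case '"':  out += "&quot;"; break;
        default:   out += c;        break;
        }
    }
    return out;
}
```

<p>Because the input is scanned once and every ampersand is expanded, text that already contains an entity is escaped again. That is the right behaviour when the input is the raw character data held in the DOM.</p>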
</div>
<div class="sect1" lang="en">
<div class="titlepage">
<h2><a name="d0e361" id=
"d0e361"></a>Validation</h2>
</div>
<p>Any XML parser worth its salt should be able to validate the XML
we pass it. There are two requirements we could have checked for
us:</p>
<p>1. Is the XML well-formed (i.e. does it conform to the W3C
XML specification)?</p>
<p>2. Is the XML valid (does it conform to its DTD)?</p>
<p>The first requirement is the most basic: any XML we parse
should be well-formed or we can't call it XML. The Xerces parser
has two options and will either throw an exception or call a
user-supplied error handler if it finds any irregularity in the XML
it is parsing.</p>
<p>The error handler is an object of some class derived from
<tt class="literal">HandlerBase</tt>. There are three severities of
error (warning, error and fatal error), and the parser will call the
appropriately named method of your
class, passing a reference to a <tt class=
"literal">SAXParseException</tt> object describing the error.</p>
<p>The second requirement depends on us having a DTD for the XML we
are parsing. So far, the XML we have seen has been anonymous - that
is with no document type specification. Let's look at the DTD for
our example phone list:</p>
<pre class="programlisting">
&lt;!ELEMENT PhoneList (Entry*)&gt; 
&lt;!ELEMENT Entry (name,number)&gt; 
&lt;!ATTLIST Entry type CDATA #REQUIRED&gt; 
&lt;!ELEMENT name (#PCDATA)&gt; 
&lt;!ELEMENT number (#PCDATA)&gt;
</pre>
<p>We will assume that this exists in a file called PhoneList.dtd.
We can refer to this in our phonelist.xml file with a DOCTYPE
declaration:</p>
<pre class="programlisting">
&lt;!DOCTYPE PhoneList SYSTEM &quot;PhoneList.dtd&quot; &gt;
</pre>
<p>If our parser is validating the XML (with Xerces this is enabled
by calling the <tt class="literal">DOMParser</tt> member
<tt class="literal">setDoValidation()</tt>, before parsing, with
the value <tt class="literal">true</tt>) it will compare the XML we
parse with the DTD given. Some parser-specific action will take
place to report the errors; for instance, with Xerces the error
handler (previously specified, or a default) will be called, and
the parse aborted.</p>
<p>Let's modify our parsing code accordingly:</p>
<pre class="programlisting">
//First an error handler, derive from Xerces HandlerBase 
class MyErrorHandler : public HandlerBase { 
public: 
  void error(const SAXParseException &amp;e) { 
    std::cerr &lt;&lt; &quot;ERROR at line &quot; &lt;&lt; e.getLineNumber() &lt;&lt; std::endl; 
  } 
  void fatalError(const SAXParseException &amp;e) { 
    std::cerr &lt;&lt; &quot;FATAL ERROR at line &quot; &lt;&lt; e.getLineNumber()&lt;&lt;std::endl; 
  } 
  void warning(const SAXParseException &amp;e) { 
    std::cerr &lt;&lt; &quot;WARNING at line &quot; &lt;&lt; e.getLineNumber() &lt;&lt; std::endl; 
  } 
}; 
... 
//Now, tell the parser to validate the XML 
parser.setDoValidation(true); 
try { 
  //Create an error handler 
  MyErrorHandler error_handler; 
  //And tell the parser to use it 
  parser.setErrorHandler(&amp;error_handler); 
  //Now parse the file 
  parser.parse(filename.c_str()); 
} catch (const XMLException&amp; e) { 
  std::cerr &lt;&lt; &quot;An exception occurred during parsing\n Message: &quot; 
            &lt;&lt; DOMString(e.getMessage()) 
            &lt;&lt; std::endl; 
}
</pre>
<p>The error handler method <tt class="literal">error</tt> will be
called in the event that the XML does not conform to the DTD. The
<tt class="literal">fatalError</tt> method will be called in the
event of ill-formed XML, implying that the parse cannot
continue.</p>
<p>The catch block will catch any internal errors that occur during
the parse, or exceptions thrown from within your custom error
handler.</p>
</div>
<div class="sect1" lang="en">
<div class="titlepage">
<h2><a name="d0e413" id="d0e413"></a>DOM
Exceptions</h2>
</div>
<p>You might have noticed that the exception objects passed to the
error handler above were objects of type <tt class=
"literal">SAXParseException</tt>, not <tt class=
"literal">DOM_DOMException</tt>. The Xerces parser uses objects of
this class to encapsulate general parsing errors. I would guess
that this is because Xerces uses SAX internally when parsing
an XML document into a DOM structure. The Document Object Model
does have its own <tt class="literal">DOMException</tt> class that is
supposed to be thrown under various error conditions. The W3C
recommendation states &quot;when an operation is impossible to perform&quot;
but allows implementations to use &quot;native error reporting
mechanisms&quot; if exceptions are not supported. It also says that
general DOM methods return specific error values rather than throw
exceptions.</p>
<p>Xerces will throw <tt class="literal">DOM_DOMException</tt>
objects when you are manipulating DOM data structures or creating
them from scratch. For example, you will get a <tt class=
"literal">DOM_DOMException</tt> if you attempt to substring a
<tt class="literal">DOMString</tt> object with too high an
offset.</p>
<p>This means that you should be prepared to catch a <tt class=
"literal">DOM_DOMException</tt> object (but not during a parse, at
least with Xerces), but it probably indicates a serious problem rather
than a simple error like an ill-formed XML document.</p>
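<p>The substring case can be sketched without Xerces at all. The
fragment below follows the rule the W3C recommendation gives for
<tt class="literal">substringData</tt>: an offset beyond the end of the
data raises <tt class="literal">INDEX_SIZE_ERR</tt>, whereas a count
that runs past the end is simply truncated. Treat it as an illustration
under assumptions: <tt class="literal">DomError</tt> and
<tt class="literal">tryFragment</tt> are hypothetical names, and
<tt class="literal">std::string</tt> counts bytes where the DOM counts
16-bit units.</p>

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for a DOM exception; not the Xerces class.
struct DomError {
    // INDEX_SIZE_ERR has the value 1 in the W3C DOM Core recommendation.
    enum Code { INDEX_SIZE_ERR = 1 };
    Code code;
};

// An offset past the end of the data is an error; a count that runs
// past the end is truncated to the available data.
std::string substringData(const std::string& data,
                          std::size_t offset,
                          std::size_t count) {
    if (offset > data.size()) {
        throw DomError{DomError::INDEX_SIZE_ERR};
    }
    return data.substr(offset, count);
}

// Example caller: turns the exception into a simple success flag.
bool tryFragment(const std::string& data,
                 std::size_t offset,
                 std::size_t count,
                 std::string& out) {
    try {
        out = substringData(data, offset, count);
        return true;
    } catch (const DomError&) {
        return false;
    }
}
```

<p>The point of the caller is that the exception signals a programming
error (a bad offset), not a problem with the document, so it is handled
where the DOM is manipulated rather than in the parser's error
handler.</p>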
</div>
<div class="sect1" lang="en">
<div class="titlepage">
<h2><a name="d0e443" id="d0e443"></a>XML
Namespaces</h2>
</div>
<p>DOM Level 2 (the latest version) includes support for XML
namespaces. In practice this means that, in addition to the
standard DOM accessor functions we have seen, there are some new
ones that allow you to access namespace features.</p>
<p>For example:</p>
<pre class="programlisting">
// Retrieve the identifier (URI) of the namespace a node belongs to 
DOM_Node::getNamespaceURI() 

// Retrieve the namespace prefix of a node 
DOM_Node::getPrefix() 

// Retrieve just the name part of the node name (omitting the namespace prefix) 
DOM_Node::getLocalName() 

// Retrieve the attribute node with the given local name 
// and the given namespace identifier. 
DOM_Element::getAttributeNodeNS(const DOMString&amp; namespaceURI, const DOMString&amp; localName);
</pre>
<p>You can choose whether or not you want to use the namespace
features of the DOM and stick to the appropriate set of method
calls - those that support namespaces or those that do not.</p>
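<p>To make the difference between a prefix and a local name concrete,
here is a small sketch that splits a qualified name at its colon. It is
not how any particular DOM implementation does it, and
<tt class="literal">QName</tt> and
<tt class="literal">splitQName</tt> are hypothetical names. It also
makes no attempt to find the namespace URI, which depends on the
<tt class="literal">xmlns</tt> declarations in scope.</p>

```cpp
#include <string>

// The two parts of a qualified name such as "book:title".
struct QName {
    std::string prefix;     // "book" (empty when there is no prefix)
    std::string localName;  // "title"
};

// Split a qualified name at the first colon.
// A name without a colon has an empty prefix.
QName splitQName(const std::string& qualifiedName) {
    QName result;
    const std::string::size_type colon = qualifiedName.find(':');
    if (colon == std::string::npos) {
        result.localName = qualifiedName;
    } else {
        result.prefix = qualifiedName.substr(0, colon);
        result.localName = qualifiedName.substr(colon + 1);
    }
    return result;
}
```

<p>Given <tt class="literal">book:title</tt> this yields the prefix
<tt class="literal">book</tt> and the local name
<tt class="literal">title</tt>; given plain
<tt class="literal">title</tt> the prefix is empty.</p>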
<div class="bibliography">
<div class="titlepage">
<h2><a name="d0e454" id="d0e454"></a>References</h2>
</div>
<div class="bibliomixed"><a name="ecma-script" id=
"ecma-script"></a>
<p class="bibliomixed">[ecma-script] <span class=
"bibliomisc"><a href="http://www.ecma.ch/ecma1/STAND/ECMA-262.HTM"
target=
"_top">http://www.ecma.ch/ecma1/STAND/ECMA-262.HTM</a></span></p>
</div>
<div class="bibliomixed"><a name="dom-std" id="dom-std"></a>
<p class="bibliomixed">[dom-std] <span class="bibliomisc"><a href=
"http://www.w3.org/TR/REC-DOM-Level-1"
target="_top">http://www.w3.org/TR/REC-DOM-Level-1</a> and
<a href="http://www.w3.org/TR/DOM-Level-2-Core"
target="_top">http://www.w3.org/TR/DOM-Level-2-Core</a></span></p>
</div>
<div class="bibliomixed"><a name="html-dom" id="html-dom"></a>
<p class="bibliomixed">[html-dom] <span class="bibliomisc"><a href=
"http://www.w3.org/TR/REC-DOM-Level-1/level-one-html.html"
target="_top">http://www.w3.org/TR/REC-DOM-Level-1/level-one-html.html</a></span></p>
</div>
<div class="bibliomixed"><a name="xerces" id="xerces"></a>
<p class="bibliomixed">[xerces] <span class="bibliomisc"><a href=
"http://xml.apache.org/xerces-c" target=
"_top">http://xml.apache.org/xerces-c</a></span></p>
</div>
<div class="bibliomixed"><a name="xml-memory" id="xml-memory"></a>
<p class="bibliomixed">[xml-memory] <span class=
"bibliomisc"><a href="http://www.goingware.com/tips/xmlmemory.html"
target=
"_top">http://www.goingware.com/tips/xmlmemory.html</a></span></p>
</div>
</div>
</div>
</p>
</div>
</channel>
</rss>
