Title: XML Parsing with the Document Object Model

Author:

Date: 03 October 2002 13:15:55 +01:00

Summary:

Body:

What is the DOM?

Following Tim Pushman's article on parsing XML using SAX, I will describe here the principles and details of the Document Object Model, which defines a standard way to model an XML or HTML document. It describes data structures with standard names and behaviours, and standard functions to access the data. Most XML parsers support the DOM and will parse any well-formed XML document into a DOM structure for you.

But first a bit of history... Long ago when web pages were mainly static text with a few images here and there, the writers of the two main browsers (Netscape and Internet Explorer of course) came up with what they called Dynamic HTML, or DHTML for short. DHTML allowed web pages to access their own content, and to change it according to users' actions. This allowed people to write web pages with images that changed when the user moved the mouse over them, and other fancy effects that they thought would attract more people to their web sites.

Both browsers achieved this by adding a scripting capability to the dialect of HTML that they understood. The two scripting languages were similar but differed in many ways. They were called JavaScript and JScript. Neither is remotely related to Java, and the two have now been unified and standardised into ECMAScript [ecma-script].

In order that a script embedded in an HTML document could have something to work on, a model was needed, through which the script could access the various elements on the web page. The model that was created modelled the actual HTML document itself, so became known as the Document Object Model, or DOM for short. The DOM is now a W3C standard [dom-std] and comes in two slightly different alternatives, for XML or HTML respectively. Although, as we have seen, the DOM had its origins in HTML, the HTML version can be thought of as a slightly specialised version of the XML DOM, and since this is a series of articles on XML we will concentrate on that one here. You can read more about the HTML DOM in the official W3C recommendation [html-dom].

The model described by the DOM can be thought of as a tree-like structure. The tree is made up of a number of different object-like nodes. A node can be any one of a number of different node types, which are effectively data types derived from the basic Node type. In fact the standard does not specify that nodes in the DOM have to be implemented as objects at all, merely that they behave like objects. This allows DOM implementations in non object-oriented languages like C.

In order to get an XML document into a DOM tree you need an XML parser that supports the DOM (most do). You can then parse existing XML documents and create a DOM structure, or create one from scratch. In this article we shall see how to do this, concentrating mostly on the Apache Xerces C++ parser, although the principles apply generally.

The root node of the tree is the document itself. Since this is a tree structure there can only be one root, and that ties in with the concept of an XML document, which can have only one root-level element (the "document element"). The other main types of node in the tree are elements, attributes, and text. In keeping with well-formed XML, the document node contains only one element, while elements can contain other elements, attributes, or text.

This is a good place to explain that, although the DOM defines all these types of node, you don't have to use them. There are actually two interfaces to a document via the DOM: through generic "node" objects (which the W3C recommendation describes as "the primary datatype for the Document Object Model") or through more concrete derived types: "Element" objects, "Attribute" objects, and so on. This means that every node in the tree, whatever its type, could be labelled "node" and accessed through the DOM Node interface. Each DOM Node has an attribute (called nodeType) that indicates what kind of node it actually is.
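The idea of a generic node carrying a type tag can be sketched in a few lines of standard C++. This is only a toy model to illustrate the principle: ToyNode, NodeKind, count_of_kind and make_sample_document are names invented here, not part of Xerces or of any real DOM implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for the DOM nodeType attribute.
enum class NodeKind { Document, Element, Text };

// Hypothetical generic node: every node, whatever its kind, has this shape.
struct ToyNode {
    NodeKind kind;
    std::string name;   // element name, or "#text" / "#document"
    std::string value;  // text content for Text nodes, empty otherwise
    std::vector<ToyNode> children;
};

// Walk the tree through the generic interface only, counting nodes of one kind.
std::size_t count_of_kind(const ToyNode& node, NodeKind kind) {
    std::size_t total = (node.kind == kind) ? 1 : 0;
    for (const ToyNode& child : node.children)
        total += count_of_kind(child, kind);
    return total;
}

// Build <PhoneList><Entry><name>John</name></Entry></PhoneList> by hand.
ToyNode make_sample_document() {
    ToyNode text{NodeKind::Text, "#text", "John", {}};
    ToyNode name{NodeKind::Element, "name", "", {text}};
    ToyNode entry{NodeKind::Element, "Entry", "", {name}};
    ToyNode root{NodeKind::Element, "PhoneList", "", {entry}};
    return ToyNode{NodeKind::Document, "#document", "", {root}};
}
```

The traversal never needs to know the concrete type of a node; it only inspects the tag, which is the role nodeType plays in the DOM.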

Let's look at an example. Take the following piece of XML (yes, it's the familiar hypothetical phone book!):

<?xml version='1.0'?> 
<PhoneList> 
  <Entry type="external"> 
    <name>John</name> 
    <number>123456</number> 
  </Entry> 
  <Entry type="external"> 
    <name>Jane</name> 
    <number>7654321</number> 
  </Entry> 
  <Entry type="internal"> 
    <name>Fred</name> 
    <number>100</number> 
  </Entry> 
</PhoneList>

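The DOM structure of this document would look something like this (whitespace-only text nodes omitted):

Document
  Element: PhoneList
    Element: Entry (attribute type="external")
      Element: name
        Text: John
      Element: number
        Text: 123456
    Element: Entry (attribute type="external")
      Element: name
        Text: Jane
      Element: number
        Text: 7654321
    Element: Entry (attribute type="internal")
      Element: name
        Text: Fred
      Element: number
        Text: 100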

Once parsed into a DOM tree (we will see how to do this in a minute) you can access the XML document using the DOM Document interface. This has a documentElement attribute to access the single document element (PhoneList in this case). As an example of the methods provided by the DOM, you could find its descendant elements with a specific name using the getElementsByTagName method.

Parsing

Different XML parsers and processors have different ways to initiate the parse process that builds the DOM tree. The examples shown here all use the Xerces parser [xerces] from Apache. This parser is closely related to the IBM XML4C parser (IBM donated an early version to Apache, and their subsequent versions are based on Xerces).

To do the parse we need a DOMParser object. We then call the parse() method, giving it the name of an XML file. The program below shows a minimal Xerces DOM program. (Xerces has two sides to its personality, also supporting the Simple API for XML, SAX, but this article ignores that face of Xerces.) All error checking and exception catching have been omitted for the usual space and clarity reasons (but see later for details):

#include <util/PlatformUtils.hpp> 
#include <parsers/DOMParser.hpp> 
#include <string> 
#include <iostream> 
int main() { 
  // Initialise the XML processor 
  XMLPlatformUtils::Initialize(); 
  std::cout<<"Enter the name of an XML file:"; 
  std::string filename; 
  std::cin>>filename; 
  DOMParser parser; 
  parser.parse(filename.c_str()); 
}

Reading Elements and Text

The following code fragment reads all the <Entry> elements from the previous XML document:

DOM_Document phonelist=parser.getDocument(); 

// Get the <PhoneList> element 
DOM_Element root = phonelist.getDocumentElement(); 

// Get all <Entry> elements into a DOM "nodelist" structure 
DOM_NodeList entries = root.getElementsByTagName("Entry"); 

// Now iterate through the nodelist, processing each element found there 
for (unsigned long node_index=0; node_index< entries.getLength(); node_index++) { 
  //Deal with entry 
}

You will notice that the getElementsByTagName method provides us with a NodeList rather than an ElementList. This is because the DOM defines a list of nodes (which are all elements in this case) but not a list of elements as such. We have to deal with a base node object rather than a concrete Element object. More about that in a moment, but first let's see how we extract the name and number from those entries using the DOM firstChild and nextSibling attributes:

for (unsigned long node_index=0; node_index< entries.getLength(); node_index++) { 
  // Get each <Entry> element from the list 
  DOM_Node node=entries.item(node_index); 
  // Get the <name> node - the next sibling of <Entry>'s first child node 
  DOM_Node name_node = node.getFirstChild().getNextSibling(); 
  // Get the text node - a child of the <name> node 
  DOM_Node name_text = name_node.getFirstChild(); 
  // Finally get the actual text 
  DOMString name=name_text.getNodeValue(); 

  // Repeat for <number> 
  DOM_Node number_node = name_node.getNextSibling().getNextSibling(); 
  DOM_Node number_text = number_node.getFirstChild(); 
  DOMString number=number_text.getNodeValue(); 

  // Process name and number, remembering they are Unicode strings encoded in UTF-16, 
  // whatever the XML declaration of this particular document says 
}

The above code shows that, although firstChild and nextSibling are attributes (not methods) of a DOM node, in this particular DOM parser (Xerces) they are accessed via member functions getFirstChild() and getNextSibling(). This is because the DOM specifies its interfaces in IDL (as used by CORBA, and similar to that used by COM) and does not specify the technology to be used to implement these interfaces. The Xerces parser chooses to define objects with get/set functions to represent attributes - like COM - hence the names used. This is implementation-defined though, and another parser may well access attributes of DOM objects as simple member variables.

As mentioned earlier the DOM Node interface is an alternative to the individual node-type interfaces. Inasmuch as each type of node "is a" DOM Node, we have an inheritance relationship, easily implemented using a base Node class and derived Document class, Element class, and so on. However Xerces chooses not to implement it like that. For whatever reason (probably related to the fact that Xerces was implemented in Java before C++) you can only convert a Xerces DOM_Node type to a DOM_Element by using a static_cast:

DOM_Node some_node; 
DOM_Element elem; 
/*...assign some_node...*/ 
// elem = dynamic_cast<DOM_Element&>(some_node);  // WRONG! Do not do this 
// Check the node type first, then use static_cast 
if (some_node.getNodeType() == DOM_Node::ELEMENT_NODE) 
  elem = static_cast<DOM_Element&>(some_node); // OK!

This means you must check first whether the DOM_Node really does represent an element, and not a Text node for example. This must be done using the DOM_Node::getNodeType member function. If you try to "down-cast" a Xerces DOM node to the wrong kind of derived type beware - you won't get a bad_cast exception, and you will almost certainly get a crash if you try to call member functions on the wrong node type. Note that this is a detail of the Xerces implementation and other implementations may well use "real" C++ hierarchies in which this restriction does not apply.

You can also see in this case that within an <Entry> node, the <name> node is not the first child, but the sibling of the first child. Similarly the <number> node is not <name>'s next sibling, but the sibling of that sibling node. This is because the first child of <Entry> in our XML document is a text node holding the End-of-Line character(s), as is the first sibling node of <name>. Whitespace counts! Some parsers can be set to ignore these nodes but you need to be aware that they exist, and allow for them if necessary.
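One way to allow for these whitespace nodes is to step over anything that is not an element when looking for a child. The sketch below shows the idea on a toy tree in standard C++; ToyNode, NodeKind, nth_child_element and make_entry are hypothetical names invented for this illustration and are not Xerces calls.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

enum class NodeKind { Element, Text };

// Hypothetical minimal node: just enough to show the traversal.
struct ToyNode {
    NodeKind kind;
    std::string name;   // element name; unused for text nodes
    std::string value;  // text content; unused for elements
    std::vector<ToyNode> children;
};

// Return the n-th child (0-based) that is an element, skipping text nodes
// such as the line breaks between tags. Returns nullptr if there is none.
const ToyNode* nth_child_element(const ToyNode& parent, std::size_t n) {
    for (const ToyNode& child : parent.children) {
        if (child.kind != NodeKind::Element) continue;
        if (n == 0) return &child;
        --n;
    }
    return nullptr;
}

// Models an <Entry> as a parser might deliver it, with a whitespace text
// node before, between and after the <name> and <number> elements.
ToyNode make_entry() {
    const ToyNode whitespace{NodeKind::Text, "", "\n  ", {}};
    const ToyNode name{NodeKind::Element, "name", "",
                       {ToyNode{NodeKind::Text, "", "John", {}}}};
    const ToyNode number{NodeKind::Element, "number", "",
                         {ToyNode{NodeKind::Text, "", "123456", {}}}};
    return ToyNode{NodeKind::Element, "Entry", "",
                   {whitespace, name, whitespace, number, whitespace}};
}
```

With a helper like this the calling code no longer depends on exactly how many whitespace nodes the parser happens to produce.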

The DOMString Interface

The Document Object Model defines a string type to hold character data, which in Xerces is called DOMString, as you can see above. The DOMString class cannot be directly output using normal ostream inserter operators since it holds Unicode characters encoded in UTF-16. As the comment at the end of the code extract above says, this applies whatever encoding your XML document uses, so you need to convert it to a suitable representation if you need to output it. In the case of Xerces a member function called transcode is provided that returns a pointer to an allocated buffer containing the string's local representation. This means you can implement an ostream inserter like this:

std::ostream& operator<<(std::ostream& target, const DOMString& s) { 
  char *p = s.transcode(); 
  target << p; 
  delete [] p; 
  return target; 
}

This is not ideal, since you are responsible for deleting the memory allocated by the transcode function, and problems can occur if (for instance) it was allocated from a different heap, as may be the case in Windows debug environments. It suffices for our illustration, though. See [xml-memory] for more details.

Other implementations will use different ways of transforming between different character encodings, which you will have to use since DOMString is mandated to use UTF-16 internally. Note the IBM version of Xerces, XML4C, has more Unicode support and may be worth looking into for those who need it.
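To give a feel for what such a transformation involves, here is a minimal sketch that converts UTF-16 to UTF-8 using only the standard library. It is an illustration, not what Xerces does: transcode() targets the local code page, whereas the choice of UTF-8 here is an assumption, and utf16_to_utf8 and append_utf8 are names invented for this example. Unpaired surrogates are replaced with U+FFFD.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Append one Unicode code point to out, encoded as UTF-8.
void append_utf8(std::string& out, char32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Convert a UTF-16 string to UTF-8, combining surrogate pairs.
std::string utf16_to_utf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {  // high surrogate
            if (i + 1 < in.size() && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
                ++i;  // the low surrogate has been consumed too
            } else {
                cp = 0xFFFD;  // unpaired high surrogate
            }
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            cp = 0xFFFD;  // stray low surrogate
        }
        append_utf8(out, cp);
    }
    return out;
}
```

Because the function returns a std::string, the caller has no buffer to delete, which avoids the ownership problem described above.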

You will get DOMString objects from any methods that allow you to access text content. We have already seen getNodeValue(); it and getNodeName() are general methods of the DOM_Node interface whose return values depend on the node type, giving for example the text content of a text node or the name of an element. There are also methods on specific node types that return text in the form of DOMString objects:

DOM_Element::getTagName() 
DOM_Element::getAttribute(name) 
DOM_Attr::getName() 
DOM_Attr::getValue() 
DOM_CharacterData::getData()

And so on. The DOM_CharacterData class is actually a base class of two other types of node, the Comment node and the Text node. Actual DOM structures will never contain CharacterData nodes. You can get to the text data of both of the derived node types using the base class call to getData.

Note also that it is not guaranteed that a single block of text is contained in a single DOM Text object. When parsing, some implementations will break text into a number of Text objects split at line breaks, entity references or other places.
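A defensive way to read an element's text is therefore to concatenate all of its Text children rather than trusting the first one. The following is a toy sketch of that idea in standard C++; ToyNode, NodeKind and child_text are hypothetical names for illustration, not Xerces types.

```cpp
#include <cassert>
#include <string>
#include <vector>

enum class NodeKind { Element, Text, Comment };

// Hypothetical minimal node for the illustration.
struct ToyNode {
    NodeKind kind;
    std::string value;  // character data for Text and Comment nodes
    std::vector<ToyNode> children;
};

// Concatenate the data of every Text node directly beneath parent, in
// document order. Comments and child elements are skipped.
std::string child_text(const ToyNode& parent) {
    std::string result;
    for (const ToyNode& child : parent.children) {
        if (child.kind == NodeKind::Text)
            result += child.value;
    }
    return result;
}
```

For example, if a parser delivered the content of an element as three Text nodes holding "AT", "&" and "T" (as it might around an entity reference), child_text would still give back the single string the document author wrote.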

Attributes

There are two ways to get at attributes on XML elements using the DOM. First, using the node interface, we can access the attribute objects attached to an element. Each <Entry> element in the example above has an attribute called type. In the DOM, an attribute is not a child node of the element it applies to. (Nor, according to the W3C recommendation, is the element the attribute's parent: the parentNode of an attribute node is null, and DOM Level 2 provides ownerElement for getting back to the element.)

Suppose we are only interested in those entries whose type attribute has a value of "internal". We can examine the type attribute on each iteration and only process those elements that have the required value. Rather than inserting attributes as child nodes of elements, the DOM makes available for each element a list of its attributes. This list is of DOM type NamedNodeMap and works like a C++ std::map for accessing nodes by name:

// Find elements with attribute type="internal" in the list of elements `entries' 
for (unsigned long node_index=0; node_index<entries.getLength(); node_index++) { 
  DOM_Node node=entries.item(node_index); 
  DOM_NamedNodeMap attributes = node.getAttributes(); 
  // Are there any attributes? 
  if (attributes!=0) { 
    DOM_Node attr = attributes.getNamedItem("type"); 
    // Is there a "type" attribute of the right value? 
    if ((attr!=0) && (attr.getNodeValue().equals("internal"))) { 
      // Phew, got there! 
      // Process this <Entry> node...
    } 
  } 
}

This method is useful if we don't know what attributes there may be, or if we want to get the attribute as a node object for some reason. There is a quicker way of accessing a named attribute directly:

DOMString type_attribute = element.getAttribute("type"); 
if (type_attribute.equals("internal")) { 
  //process 
}
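The comparison with std::map can be made literal. The sketch below models an element's attribute list as a std::map and filters a list of elements by attribute value, in standard C++ only. ToyElement, get_attribute and select_by_attribute are hypothetical names for this illustration; returning an empty string for a missing attribute mirrors the behaviour the DOM specifies for getAttribute.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical element: a tag name plus a name-to-value attribute map.
struct ToyElement {
    std::string tag;
    std::map<std::string, std::string> attributes;
};

// The attribute's value, or an empty string if the attribute is absent.
std::string get_attribute(const ToyElement& element, const std::string& name) {
    const auto found = element.attributes.find(name);
    return found == element.attributes.end() ? std::string() : found->second;
}

// Keep only the elements whose attribute `name` has the value `wanted`.
std::vector<ToyElement> select_by_attribute(const std::vector<ToyElement>& elements,
                                            const std::string& name,
                                            const std::string& wanted) {
    std::vector<ToyElement> matches;
    for (const ToyElement& element : elements) {
        if (get_attribute(element, name) == wanted)
            matches.push_back(element);
    }
    return matches;
}
```

Looking up by name and comparing the value is all that either of the two Xerces fragments above is doing; the first simply exposes the intermediate attribute node as well.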

Creating XML using the DOM

We have seen how to parse XML using the DOM; now let's see how we can use the Document Object Model to build a representation of our own XML document.

You can only add XML to an existing document or document fragment, so first we need to create a new document. The Document Object Model does not actually define how implementations create an initial document object (which is of course a type derived from DOM_Node). In Xerces this is done using a method of an instance of the DOM_DOMImplementation class:

DOM_DOMImplementation impl; 
DOM_Document doc = impl.createDocument(0,                   // Namespace URI if required 
                                       "PhoneList",         // Root element name 
                                       DOM_DocumentType()); // Default document type object

The constructor of DOM_Document does not create a valid XML document node, only a "shell" which must be filled in by assigning the result of the createDocument() function as shown above. The only thing you can do to this object is assign to it. Once this has been done, the document object will contain a single child, the root element node of the document.

Adding Elements

To add elements to any existing element, including the root document element, we need to use the createElement() and createTextNode() members of DOM_Document to create new nodes. Unlike creating a new document, once we have a document the DOM does define these methods as a way to create new nodes of varying types. Then we append the new nodes as children:

// Create an <Entry> 
DOM_Element Entry = doc.createElement("Entry"); 
// Create a <name> 
DOM_Element Name = doc.createElement("name"); 
// Create a <number> 
DOM_Element Number = doc.createElement("number"); 
// Add the <name> to the <Entry> 
Entry.appendChild(Name); 
// Append the <number> 
Entry.appendChild(Number); 
// Get the document element from the document 
DOM_Element DocumentElement = doc.getDocumentElement(); 
// Add the composite <Entry> element to it 
DocumentElement.appendChild(Entry); 
// Create a text node holding the string "Bond" 
DOM_Text NameValue = doc.createTextNode("Bond"); 
// Append the text to the <name> element 
Name.appendChild(NameValue); 
// Do the same for <number> 
DOM_Text NumberValue = doc.createTextNode("007"); 
Number.appendChild(NumberValue);

Adding Attributes

To add the "type" attribute to the <Entry> element we could create an attribute object, but it is easier to just call the setAttribute() method on the element:

Entry.setAttribute("type", "secret");

This member of DOM_Element makes it easy to set attributes but at the expense of a little flexibility. The text passed to the function must be correctly encoded with no entity references. You can alternatively create an Attribute node in a similar way to how the Element nodes were created, add Text and any Entity Reference nodes to it, and call the setAttributeNode() method of the Element interface:

DOM_Attr Attr=doc.createAttribute("type"); 
Attr.setValue("secret"); 
// Create any Entity references in the attribute text and add them too... 
// ...now add the attribute to the element 
Entry.setAttributeNode(Attr);

If you want to create an attribute node, and the text doesn't contain any entity references, you can use the setValue("text") method of the Attr interface instead of adding text nodes to it.

Outputting XML

Now we have built this tree in memory, we need to output it. How you do so depends on the parser. Xerces does not provide a standard ostream inserter for a DOMString, so we will have to provide one ourselves. This is most likely because, as discussed above, the character data in a DOMString object is stored in UTF-16 encoding, which is normally not what we output to plain text files.

This means that to output the contents of a DOM Node to an ostream we need to do a couple of things:

1. Establish what kind of node it is (document, element, text and so on) and act accordingly

2. Convert text to the appropriate encoding used by the current locale.

The first can easily be done using a switch statement and involves outputting appropriate text for the different node types. Some nodes (e.g. document, element) contain other nodes so they would use recursive calls to the output function until the whole sub-tree had been dealt with:

std::ostream& operator<<(std::ostream& s, DOM_Node& node) { 
  switch (node.getNodeType()) { 
    case DOM_Node::TEXT_NODE: 
      s << node.getNodeValue(); 
      break; 
    case DOM_Node::ELEMENT_NODE: 
      s << '<' << node.getNodeName(); 
      /*..deal with attributes..*/ 
      s << '>'; 
      /*..deal with child nodes recursively..*/ 
      s << "</" << node.getNodeName() << '>'; 
      break; 
    /*.. and so on ..*/ 
  }
  return s; 
}

A full example can be seen in the Xerces sample program "DOMPrint.cpp".
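For readers without the Xerces samples to hand, the recursive shape of such an output function can be seen in this self-contained sketch, which works on a toy tree rather than on real DOM nodes. ToyNode, NodeKind, make_text, make_element, write_node and serialise are all hypothetical names invented here, and no escaping of special characters is attempted.

```cpp
#include <cassert>
#include <map>
#include <ostream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

enum class NodeKind { Element, Text };

// Hypothetical minimal node for the illustration.
struct ToyNode {
    NodeKind kind;
    std::string name;                               // element name
    std::string value;                              // text content
    std::map<std::string, std::string> attributes;  // elements only
    std::vector<ToyNode> children;
};

ToyNode make_text(const std::string& value) {
    return ToyNode{NodeKind::Text, "", value, {}, {}};
}

ToyNode make_element(const std::string& name, std::vector<ToyNode> children) {
    return ToyNode{NodeKind::Element, name, "", {}, std::move(children)};
}

// Write a node and, recursively, everything beneath it.
// Special characters are written as-is; escaping is a separate concern.
void write_node(std::ostream& out, const ToyNode& node) {
    switch (node.kind) {
    case NodeKind::Text:
        out << node.value;
        break;
    case NodeKind::Element:
        out << '<' << node.name;
        for (const auto& attribute : node.attributes)
            out << ' ' << attribute.first << "=\"" << attribute.second << '"';
        out << '>';
        for (const ToyNode& child : node.children)
            write_node(out, child);
        out << "</" << node.name << '>';
        break;
    }
}

// Convenience wrapper returning the serialised form as a string.
std::string serialise(const ToyNode& node) {
    std::ostringstream out;
    write_node(out, node);
    return out.str();
}
```

Each element writes its own start tag, delegates to write_node for its children, and then writes its end tag, so the depth of the tree never has to be known in advance.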

This deals with the node type, but an additional operator<< is needed to actually output the contents of DOMString variables, and will be called by the function above. It needs to expand the special characters &, <, >, ', and " into predefined entities and send the contents of the modified string to the stream. I will leave that as an exercise!
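As a hint towards that exercise, the escaping step on its own might look like the sketch below. It assumes the text has already been converted to a narrow, ASCII-compatible std::string, and escape_xml is a name invented for this example; it is not a Xerces function.

```cpp
#include <cassert>
#include <string>

// Replace the five characters that have predefined XML entities.
// Everything else is copied through unchanged.
std::string escape_xml(const std::string& text) {
    std::string out;
    out.reserve(text.size());
    for (char c : text) {
        switch (c) {
        case '&':  out += "&amp;";  break;
        case '<':  out += "&lt;";   break;
        case '>':  out += "&gt;";   break;
        case '\'': out += "&apos;"; break;
        case '"':  out += "&quot;"; break;
        default:   out += c;        break;
        }
    }
    return out;
}
```

Note that the function makes no attempt to detect text that is already escaped, so it should be applied exactly once, to the raw character data taken from the DOM.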

Validation

Any XML parser worth its salt should be able to validate the XML we pass it. There are two requirements we could have checked for us:

1. Is the XML well-formed XML (i.e. does it conform to the W3C XML specification)?

2. Is the XML valid (does it conform to its DTD)?

The first requirement is the most basic - any XML we parse should be well-formed or we can't call it XML. The Xerces parser has two options and will either throw an exception or call a user-supplied error handler if it finds any irregularity in the XML it is parsing.

The error handler is an object of some class derived from HandlerBase. There are three severities of error (warning, error and fatal error); the parser will call the appropriately named method of your class, passing a reference to a SAXParseException object describing the error.

The second requirement depends on us having a DTD for the XML we are parsing. So far, the XML we have seen has been anonymous - that is with no document type specification. Let's look at the DTD for our example phone list:

<!ELEMENT PhoneList (Entry*)> 
<!ELEMENT Entry (name,number)> 
<!ATTLIST Entry type CDATA #REQUIRED> 
<!ELEMENT name (#PCDATA)> 
<!ELEMENT number (#PCDATA)>

We will assume that this exists in a file called PhoneList.dtd. We can refer to this in our phonelist.xml file with a DOCTYPE declaration:

<!DOCTYPE PhoneList SYSTEM "PhoneList.dtd" >

If our parser is validating the XML (with Xerces this is enabled by calling the DOMParser member setDoValidation(), before parsing, with the value true) it will compare the XML we parse with the DTD given. Some parser-specific action will take place to report the errors, for instance with Xerces the error handler (previously specified, or a default) will be called, and the parse aborted.

Let's modify our parsing code accordingly:

//First an error handler, derive from Xerces HandlerBase 
class MyErrorHandler : public HandlerBase { 
public: 
  void error(const SAXParseException &e) { 
    std::cerr << "ERROR at line " << e.getLineNumber() << std::endl; 
  } 
  void fatalError(const SAXParseException &e) { 
    std::cerr << "FATAL ERROR at line " << e.getLineNumber()<<std::endl; 
  } 
  void warning(const SAXParseException &e) { 
    std::cerr << "WARNING at line " << e.getLineNumber() << std::endl; 
  } 
}; 
... 
//Now, tell the parser to validate the XML 
parser.setDoValidation(true); 
try { 
  //Create an error handler 
  MyErrorHandler error_handler; 
  //And tell the parser to use it 
  parser.setErrorHandler(&error_handler); 
  //Now parse the file 
  parser.parse(filename.c_str()); 
} catch (const XMLException& e) { 
  std::cerr << "An exception occurred during parsing\n Message: " 
            << DOMString(e.getMessage()) 
            << std::endl; 
}

The error handler method error will be called in the event that the XML does not conform to the DTD. The fatalError method will be called in the event of ill-formed XML, implying that the parse cannot continue.

The catch block will catch any internal errors that occur during the parse, or exceptions thrown from within your custom error handler.

DOM Exceptions

You might have noticed that the exception objects passed to the error handler above were objects of type SAXParseException, not DOM_DOMException. The Xerces parser uses objects of this class to encapsulate general parsing errors. I would guess that this could be because Xerces uses SAX internally when parsing an XML document into a DOM structure. The Document Object Model does have its own exception class that is supposed to be thrown under various error conditions - the W3C recommendation states "when an operation is impossible to perform" but allows implementations to use "native error reporting mechanisms" if exceptions are not supported. It also says that, in general, DOM methods return specific error values in ordinary processing situations rather than throw exceptions.

Xerces will throw DOM_DOMException objects when you are manipulating DOM data structures or creating them from scratch. For example, you will get a DOM_DOMException if you attempt to substring a DOMString object with too high an offset.

This means that you should be prepared to catch a DOM_DOMException object (though not during a parse, at least with Xerces). When one is thrown it probably indicates a serious problem rather than a simple error like an ill-formed XML document.

XML Namespaces

DOM Level 2 (the latest version) includes support for XML namespaces. In practice this means that, in addition to the standard DOM accessor functions we have seen, there are some new ones that allow you to access namespace features.

For example:

// Retrieve the identifier of the namespace a node belongs to 
DOM_Node::getNamespaceURI() 

// Retrieve the namespace prefix of a node 
DOM_Node::getPrefix() 

// Retrieve just the name part of the node name (omitting the namespace prefix) 
DOM_Node::getLocalName() 

// Retrieve the attribute node with the given local name 
// and the given namespace identifier. 
DOM_Element::getAttributeNodeNS(const DOMString& namespaceURI, const DOMString& localName);

You can choose whether or not you want to use the namespace features of the DOM and stick to the appropriate set of method calls - those that support namespaces or those that do not.
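The relationship between a node's full name, its prefix and its local name can be shown with a few lines of standard C++. This sketch only splits the qualified name at the colon; resolving the prefix to a namespace URI needs the namespace declarations in scope and is not attempted. QName and split_qname are hypothetical names for this illustration.

```cpp
#include <cassert>
#include <string>

// The two parts of a qualified name such as "ph:Entry".
struct QName {
    std::string prefix;      // "ph", or empty if there is no prefix
    std::string local_name;  // "Entry"
};

// Split a qualified name at its first colon, if it has one.
QName split_qname(const std::string& qualified) {
    const std::string::size_type colon = qualified.find(':');
    if (colon == std::string::npos)
        return QName{"", qualified};
    return QName{qualified.substr(0, colon), qualified.substr(colon + 1)};
}
```

For a node named "ph:Entry" this gives the prefix "ph" and the local name "Entry", which correspond to what the prefix and local-name accessors listed above report; an unprefixed name is returned whole as the local name.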