pod

xml

XML Parser and Document Modeling

classes

XAttr

XAttr models an XML attribute in an element.

XDoc

XML document encapsulates the root element and document type.

XDocType

XML document type declaration (but not the whole DTD).

XElem

Models an XML element: its name, attributes, and children nodes.

XNode

XNode is the base class for XElem and XText.

XNs

Models a XML Namespace uri.

XParser

XParser is a simple, lightweight XML parser.

XPi

XML processing instruction node.

XText

XText represents the character data inside an element.

enums

XNodeType

Enumerates the type of XNode and current node of XParser.

errs

XErr

XML exception.

XIncompleteErr

Incomplete document exception indicates that the end of stream was reached before the end of the document was parsed.

docs

Overview

The xml API provides the core APIs for working with XML:

  1. XElem: Provides a standard representation of an XML element tree to be used in memory. It is similar to the W3's DOM, but Fantom centric.
  2. XParser: is a non-validating XML parser. It may be used in two modes: to read an entire XML document into memory or as a pull-parser.

The features supported by XParser:

  • All element, attribute, processing instructions, and character data productions are supported
  • CDATA sections are supported
  • Namespaces are supported at both the element and attribute level
  • Doctype declarations provide access to the public and system identifiers
  • DTDs are ignored (as such you can't use internal or external entity declarations)
  • No access to comments is provided by the XML parser
  • Character data consisting only of whitespace is always ignored

DOM

XML documents are modeled in memory using the following classes:

  • XDoc: models the entire document providing access to the root element, doctype, and processing instructions declared before the root element
  • XElem: models an element name, attributes, and children nodes
  • XText: models character data in an element
  • XPi: models a processing instruction
  • XAttr: models an attribute name/value pair
  • XNs: models the prefix and URI of a XML namespace

XML documents are structured as a tree of nodes using the classes listed above. Typically the tree is built by parsing a XML document from an input stream. But you can construct a document in memory using the APIs and it-blocks:

doc := XDoc
{
  XElem("root")
  {
    XElem("a") { addAttr("flags", "0"); XText("blah"), },
    XElem("b") { XElem("c"), },
    XElem("m") { XText("I "), XElem("i") { XText("mean"), }, XText(" it!"), },
  },
}
doc.write(Env.cur.out)

// would print the following to the console
<?xml version='1.0' encoding='UTF-8'?>
<root>
 <a flags='0'>blah</a>
 <b>
  <c/>
 </b>
 <m>I <i>mean</i> it!</m>
</root>

Namespaces

The XNs class is used model XML namespaces. The default namespace is indicated with the prefix of "". Both XElem and XAttr define a set of methods for working with qualified names:

ns := XNs("x", `urn:foo`)
e := XElem("name", ns)

e.ns      =>  ns
e.prefix  =>  "x"
e.uri     =>  `urn:foo`
e.name    =>  "name"
e.qname   =>  "x:name"

If you construct a document by hand, then you are responsible for creating each element with the correct XNs and ensuring that the appropriate namespace attributes are declared:

nsDef := XNs("", `http://foo/default`)
nsBar := XNs("b", `http://foo/bar`)
root := XElem("root", nsBar)
{
  XAttr.makeNs(nsDef),
  XAttr.makeNs(nsBar),
  XElem("elem", nsDef),
  XElem("elem", nsBar),
}
root.write(Env.cur.out)

// would print the following to the console
<b:root xmlns='http://foo/default' xmlns:b='http://foo/bar'>
 <elem/>
 <b:elem/>
</b:root>

Writing

All the XML node classes support a write method which takes an OutStream. During debugging it is often convenient to write to standard out:

node.write(Env.cur.out)
Env.cur.out.flush

To write an XML document from memory to a file:

out := `test.xml`.toFile.out
doc.write(out)
out.close

If you are generating an XML document on the fly you might want to use OutStream directly. It supports escaping XML control characters via the writeXml method:

// escape markup in text
out.writeChars("<text>").writeXml(text).writeChars("</text>")

// escape markup and quotes
out.writeChars("attr='").writeXml(attrVal, OutStream.xmlEscQuotes).writeChars("'")

You can use Str.toXml to generate a string with XML markup escaped. However, you should prefer OutStream when streaming which is more efficient.

Parsing

The XParser class is used to parse XML input streams into XElems. The easiest way to do this is to parse the entire document into memory using the parseDoc method:

// parse and close input stream
doc := XParser(in).parseDoc

// parse a file
doc := XParser(`test.xml`.toFile.in).parseDoc

// parse a string
doc := XParser("<foo/>".in).parseDoc

The code above parses the document entirely into memory. This tends to be easiest way to work with XML documents. However it can create efficiency problems when parsing large documents, especially when mapping the XElems into other data structures. To support more efficient parsing of XML streams, XParser may also be used to read elements off the input stream one at a time. This is similar to the SAX API, except you pull events instead of having them pushed to you.

To perform pull parsing, use the next method to iterate through the document. This tokenizes the stream into XElem, XText, and XPi chunks. Each call to next advances to the next token and returns its node type. You may also check the type of the current token using nodeType. You may access the current token using elem, text, or pi.

XParser maintains a stack of XElems for you from the root element down to the current element. You may check the depth of the stack using the depth method. Get the current element at any position in the stack via elemAt.

It is very important to understand the XElem at given depth is only valid until the parser returns elemEnd for that depth. After that the element will be reused. A XText instance is only valid until the next call to next. You can make a safe copy of nodes using copy. You can also use parseElem to read the current element and its is descendants into memory during pull-parsing.

Example of pull parsing:

parser := XParser(
  "<root>
     <a>text</a>
   </root>".in)

while (parser.next != null)
  echo("$parser.nodeType $parser.depth $parser.elem")

// prints the following to the console
elemStart 0 <root>
elemStart 1 <a>
text      1 <a>
elemEnd   1 <a>
elemEnd   0 <root>