Parsing XML is one of those tasks that is easy to get 80% right, and hard to get 100% correct. As a result, there are many XML parsers in Lisp but only a few you can trust to handle all the files you might find. One of the more robust XML parsers is CXML.

Install CXML

Installing CXML by hand can be tedious because it depends on several other Lisp libraries, which in turn depend on other libraries. Fortunately, Quicklisp knows about CXML. All you need to do is:

(ql:quickload "cxml")

Wait until all the downloading and compiling settles down.

Test the XML Parser

If you've never worked with XML before, read the XML quick review below first.

First, do the simple example.xml test given on the CXML Quickstart page. You can create the one-line XML file using the Lisp shown or with your Lisp editor. Note that this XML file doesn't include any DOCTYPE information.
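As a rough sketch of that test, the following creates the one-line file and parses it into a DOM tree. It assumes CXML has already been loaded with Quicklisp and that the current directory is writable. The variable name *example-doc* is made up for illustration.

```lisp
;; Assumes (ql:quickload "cxml") has already been run.

;; Create the one-line XML file. It has no DOCTYPE.
(with-open-file (out "example.xml"
                     :direction :output
                     :if-exists :supersede)
  (write-string "<test a='b'><child/></test>" out))

;; Parse the file. The DOM builder is the handler that makes
;; the parser construct a DOM tree.
;; *example-doc* is an illustrative name, not part of CXML.
(defparameter *example-doc*
  (cxml:parse-file "example.xml" (cxml-dom:make-dom-builder)))
```

Evaluating *example-doc* at the REPL should show a document object that you can pass to the DOM functions described next.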

Use the CXML DOM functions to make sure the DOM tree has all the parts you expect. Unfortunately, the author of CXML didn't explicitly document the DOM functions. Instead, he just says that CXML implements the standard DOM IDL (interface description language). The DOM IDL is class-based, like C++ or Java. Each class has data fields and functions. In CXML, DOM classes, fields, and methods are implemented using CLOS classes and methods. CamelCase names are mapped to hyphenated names, e.g., tagName in the DOM IDL becomes dom:tag-name in Lisp.

To get you started, here are some examples of how the standard DOM IDL is mapped to corresponding CXML methods.

DOM IDL                             CXML
Class          Field or method      Class           Method
Document       documentElement      document        (dom:document-element document)
Document       doctype              document        (dom:doctype document)
Node           childNodes           node            (dom:child-nodes node)
Element        tagName              element         (dom:tag-name element)
Element        getAttribute(name)   element         (dom:get-attribute element name)
CharacterData  data                 character-data  (dom:data character-data)
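To see a few of these mappings in action, here is a small sketch that parses the quickstart XML from a string. It assumes a CXML version whose cxml:parse accepts a string as input, and a Lisp where CXML returns names and values as ordinary strings. The names *doc* and *root* are made up for illustration.

```lisp
;; Assumes (ql:quickload "cxml") has already been run.
;; *doc* and *root* are illustrative names, not part of CXML.
(defparameter *doc*
  (cxml:parse "<test a='b'><child/></test>"
              (cxml-dom:make-dom-builder)))

(defparameter *root* (dom:document-element *doc*))

(dom:tag-name *root*)            ; tag name of the root element
(dom:get-attribute *root* "a")   ; value of the attribute named a
(dom:doctype *doc*)              ; the DOCTYPE node, if the document declares one
(dom:child-nodes *root*)         ; node list holding the child element
```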

Exercise for the reader: From the DOM IDL, figure out what function(s) to call to get all of the attributes defined on a given XML element.

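Various DOM functions, such as dom:child-nodes, return "lists", but these are DOM node lists, not Lisp lists. They behave more like vectors or arrays. Your code will often need to dive deep into these node lists. For clarity and portability across possible changes to CXML data structures, use the DOM node-list operations rather than Lisp sequence functions: dom:length and dom:item, which follow from the DOM IDL NodeList interface, and the iteration helpers dom:do-node-list and dom:map-node-list.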

Using recursion, you can explore the entire XML tree with these functions.
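One possible recursive walk is sketched below. It prints the node name of every node in a tree, indented by depth. print-tree is a hypothetical helper, not part of CXML, and the sketch assumes dom:child-nodes returns an empty node list for nodes without children, as the DOM specification describes.

```lisp
;; Assumes (ql:quickload "cxml") has already been run.
;; PRINT-TREE is a hypothetical helper, not part of CXML.
(defun print-tree (node &optional (depth 0))
  "Print the name of NODE and of every node below it, one per line."
  (format t "~&~A~A~%"
          (make-string (* 2 depth) :initial-element #\Space)
          (dom:node-name node))
  ;; Recur on each child, indenting one level deeper.
  (dom:do-node-list (child (dom:child-nodes node))
    (print-tree child (1+ depth))))
```

For example, (print-tree (cxml:parse "<test a='b'><child/></test>" (cxml-dom:make-dom-builder))) walks the document node, the test element, and the child element in turn."
                        (cxml-dom:make-dom-builder)))
       (out (with-output-to-string (*standard-output*)
              (print-tree doc))))
  (assert (search "test" out))
  (assert (search "    child" out)))
</test>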

After checking out the CXML example, create a new XML file with the book example from the XML quick review and inspect the parts of the XML file. Do you see three children of the book element?

XML quick review

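There are two main ways to process XML: DOM parsing, which reads the whole document into a tree in memory, and event-based (SAX) parsing, which reports events such as the start and end of each element as the document is read.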

For our purposes, DOM parsing will suffice for now.

An XML DOM is a tree data structure. The central element is the XML node. XML elements and documents are subtypes of XML nodes. An XML element is a recursive data structure, consisting of a tag name, zero or more attribute-value pairs, and a body of zero or more child nodes.

Here's an example of a small XML root element in the form it might be found in a file.

<book title="The Moon is Blue" published="1953">
  <author>John Steinbeck</author>
</book>

The root element has the tag name book. It has two attribute-value pairs. Note that values are always strings. The body has another XML element with the tag name author. The author element has no attributes. It has a body consisting of an implicit text node with the data John Steinbeck.

A common mistake when processing XML is forgetting about text nodes. The book element actually has three child nodes: a text node with the whitespace between the closing > on the first line and the opening < on the second line, the author element, and another text node for the whitespace between the closing > on the second line and the opening < on the third line.
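You can check this from Lisp. The sketch below parses the book example from a string and lists the children of the book element. It assumes a CXML version whose cxml:parse accepts a string. The names *book-xml* and *book* are made up for illustration.

```lisp
;; Assumes (ql:quickload "cxml") has already been run.
;; *book-xml* and *book* are illustrative names, not part of CXML.
(defparameter *book-xml*
  "<book title=\"The Moon is Blue\" published=\"1953\">
  <author>John Steinbeck</author>
</book>")

(defparameter *book*
  (dom:document-element
   (cxml:parse *book-xml* (cxml-dom:make-dom-builder))))

;; Print the node name of each child of the book element.
(dom:do-node-list (child (dom:child-nodes *book*))
  (print (dom:node-name child)))

;; Count the children, whitespace text nodes included.
(dom:length (dom:child-nodes *book*))
```

If the whitespace is kept as described above, the loop prints a text node, the author element, and another text node.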

Need more information? Try this tutorial from IBM.

Faculty: Chris Riesbeck
Time: MWF: 11:00am-11:50am
Location: Tech LR 5
