Parsing XML is one of those tasks that is easy to get 80% right, and hard to get 100% correct. As a result, there are many XML parsers in Lisp but only a few you can trust to handle all the files you might find. One of the more robust XML parsers is CXML.
Install CXML
Installing CXML by hand can be tedious because it depends on several other Lisp libraries, which in turn depend on still other libraries. Fortunately, Quicklisp knows about CXML. All you need to do is:
(ql:quickload "cxml")
Wait until all the downloading and compiling settles down.
Test the XML Parser
If you've never worked with XML before, read the XML quick review below first.
First, do the simple example.xml test given on the CXML Quickstart page. You can create the one-line XML file using the Lisp shown or with your Lisp editor. Note that this XML file doesn't include any DOCTYPE information.
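As a rough sketch of that first test, the following writes a one-line XML file and parses it into a DOM tree, along the lines of the Quickstart example. The variable name *example-doc* is made up here for illustration, and the file is written to the current working directory.

```lisp
(ql:quickload "cxml")

;; Create the one-line example file. Note that it has no DOCTYPE.
(with-open-file (out "example.xml" :direction :output :if-exists :supersede)
  (write-string "<test a='b'><child/></test>" out))

;; Parse the file into a DOM tree.
;; *EXAMPLE-DOC* is a hypothetical name, not something CXML defines.
(defparameter *example-doc*
  (cxml:parse-file "example.xml" (cxml-dom:make-dom-builder)))
```

Keeping the parsed document in a variable makes it easy to try the DOM functions on it at the REPL.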
Use the CXML DOM functions to make sure the DOM tree has all the parts you expect. Unfortunately, the author of CXML didn't explicitly document the DOM functions. Instead, he just says that CXML implements the standard DOM IDL (interface description language). The DOM IDL is class-based, like C++ or Java. Each class has data fields and functions. In CXML, DOM classes, fields, and methods are implemented using CLOS classes and methods. CamelCase names are mapped to hyphenated names, e.g., tagName in the DOM IDL becomes dom:tag-name in Lisp.
To get you started, here are some examples of how the standard DOM IDL is mapped to corresponding CXML methods.
DOM IDL class | DOM IDL field or method | CXML class | CXML method |
---|---|---|---|
Document | documentElement | document | (dom:document-element document) |
Document | doctype | document | (dom:doctype document) |
Node | childNodes | node | (dom:child-nodes node) |
Element | tagName | element | (dom:tag-name element) |
Element | getAttribute(name) | element | (dom:get-attribute element name) |
CharacterData | data | character-data | (dom:data character-data) |
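Here is a small sketch that applies a few of these mappings to the one-line Quickstart document, parsed from a string rather than a file. The name *doc* is hypothetical. dom:length is not in the table; it is assumed here to be the CXML name for the NodeList length field, following the same naming rule.

```lisp
(ql:quickload "cxml")

;; *DOC* is a hypothetical name used only in this example.
(defparameter *doc*
  (cxml:parse "<test a='b'><child/></test>" (cxml-dom:make-dom-builder)))

(let ((root (dom:document-element *doc*)))        ; Document.documentElement
  (list (dom:tag-name root)                       ; Element.tagName
        (dom:get-attribute root "a")              ; Element.getAttribute(name)
        (dom:length (dom:child-nodes root))))     ; Node.childNodes, NodeList.length
```

The list should contain the root's tag name, the value of its a attribute, and the number of its child nodes.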
Exercise for the reader: From the DOM IDL, figure out what function(s) to call to get all of the attributes defined on a given XML element.
Various DOM functions, such as dom:child-nodes, return "lists" but they are not Lisp lists. They are more like vectors or arrays. Your code will often need to dive deep into these vectors. For clarity and portability across possible changes to CXML data structures, use the following utilities for these vectors:
- (dom:item nodelist n): like nth, this returns the Nth element (zero-based) of the nodelist
- (dom:map-node-list fn nodelist): like mapc, this calls the function on each node in the nodelist
- (dom:do-node-list (var nodelist) body): like dolist, this executes the body once for each node in the nodelist, with the variable var bound to each node.
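The three utilities above might be used as in the following sketch. The document string, the variable *kids*, and the helper child-tag-names are invented for illustration. The XML has no whitespace between tags, so every child is an element and dom:tag-name is safe to call on each one.

```lisp
(ql:quickload "cxml")

;; *KIDS* holds the node list of the root's children (hypothetical name).
(defparameter *kids*
  (dom:child-nodes
   (dom:document-element
    (cxml:parse "<list><a/><b/><c/></list>" (cxml-dom:make-dom-builder)))))

;; Like nth: fetch the second child (indexes are zero-based).
(dom:tag-name (dom:item *kids* 1))

;; Like mapc: call a function on each node for its side effect.
(dom:map-node-list (lambda (node) (print (dom:tag-name node))) *kids*)

;; Like dolist: bind NODE to each child in turn.
(defun child-tag-names (nodelist)
  "Return a Lisp list of the tag names of the elements in NODELIST."
  (let ((names '()))
    (dom:do-node-list (node nodelist)
      (push (dom:tag-name node) names))
    (nreverse names)))
```

Copying results into an ordinary Lisp list, as child-tag-names does, is one way to keep the rest of your code independent of how CXML stores node lists.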
Using recursion, you can explore the entire XML tree with these functions.
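One possible recursive walker is sketched below. The function name print-node-tree is made up, and it relies on dom:node-name, the CXML counterpart of the DOM IDL nodeName field, which is defined for every kind of node. It also assumes that dom:child-nodes returns an empty node list for nodes without children, as the DOM standard specifies.

```lisp
(ql:quickload "cxml")

(defun print-node-tree (node &optional (depth 0))
  "Print the name of NODE, then each of its descendants, indented by depth."
  (format t "~&~A~A~%"
          (make-string (* 2 depth) :initial-element #\Space)
          (dom:node-name node))
  ;; Recurse into every child: elements, text nodes, and so on.
  (dom:do-node-list (child (dom:child-nodes node))
    (print-node-tree child (1+ depth))))

;; Example call on a small document parsed from a string.
(print-node-tree
 (cxml:parse "<a><b>hi</b><c/></a>" (cxml-dom:make-dom-builder)))
```

Starting from the document node rather than the root element shows the whole tree, including the document node itself.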
After checking out the CXML example, create a new XML file with this book example and inspect the parts of the XML file. Do you see three children of the book element?
XML quick review
There are two main ways to process XML:
- SAX parsing is appropriate for large files, particularly if you're only going to extract relatively small amounts of information. SAX parsing is more complicated to use because you have to define call-back functions to respond to different XML parsing events.
- DOM processing is more appropriate for small files. You call a function (often a SAX parser) to construct a Document Object Model (DOM) tree of all the XML elements in a file. You can then extract the data you need from the DOM tree.
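To make the contrast concrete, here is a minimal sketch of the callback style, using CXML's SAX interface to count elements without building a tree. The class element-counter and the function count-elements are invented for this example.

```lisp
(ql:quickload "cxml")

;; A hypothetical handler that only counts start tags.
(defclass element-counter (sax:default-handler)
  ((count :initform 0 :accessor element-count)))

;; Callback: the parser calls this once for each element it opens.
(defmethod sax:start-element ((handler element-counter)
                              namespace-uri local-name qname attributes)
  (declare (ignore namespace-uri local-name qname attributes))
  (incf (element-count handler)))

(defun count-elements (xml-string)
  "Parse XML-STRING with an ELEMENT-COUNTER and return the element count."
  (let ((handler (make-instance 'element-counter)))
    (cxml:parse xml-string handler)
    (element-count handler)))

(count-elements "<a><b/><b/></a>")
```

With DOM processing, the same parse call would instead be given (cxml-dom:make-dom-builder) as its handler, and you would inspect the tree it returns.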
For our purposes, DOM parsing will suffice for now.
A DOM document consists of
- an optional DOCTYPE declaration that names the document type and its DTD; this is like the DOCTYPE declaration in HTML files. (The XML version is given separately, in the XML declaration at the top of the file.)
- a root XML element; a well-formed XML file can have only one top XML element
An XML DOM is a tree data structure. The central concept is the XML node. XML elements and documents are subtypes of XML nodes. XML elements are a recursive data structure, consisting of
- A tag name
- zero or more attribute-value pairs
- a body (possibly empty) with text and/or nested XML nodes
Here's an example of a small XML root element in the form it might be found in a file.
<book title="The Moon is Blue" published="1953">
  <author>John Steinbeck</author>
</book>
The root element has the tag name book. It has two attribute-value pairs. Note that values are always strings. The body has another XML element with the tag name author. The author element has no attributes. It has a body consisting of an implicit text node with the data John Steinbeck.
A common mistake when processing XML is forgetting about text nodes. The book element actually has three child nodes: a text node with the whitespace between the closing > on the first line and the opening < on the second line, the author element, and another text node for the whitespace between the closing > on the second line and the opening < on the third line.
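You can check this with CXML. The sketch below parses the book example from a string and collects the node name of each child of the root. The names *book-xml* and child-node-names are hypothetical. dom:node-name is used instead of dom:tag-name because text nodes have no tag name.

```lisp
(ql:quickload "cxml")

;; The book example, written across three lines as it would be in a file.
(defparameter *book-xml*
  "<book title=\"The Moon is Blue\" published=\"1953\">
  <author>John Steinbeck</author>
</book>")

(defun child-node-names (node)
  "Return a Lisp list of the node names of the children of NODE."
  (let ((names '()))
    (dom:do-node-list (child (dom:child-nodes node))
      (push (dom:node-name child) names))
    (nreverse names)))

(child-node-names
 (dom:document-element
  (cxml:parse *book-xml* (cxml-dom:make-dom-builder))))
```

The result should have three entries, with the author element in the middle and a text node on either side.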
Need more information? Try this tutorial from IBM.