It's almost the end of 2009, and I have to ask: are we through dealing with XML yet?
Although many of us wish we could consume the web through a magic programmer portal that shields us and our code from all the pointy angle brackets, the reality that is the legacy of HTML, Atom and RSS on the web leaves us little choice but to soldier on. So let's take a look at what Ruby-colored armor is available to us as we continue our quest to slay the XML dragons.
Historically, Ruby has had a number of options for dealing with structured markup, though oddly none have reached a solid consensus among Ruby developers as the “go to” library. The earliest available library seems to be Yoshida Masato's XMLParser, which wraps Expat and was first released around the time that Expat itself was released, back in 1998. A pure Ruby parser by Jim Menard called NQXML appeared in 2001, though it never matured to the level of a robust XML parser.
In late 2001, Matz expressed his desire for out-of-the-box XML support, but sadly, nothing appeared in Ruby's standard library until 2003, when REXML was imported for the 1.8.0 release. After reading bike-shed discussions like this one on ruby-talk in November 2001, or this Wayback Machine page from the old RubyGarden wiki, it's not hard to see why. Meanwhile, other language runtimes, such as Python and Java, moved along and built solid, acceptable foundations, making Ruby's omission seem more glaring.
But all was not lost: Ruby has always had a quality without a name that made it a great language for distilling an API. All that was needed was an infusion of interest and talent in Ruby, and a few more experiments and iterations.
Fast forward to the present time, and all those chips have fallen. We've seen evolution from REXML to libxml-ruby to Hpricot, and finally to Nokogiri. So, is the XML landscape on Ruby so dire? Certainly not, as you'll see by the end of this article! While the standard library support for XML hasn't progressed beyond REXML yet, state-of-the-art solutions are a few keystrokes away.
A big part of what makes XML such a pain to work with is the APIs. We Rubyists tend to have an especially low tolerance for friction in API design, and we really feel it when we work with XML. If XML is just a tree structure, why isn't navigating it as simple and elegant as traversing a Ruby
Instead, we'll take a brief exploratory tour of some Ruby XML APIs using code examples. Though some of the examples may seem trivially short, don't underestimate their power. Conciseness and readability are Ruby's gifts to the library authors and they're being put to good use.
The libraries we'll use for comparison are REXML, Nokogiri, and JAXP, Java's XML parsing APIs (via JRuby).
The simplest possible thing to do in XML is to hand the library some XML and get back a document.
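As a minimal sketch of that first step, here is how it might look with REXML from the standard library. The sample feed is made up for illustration, and StringIO merely stands in for a File or socket. Nokogiri's equivalent entry point is Nokogiri::XML, which takes the same kinds of input.

```ruby
require "rexml/document"
require "stringio"

# Hypothetical sample feed, used only for illustration.
atom_xml = <<~XML
  <?xml version="1.0" encoding="UTF-8"?>
  <feed xmlns="http://www.w3.org/2005/Atom">
    <title>Example Feed</title>
    <entry><title>First post</title></entry>
  </feed>
XML

# REXML::Document.new accepts a String...
doc_from_string = REXML::Document.new(atom_xml)

# ...or an IO-like object (StringIO stands in for a File here).
doc_from_io = REXML::Document.new(StringIO.new(atom_xml))

puts doc_from_string.root.name
puts doc_from_io.root.name
```

Either way, the result is a document object whose root element is the feed.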
Both REXML and Nokogiri more or less get this right. What's also nice is that they both transparently accept either an IO-like object or a string. Contrast this to Java:
In that familiar Java style, the JAXP approach forces you to choose from many options and write more code for the happy path. JRuby helps you a little bit by converting a Ruby string into a Java string, but needs a little help with intent for converting an IO to a Java
Now that we've got a document object, let's query it via XPath, assuming the underlying format is an Atom feed. Here is the code to grab the entries' titles and store them as an array of strings:
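A rough sketch of the REXML side, assuming a small made-up Atom feed and an arbitrary "atom" prefix mapped to the Atom namespace URI. It uses REXML::XPath.match and then REXML::Text#value to get entity-decoded Ruby strings.

```ruby
require "rexml/document"

# Hypothetical sample feed; the first title contains an entity on purpose.
atom_xml = <<~XML
  <feed xmlns="http://www.w3.org/2005/Atom">
    <title>Example Feed</title>
    <entry><title>Ruby &amp; XML</title></entry>
    <entry><title>Second post</title></entry>
  </feed>
XML

doc = REXML::Document.new(atom_xml)

# The prefix "atom" is our own choice; only the namespace URI has to match.
atom_ns = { "atom" => "http://www.w3.org/2005/Atom" }

# XPath.match returns REXML::Text nodes for a text() expression.
text_nodes = REXML::XPath.match(doc, "//atom:entry/atom:title/text()", atom_ns)

# Text#value yields the text with entities replaced.
titles = text_nodes.map(&:value)
p titles
```

With Nokogiri, the comparable call would be the document's own xpath method with the same expression and namespace hash, followed by mapping to_s over the result.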
Again, both REXML and Nokogiri clock in at similar code sizes, but subtle differences begin to emerge. Nokogiri's use of #xpath as an instance method on the document object feels more natural as a way of drilling down for further detail. Also, note that both APIs return DOM objects for the text, so we need to take one more step to convert them to pure Ruby strings. Here, Nokogiri's use of the standard String#to_s method is more intuitive; REXML::Text's version returns the raw text without the entities replaced.
Unfortunately, doing XPath in Java gets a bit more complicated. First we need to construct an XPath object. At least JRuby helps us a bit here: we can create an instance of the NamespaceContext interface completely in Ruby, and omit the methods we don't care about.
Next, we evaluate the expression and construct the array titles:
That last bit where we need to externally iterate the DOM API is particularly un-Ruby-like. With JRuby we can mix in some methods to the NodeList class:
And replace the external iteration with a more natural internal one:
This kind of technique becomes fairly common when coding Ruby against Java libraries in JRuby. Fortunately, Ruby makes it simple to hide away the ugliness in the Java APIs!
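To illustrate the mix-in idea without needing a JVM, here is a sketch that applies it to a plain Ruby stand-in. FakeNodeList and IndexedEach are hypothetical names invented for this example. The stand-in mimics only the length and item(index) methods of a DOM NodeList; under JRuby the same module would be included into the real Java class instead.

```ruby
# Stand-in for a DOM NodeList: it only exposes length and item(index).
class FakeNodeList
  def initialize(items)
    @items = items
  end

  def length
    @items.length
  end

  def item(index)
    @items[index]
  end
end

# The mix-in: define each in terms of length/item and pull in Enumerable.
module IndexedEach
  include Enumerable

  def each
    return enum_for(:each) unless block_given?
    length.times { |i| yield item(i) }
    self
  end
end

FakeNodeList.include(IndexedEach)

nodes = FakeNodeList.new(%w[alpha beta gamma])

# Internal iteration now works like any other Ruby collection.
upcased = nodes.map(&:upcase)
p upcased
```

Once each is defined, map, select, to_a and the rest of Enumerable come for free.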
Walking the DOM
Say we'd like to explore the DOM. Both REXML and Nokogiri provide multiple ways of doing this, with parent/child/sibling navigation methods. They also each sport a recursive descent method, which is quite convenient.
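As a small sketch of recursive descent with REXML, the snippet below uses each_recursive, which visits element nodes in document order. The sample document is made up. Nokogiri's counterpart is the traverse method on its node objects.

```ruby
require "rexml/document"

# Hypothetical sample document, used only for illustration.
xml = <<~XML
  <feed>
    <title>Example Feed</title>
    <entry>
      <title>First post</title>
    </entry>
  </feed>
XML

doc = REXML::Document.new(xml)

# each_recursive yields every descendant element, parents before children.
names = []
doc.each_recursive { |element| names << element.name }
p names
```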
Needless to say, Java's DOM API has no such convenience method, so we have to write one. But again, JRuby makes it easy to Rubify the code. Note that our #traverse method makes use of our Enumerable-ization of NodeList above as well.
All three libraries have a pull parser (also called a stream parser or reader) as well. Pull parsers are efficient because they behave like a cursor scrolling through the document, but usually result in more verbose code because of the need to implement a small state machine on top of lower-level XML events. They are best employed on very large documents where it's impractical to store the entire DOM tree in memory at once.
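To make the "small state machine" concrete, here is a sketch using REXML's pull parser to collect entry titles. The sample feed and the two boolean flags are assumptions of this example, not code from the benchmark suite.

```ruby
require "rexml/parsers/pullparser"

# Hypothetical sample feed, used only for illustration.
atom_xml = <<~XML
  <feed xmlns="http://www.w3.org/2005/Atom">
    <title>Example Feed</title>
    <entry><title>First post</title></entry>
    <entry><title>Second post</title></entry>
  </feed>
XML

parser = REXML::Parsers::PullParser.new(atom_xml)

titles   = []
in_entry = false  # state: are we inside an <entry>?
in_title = false  # state: are we inside an entry's <title>?

while parser.has_next?
  event = parser.pull
  if event.start_element?
    case event[0]
    when "entry" then in_entry = true
    when "title"
      if in_entry
        in_title = true
        titles << String.new  # start accumulating this title's text
      end
    end
  elsif event.text?
    titles.last << event[0] if in_title
  elsif event.end_element?
    case event[0]
    when "title" then in_title = false
    when "entry" then in_entry = false
    end
  end
end

p titles
```

The feed-level title is skipped because the in_entry flag is still false when it goes by, which is exactly the kind of bookkeeping a DOM plus XPath would otherwise do for us.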
(Aside to the Nokogiri team: where are the reader node type constants?)
Not surprisingly, all three pull parser examples end up looking very similar. The subtleties of the pull parser APIs end up getting blurred in the loops and conditionals. Only write this code when you have to.
At the end of the day, it comes down to performance, doesn't it? Although the topic of Ruby XML parser performance has been discussed before, I thought it would be instructive to do another round of comparisons with JRuby and Ruby 1.9 thrown into the mix.
The benchmarks were run in the following environment:
- Mac OS X 10.5 on a MacBook Pro 2.53 GHz Core 2 Duo
- Ruby 1.8.6p287
- Ruby 1.9.1p243
- JRuby 1.5.0.dev (rev c7b3348) on Apple JDK 5 (32-bit)
- Nokogiri 1.4.0
- libxml2 2.7.3
Here are results comparing Nokogiri and Hpricot on the three implementations along with the JAXP version which only runs on JRuby (smaller is better).
The REXML results were over an order of magnitude slower, so it's easier to view them on a separate graph. Note the number of iterations here is 100 vs. 1000 for the results above.
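For readers who want a rough timing of their own, a minimal harness built on Ruby's standard Benchmark module might look like the sketch below. This is not the benchmark code behind the results discussed here: the synthetic document, its size, and the iteration count are all assumptions, and only REXML parsing is timed.

```ruby
require "benchmark"
require "rexml/document"

# Synthetic document: 50 tiny entries (size chosen arbitrarily).
sample_xml = "<feed>" + ("<entry><title>t</title></entry>" * 50) + "</feed>"
iterations = 100

# Benchmark.realtime returns the elapsed wall-clock time in seconds.
elapsed = Benchmark.realtime do
  iterations.times { REXML::Document.new(sample_xml) }
end

printf("REXML: %d parses in %.3f s\n", iterations, elapsed)
```

Swapping the block body for another library's parse call gives a like-for-like comparison on the same input, though a single run like this says little about variance or warm-up effects on JRuby.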
While these results don't paint a complete picture of XML parser performance, they should give you enough of a guideline to make a decision on which parser to use once you take portability and readability into account. In summary:
- Use REXML when your parsing needs are minimal and you want the widest portability (across all implementations) with the smallest install footprint.
- Use JRuby with the JAXP APIs for portability across any operating system that supports the Java platform (including Google AppEngine).
- Use Nokogiri for everything else. It's the fastest implementation, and produces the most programmer-friendly code of all Ruby XML parsers to date.
(As a footnote, we on the Nokogiri and JRuby teams are looking for community help to further develop the pure-Java backend for Nokogiri so that AppEngine and other JVM deployment scenarios that don't allow loading native code can benefit from Nokogiri's awesomeness. Please leave a comment or contact the JRuby team on the mailing list if you're interested.)
The source code for this article is available if you'd like to examine the code or run the benchmarks yourself. Keep an eye on the Engine Yard blog for an upcoming post on Nokogiri, and as always, leave questions and thoughts in the comments!