XPath is actually pretty useful once it stops being confusing

Mat Brown

Tech


I first met XPath in 2007, but we didn't become friends until just recently. For the most part I had avoided it; when forced to use it, I made do with trial and error. XPath just didn't really make sense to me. But then I came across a peculiar parsing problem (too complex for CSS selectors, too simple to warrant hand-rolled code) and decided to give XPath another shot. I discovered, much to my surprise and glee, that it does make sense, and once it makes sense, it's actually quite useful. This is my story.

The problem

Say you're working on a website full of song lyrics, and in order to maintain a consistent reading experience, you want to capitalize the first letter of every line. If the lyrics are stored in plain text this is pretty straightforward:

    lyrics.gsub!(/^./) { |character| character.upcase }

But it gets more interesting if the lyrics are stored as an HTML fragment. A DOM doesn't have any built-in concept of "lines." You can't just break it up with a simple regular expression. So the first thing we'll need to do is define, for ourselves, what "the beginning of a line" means in a DOM. Here's a simple version:

- The first text node inside a <p> tag
- The first text node following a <br> tag

So, in the simplest case:

    <p>This is the beginning of a line.<br>This is too.</p>

But we also want to handle nested inline elements:

This is the beginning of a line. This is not.
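To see why the plain-text one-liner can't simply be reused on markup, here is a minimal sketch. The helper name `capitalize_line_starts` and both sample strings are made up for illustration; only the `gsub` call comes from the plain-text example earlier.

```ruby
# Hypothetical helper wrapping the plain-text one-liner (non-mutating variant).
def capitalize_line_starts(text)
  text.gsub(/^./) { |character| character.upcase }
end

# Plain text: every lyric line starts right after a newline, so ^ finds it.
plain = "you take the high road\nand i'll take the low road"
puts capitalize_line_starts(plain)

# HTML fragment (invented sample): the only position ^ matches is the "<" of
# the opening tag, and the text after <br> never sits at the start of a line.
html = "<p>you take the high road<br>and i'll take the low road</p>"
puts capitalize_line_starts(html)
```

The first call capitalizes both lines; the second hands the fragment back unchanged, which is why "the beginning of a line" has to be defined in terms of DOM nodes rather than newline characters.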

I'll take the low road

My first instinct was to just write a Ruby method to scan over relevant parts of the DOM and recursively seek out text nodes that fit our criteria. I used some very light CSS selectors, but nothing too fancy:

    def each_new_line(document)
      document.css('p').each { |p| yield first_text_node(p) }
      document.css('br').each { |br| yield first_text_node(br.next) }
    end

    def first_text_node(node)
      if node.nil? then nil
      elsif node.text? then node
      elsif node.children.any? then first_text_node(node.children.first)
      end
    end

This is a perfectly reasonable solution, but it's a whopping 11 lines of code. Further, it feels like we're using the wrong tool for the job: why are we using Ruby iterators and conditionals to get at DOM nodes? Can we do better?

Enter XPath

XPath is confusing for a couple of reasons. The first is that there are surprisingly few good references on the Internet (don't even think about looking at W3Schools!). The best doc I've found is the W3C specification itself. The second is that XPath looks deceptively like CSS. The word "path" is right there in the name, and so I had always assumed, mistakenly, that the / in an XPath expression plays the same role as the > in a CSS selector:

    document.xpath('//p/em/a') == document.css('p > em > a')

As it turns out, the XPath expression involves a lot of shorthand, which we'll want to explode in order to really understand what's going on. Here's the same expression written out in longhand:

    /descendant-or-self::node()/child::p/child::em/child::a

This XPath expression and the CSS selector above are equivalent, but not for the reason I had always assumed. An XPath expression consists of one or more "location steps" separated by forward slashes. The / at the beginning means the context of the first step is the root node of the document.
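If you want to convince yourself that the shorthand and longhand forms really select the same nodes, you can run both against a small document. The sketch below uses REXML (bundled with Ruby) instead of Nokogiri, purely so it runs without extra gems. The sample markup is invented, and REXML has no CSS selector support, so only the two XPath forms are compared.

```ruby
require 'rexml/document'

# Invented sample: one <a> that is a direct child of p > em, and one nested
# a level deeper, which the child steps should not reach.
xml = <<~XML
  <div>
    <p><em><a href="#one">direct</a></em></p>
    <p><em><span><a href="#two">nested deeper</a></span></em></p>
  </div>
XML
document = REXML::Document.new(xml)

shorthand = REXML::XPath.match(document, '//p/em/a')
longhand  = REXML::XPath.match(
  document,
  '/descendant-or-self::node()/child::p/child::em/child::a'
)

# Compare the matches by their href attribute values.
shorthand_hrefs = shorthand.map { |a| a.attributes['href'] }
longhand_hrefs  = longhand.map { |a| a.attributes['href'] }

puts shorthand_hrefs.inspect
puts shorthand_hrefs == longhand_hrefs
```

Both expressions pick up only the first link: the leading `//` lets the match start anywhere in the document, but each later step moves exactly one level down, so the link wrapped in an extra `<span>` is skipped.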
Each location step knows which nodes have already been matched, and uses that context to answer three questions:

Where do I want to move from the current context?

This is called the axis, and it's optional. The default axis is child, meaning "select all of the children of the currently selected nodes." In the above example, descendant-or-self is the axis for the first location step, meaning "all of the currently selected nodes and all of their descendants." Most of the axes defined by the XPath spec are likewise intuitively named.

What sort of nodes do I want to select?

Am I selecting <p> tags, text nodes, or is it a free-for-all? This is specified by the node test, which is the only required part of the location step. In our above example, node() is the most permissive node test: it selects everything. text() would only select text nodes; element() would only select elements (that form is XPath 2.0 syntax; in XPath 1.0, which is what Nokogiri supports, * plays that role); and explicitly specified node names like p and em above, of course, would only select elements with those names.

Are there additional filters I want to add?

Maybe I only want to select the first child of every node in the current context, or I only want to select