Saturday, February 5, 2011

HtmlAgilityPack and XPath Peculiarity

While parsing several HTML pages, I noticed that HtmlAgilityPack does not treat an XPath expression that starts with "//" as relative to the node it is called on. The following code illustrates this:

var html = @"
<div class=""..."">
    <p>header <span>paragraph 1-1</span></p>
    <p>header <span>paragraph 1-2</span></p>
</div>
<div class=""..."">
    <p>content <span>paragraph 2-1</span></p>
    <p>content <span>paragraph 2-2</span></p>
</div>";

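var doc = new HtmlDocument();
doc.LoadHtml(html);

// Context node: the first <p> inside the first <div>
var node = doc.DocumentNode.SelectSingleNode("div[1]/p[1]");

Console.WriteLine("\r\n1st <p> in 1st <div>:");
Console.WriteLine(node.OuterHtml);

Console.WriteLine("\r\nCount of <span> (//):");
Console.WriteLine(node.SelectNodes("//span").Count);

Console.WriteLine("\r\nCount of <span> (.//):");
Console.WriteLine(node.SelectNodes(".//span").Count);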

It produces the output:

1st <p> in 1st <div>:
<p>header <span>paragraph 1-1</span></p>

Count of <span> (//):
4

Count of <span> (.//):
1

w3schools says that "//" selects nodes "from the current node". So does this mean that HtmlAgilityPack works incorrectly?

Having learned XPath from w3schools, I had no doubt that it did. But the W3C specification says that this behavior is correct:

//para selects all the para descendants of the document root and thus selects all para elements in the same document as the context node
.//para selects the para element descendants of the context node
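
The same distinction can be checked without HtmlAgilityPack. The sketch below is only an illustration under a few assumptions: it uses XmlDocument from System.Xml instead of HtmlDocument, wraps the markup in a <root> element (not part of the original snippet) so that it is well-formed XML, and leaves out the class attributes. The names XPathContextDemo and CountFromFirstParagraph are made up for this example.

```csharp
using System;
using System.Xml;

public static class XPathContextDemo
{
    // Well-formed XML version of the markup above. The <root> wrapper is an
    // assumption added for this sketch; class attributes are left out.
    private const string Xml = @"<root>
  <div>
    <p>header <span>paragraph 1-1</span></p>
    <p>header <span>paragraph 1-2</span></p>
  </div>
  <div>
    <p>content <span>paragraph 2-1</span></p>
    <p>content <span>paragraph 2-2</span></p>
  </div>
</root>";

    // Counts the nodes matched by the given XPath expression, evaluated with
    // the first <p> of the first <div> as the context node.
    public static int CountFromFirstParagraph(string xpath)
    {
        var doc = new XmlDocument();
        doc.LoadXml(Xml);
        XmlNode node = doc.SelectSingleNode("/root/div[1]/p[1]");
        return node.SelectNodes(xpath).Count;
    }

    public static void Main()
    {
        // "//span" starts at the document root, whatever the context node is.
        Console.WriteLine(CountFromFirstParagraph("//span"));
        // ".//span" only looks at descendants of the context node.
        Console.WriteLine(CountFromFirstParagraph(".//span"));
    }
}
```

Under these assumptions the first call should count all four <span> elements and the second only the one inside the selected <p>, which matches what HtmlAgilityPack printed above.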

All I want to say is that you should be cautious about the information you get, even if it comes from a popular site with a good reputation (as w3schools is). "Trust no one", as the Horde says :).

