While parsing several HTML pages, I noticed that HtmlAgilityPack does not treat an XPath expression starting with "//" as relative to the node it is called on. The following code illustrates this:
var html = @"
<div class=""header"">
<p>header <span>paragraph 1-1</span></p>
<p>header <span>paragraph 1-2</span></p>
</div>
<div class=""content"">
<p>content <span>paragraph 2-1</span></p>
<p>content <span>paragraph 2-2</span></p>
</div>
";
var doc = new HtmlDocument();
doc.LoadHtml(html);
var node = doc.DocumentNode.SelectSingleNode("div[1]/p[1]");
Console.WriteLine("\r\n1st <p> in 1st <div>:");
Console.WriteLine(node.OuterHtml);
Console.WriteLine("\r\nCount of <span> (//):");
Console.WriteLine(node.SelectNodes("//span").Count);
Console.WriteLine("\r\nCount of <span> (.//):");
Console.WriteLine(node.SelectNodes(".//span").Count);
It produces the output:
1st <p> in 1st <div>:
<p>header <span>paragraph 1-1</span></p>
Count of <span> (//):
4
Count of <span> (.//):
1
w3schools.com says that "//" selects nodes "from the current node". So does this mean that HtmlAgilityPack is wrong?
Having learned XPath from w3schools.com, I had no doubt that it was. But the W3C specification shows that HtmlAgilityPack behaves correctly:
//para selects all the para descendants of the document root and thus selects all para elements in the same document as the context node
.//para selects the para element descendants of the context node
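To check that this is standard XPath behaviour rather than something specific to HtmlAgilityPack, here is a minimal sketch that repeats the experiment with the built-in System.Xml classes. It assumes the same markup, wrapped in a hypothetical <root> element (with the last <div> closed) so that it parses as well-formed XML. The variable names and the extra descendant::span query are mine, added for illustration only.

```csharp
using System;
using System.Xml;

// Same markup as in the example, wrapped in a single <root> element
// (an assumption made here so that XmlDocument accepts it as XML).
var xml = @"<root>
<div class=""header"">
<p>header <span>paragraph 1-1</span></p>
<p>header <span>paragraph 1-2</span></p>
</div>
<div class=""content"">
<p>content <span>paragraph 2-1</span></p>
<p>content <span>paragraph 2-2</span></p>
</div>
</root>";

var xmlDoc = new XmlDocument();
xmlDoc.LoadXml(xml);

// Context node: the first <p> inside the first <div>.
var context = xmlDoc.SelectSingleNode("/root/div[1]/p[1]")!;

// "//span" is an absolute path: it starts at the document root,
// no matter which node SelectNodes is called on.
int absoluteCount = context.SelectNodes("//span")!.Count;

// ".//span" starts at the context node and searches its descendants.
int relativeCount = context.SelectNodes(".//span")!.Count;

// "descendant::span" has no leading slash, so it is also relative.
int axisCount = context.SelectNodes("descendant::span")!.Count;

Console.WriteLine($"//span:           {absoluteCount}"); // 4
Console.WriteLine($".//span:          {relativeCount}"); // 1
Console.WriteLine($"descendant::span: {axisCount}");     // 1
```

The counts match what HtmlAgilityPack printed, so when you want only the nodes below the current one, start the expression with "." or use a relative axis such as descendant::.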
All I want to say is that you must be cautious about the information you get, even if it comes from a popular site with a good reputation (like w3schools). "Trust no one", as Horde says :).