Following the discussion in "CSS Selector in Perl", I wrote a quick Perl module, HTML::Selector::XPath, which is now available at http://svn.bulknews.net/repos/public/HTML-Selector-XPath/trunk/.
The code is based on the JavaScript code at http://dev.rubyonrails.org/ticket/5171, which looks a little buggy, and was slightly modified using the more accurate table at http://plasmasturm.org/log/444/ (Thanks Aristotle!)
See the test script 02_html.t for how to use this module together with HTML::TreeBuilder::XPath (yeah, I plan to release a glue module for H::TB::XPath anyway) to extract content from (X)HTML using XPath expressions. Now your scraping code is hopefully free from nasty regexps!
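As a rough illustration of that combination, here is a minimal sketch. It is not the contents of 02_html.t; the sample markup and selector are made up for the example, and it assumes both modules are installed and that HTML::Selector::XPath exposes the new/to_xpath interface.

```perl
use strict;
use warnings;
use HTML::Selector::XPath;
use HTML::TreeBuilder::XPath;

# Hypothetical sample markup, not taken from the module's test suite.
my $html = <<'HTML';
<html><body>
<ul id="menu">
  <li class="item">Home</li>
  <li class="item">About</li>
</ul>
</body></html>
HTML

# Translate the CSS selector into an XPath expression.
my $xpath = HTML::Selector::XPath->new('ul#menu li.item')->to_xpath;

# Parse the document and evaluate the expression against the tree.
my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse($html);
$tree->eof;

# Collect the text of every matching element.
my @items = map { $_->as_text } $tree->findnodes($xpath);
print "$_\n" for @items;

# HTML::TreeBuilder trees should be freed explicitly.
$tree->delete;
```

The selector is the only scraping-specific part; everything else is ordinary tree building and XPath evaluation.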
I'll upload this module to CPAN shortly but give it a shot if you're interested.
CSS3 support (Score:1)
Hmm, it seems entirely possible to express all of CSS 3 in terms of XPath 1.0; no XPath 2.0 required.
I just haven’t gotten to it – honestly, because I was too lazy. CSS 3 has many more syntax elements than CSS 2, and the new ones are much more complex, so it’s not quite the same kind of 5-minute job.
Re: (Score:1)
Re: (Score:1)
Seems to me that a [not(subexpr)] predicate should work. The only trick is to get any references to the context node right in subexpr, I suppose by using self::* or something.
Actually, now that you have written the module I may get around to it sooner, since there are working unit tests in there…
Re: (Score:1)
Re: (Score:1)
Add a case for *:not(p) and see if that works. The correct translation should be *[not(self::p)], I think.
Re: (Score:1)
To support that I would have to rewrite the parser algorithm somehow, and that will happen when I decide to do complete CSS 3 selector support. For now it'll croak.
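For reference, the *:not(p) to *[not(self::p)] translation suggested above can be sketched in a few lines of core Perl. This is only an illustration under narrow assumptions: it handles a single tag (or *) followed by one :not() with a simple argument, and both helper names are hypothetical, not part of HTML::Selector::XPath.

```perl
use strict;
use warnings;

# Hypothetical helper: turn one simple selector (tag, #id or .class)
# into an XPath predicate that tests the context node itself.
sub simple_to_predicate {
    my ($simple) = @_;
    return "self::$1" if $simple =~ /^([A-Za-z][\w-]*)$/;
    return "\@id='$1'" if $simple =~ /^#([\w-]+)$/;
    return "contains(concat(' ', normalize-space(\@class), ' '), ' $1 ')"
        if $simple =~ /^\.([\w-]+)$/;
    die "unsupported simple selector: $simple\n";
}

# Hypothetical helper: translate "tag:not(simple)" or "*:not(simple)".
# Anything more complex is rejected rather than guessed at.
sub not_to_xpath {
    my ($selector) = @_;
    my ($tag, $arg) =
        $selector =~ /^(\*|[A-Za-z][\w-]*):not\(\s*([^()\s]+)\s*\)$/
        or die "unsupported selector: $selector\n";
    return "//$tag\[not(" . simple_to_predicate($arg) . ")]";
}

print not_to_xpath('*:not(p)'), "\n";       # //*[not(self::p)]
print not_to_xpath('div:not(.ad)'), "\n";
```

The point is that the argument of :not() becomes a test on the context node (self::p, @id, @class) wrapped in not(), rather than a new location step.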
Why not XML::LibXML in HTML mode? (Score:2)
Re: (Score:1)
I just picked HTML::TreeBuilder::XPath because I thought it'd be more relaxed about handling non-well-balanced HTML. XML::LibXML::Parser says "HTML (strict) documents" and that makes me a little nervous.
Re: (Score:1)
libxml2’s HTML mode is lenient, but not very lenient. It’s not that hard to make it choke. For processing your own stuff (or for generally markup-sparse things like weblog posts or comments or such) it’s fine, but out there on the open web it doesn’t cut it.
I prefer using HTMLTidy to beat things into shape, configured to give me XHTML, which I can then parse with a strict XML parser.
TagSoup also works. (Someone should port that one to Perl and/or C…)