Journal of miyagawa (1653)

Monday October 02, 2006
11:54 AM

HTML::Selector::XPath

[ #31195 ]

Following the discussion in "CSS Selector in Perl", I wrote a quick Perl module, HTML::Selector::XPath, which is now available at http://svn.bulknews.net/repos/public/HTML-Selector-XPath/trunk/.

The code is based on the JavaScript code at http://dev.rubyonrails.org/ticket/5171, which looks a little buggy, so I modified it slightly using the more accurate table at http://plasmasturm.org/log/444/ (Thanks, Aristotle!)

See the test file 02_html.t for how to use this module together with HTML::TreeBuilder::XPath (yeah, I plan to release a glue module for H::TB::XPath anyway) to extract content from (X)HTML using XPath expressions. Now your scraping code is hopefully free from nasty regexps!
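To illustrate the idea of turning a CSS selector into an XPath expression and then querying a parsed document with it, here is a minimal sketch. It is written in Python with only the standard library so that it stays self-contained; it is not the module's Perl code, and `css_to_xpath` is a hypothetical helper that handles only a tag, an optional `#id`, an optional single `.class`, and the descendant combinator. The class test uses a commonly seen XPath idiom, which is an assumption here rather than something taken from the module.

```python
import re
import xml.etree.ElementTree as ET

# One compound selector: optional tag, optional #id, optional single .class.
# This tiny grammar is an assumption for illustration, not the module's parser.
_COMPOUND = re.compile(
    r"^(?P<tag>[A-Za-z][\w-]*|\*)?(?:#(?P<id>[\w-]+))?(?:\.(?P<cls>[\w-]+))?$"
)


def css_to_xpath(selector):
    """Translate e.g. "div#content p" into "//div[@id='content']//p"."""
    steps = []
    for part in selector.split():
        m = _COMPOUND.match(part)
        if m is None:
            raise ValueError(f"unsupported selector: {part!r}")
        step = m.group("tag") or "*"
        if m.group("id"):
            step += f"[@id='{m.group('id')}']"
        if m.group("cls"):
            # Match one token in a space-separated class attribute.
            step += (
                "[contains(concat(' ', normalize-space(@class), ' '), "
                f"' {m.group('cls')} ')]"
            )
        steps.append(step)
    if not steps:
        raise ValueError("empty selector")
    # Whitespace between compound selectors is the descendant combinator.
    return "//" + "//".join(steps)


doc = ET.fromstring(
    "<html><body>"
    "<div id='content'><p>first</p><p>second</p></div>"
    "<div id='footer'><p>ignored</p></div>"
    "</body></html>"
)

xpath = css_to_xpath("div#content p")
# ElementTree only accepts relative paths, hence the leading ".".
texts = [p.text for p in doc.findall("." + xpath)]
```

ElementTree implements only a small subset of XPath (it cannot evaluate `contains()` or `not()`), so the query above sticks to a tag and an `id` test; a full XPath engine would be needed for the class predicate.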

I'll upload this module to CPAN shortly, but give it a shot now if you're interested.

  • Hmm, it seems entirely possible to express all of CSS 3 in terms of XPath 1.0; no XPath 2.0 required.

    I just haven’t gotten to it – honestly, because I was too lazy. CSS 3 has many more syntax elements than CSS 2 and the new ones are much more complex, so it’s not quite the same kind of 5-minute job.

    • Really? That sounds great. I was translating the CSS 3 :not() selector but couldn't find how to map it to XPath 1.0 without using :not(). Maybe I'm missing something obvious?
      • Seems to me that a [not(subexpr)] predicate should work. The only trick is to get any references to the context node right in subexpr, I suppose by using self::* or something.

        Actually, now that you have written the module I may get around to it sooner, since there are working unit tests in there…

        • Aha, cool. I've now fixed the handling of the :not() pseudo-class so that it maps to [not()], and it works. See the updated unit test [bulknews.net] to confirm. Thanks!
          • Add a case for *:not(p) and see if that works. The correct translation should be *[not(self::p)], I think.

            • It doesn't work, at least for now.

              To support that I'd have to rewrite the parser algorithm somehow, which I'll do when I decide to add complete CSS 3 selector support. For now it'll croak.
  • I've used XML::LibXML in HTML mode. It'll be far faster, and when you wrap it with the xsh language, it's even better (and xsh version 2.0 is getting some very neat features).
    --
    Randal L. Schwartz
    Stonehenge
    • Yeah, XML::LibXML in HTML mode would work, too.

      I just picked HTML::TreeBuilder::XPath because I thought it'd be more relaxed about handling non-well-balanced HTML. XML::LibXML::Parser says "HTML (strict) documents" and that makes me a little nervous :)
    • libxml2’s HTML mode is lenient, but not very lenient. It’s not that hard to make it choke. For processing your own stuff (or for generally markup-sparse things like weblog posts or comments or such) it’s fine, but out there on the open web it doesn’t cut it.

      I prefer using HTMLTidy to beat things into shape, configured to give me XHTML, which I can then parse with a strict XML parser.

      TagSoup also works. (Someone should port that one to Perl and/or C…)
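The :not() mapping worked out in the first comment thread, including the *:not(p) case, can be sketched as follows. This is a minimal Python illustration using only the standard library, not the module's Perl code; `not_to_xpath` and `negation_test` are hypothetical names, and the sketch assumes a single :not() with one simple selector (a tag, `#id`, or `.class`) as its argument.

```python
import re

# "<tag or *>:not(<simple selector>)"; the base may be omitted.
_NOT = re.compile(r"^(?P<base>[A-Za-z][\w-]*|\*)?:not\((?P<arg>[^()]+)\)$")


def negation_test(arg):
    """Return the XPath test that goes inside not(...) for one simple selector."""
    if arg.startswith("#"):
        return f"@id='{arg[1:]}'"
    if arg.startswith("."):
        return f"contains(concat(' ', normalize-space(@class), ' '), ' {arg[1:]} ')"
    if re.fullmatch(r"[A-Za-z][\w-]*", arg):
        # A type selector must test the context node itself, hence self::.
        return f"self::{arg}"
    raise ValueError(f"unsupported :not() argument: {arg!r}")


def not_to_xpath(selector):
    """Translate e.g. "*:not(p)" into "//*[not(self::p)]"."""
    m = _NOT.match(selector)
    if m is None:
        raise ValueError(f"unsupported selector: {selector!r}")
    base = m.group("base") or "*"
    return f"//{base}[not({negation_test(m.group('arg'))})]"


print(not_to_xpath("*:not(p)"))
print(not_to_xpath("div:not(#main)"))
```

The `self::` axis is what keeps the test anchored to the node being filtered; a bare `not(p)` would instead ask whether the node has no `p` child.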