All the Perl that's Practical to Extract and Report

  • Hmm, it seems entirely possible to express all of CSS 3 in terms of XPath 1.0; no XPath 2.0 required.

    I just haven’t gotten to it, honestly because I was too lazy. CSS 3 has many more syntax elements than CSS 2, and the new ones are much more complex, so it’s not quite the same kind of 5-minute job.

    • Really? That sounds great. I was translating the CSS 3 :not() selector but couldn't work out how to map it to XPath 1.0 without using :not(). Maybe I'm missing something obvious?
      • Seems to me that a [not(subexpr)] predicate should work. The only trick is to get any references to the context node right in subexpr, I suppose by using self::* or something.

        Actually, now that you have written the module I may get around to it sooner, since there are working unit tests in there…

        • Aha, cool. I've now fixed the handling of the :not() pseudo-class by mapping it to [not()], and that works. See the updated unit test [bulknews.net] to confirm. Thanks!
          • Add a case for *:not(p) and see if that works. The correct translation should be *[not(self::p)], I think.

            • It doesn't work, at least for now.

              Supporting that means rewriting the parser algorithm somehow, which I'll do when I decide to add complete CSS 3 selector support. For now it'll croak.
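
The mapping discussed in this thread, `E:not(F)` to `E[not(self::F)]`, can be sketched in a few lines. This is only an illustration in Python, not the code of the module under discussion: the function name `css_to_xpath`, the restriction to plain type selectors, and the leading `//` (search the whole document) are all assumptions made for the example.

```python
import re

# Hypothetical helper for illustration; not part of any module named in this thread.
# Accepts only "E" or "E:not(F)", where E and F are element names or "*".
_SIMPLE = re.compile(r'^(\*|[A-Za-z][\w-]*)(?::not\((\*|[A-Za-z][\w-]*)\))?$')


def css_to_xpath(selector):
    """Translate a very small subset of CSS selectors to XPath 1.0."""
    match = _SIMPLE.match(selector.strip())
    if match is None:
        # Mirrors the "for now it'll croak" behaviour described above.
        raise ValueError("unsupported selector: %r" % selector)
    element, negated = match.groups()
    xpath = "//" + element  # assumption: match anywhere in the document
    if negated is not None:
        # self::F tests the context node itself, so not(self::F)
        # keeps every candidate element that is not an F.
        xpath += "[not(self::%s)]" % negated
    return xpath


print(css_to_xpath("*:not(p)"))
```

The key point is that the negated name has to be tested against the context node via the `self::` axis; a bare `[not(p)]` would instead ask whether the element has no `p` child.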
  • I've used XML::LibXML in HTML mode. It'll be far faster, and when you wrap it with the xsh language, it's even better (and xsh version 2.0 is getting some very neat features).
    --
    Randal L. Schwartz
    Stonehenge
    • Yeah, XML::LibXML in HTML mode would work, too.

      I just picked HTML::TreeBuilder::XPath because I thought it'd be more relaxed about handling non-well-balanced HTML. XML::LibXML::Parser says "HTML (strict) documents" and that makes me a little nervous :)
    • libxml2’s HTML mode is lenient, but not very lenient. It’s not that hard to make it choke. For processing your own stuff (or for generally markup-sparse things like weblog posts or comments or such) it’s fine, but out there on the open web it doesn’t cut it.

      I prefer using HTMLTidy to beat things into shape, configured to give me XHTML, which I can then parse with a strict XML parser.

      TagSoup also works. (Someone should port that one to Perl and/or C…)
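
To make the "clean up first, then parse strictly" point concrete, here is a minimal sketch of why the cleanup step matters. It uses Python's standard-library XML parser as a stand-in for any strict XML parser; the helper name `parses_as_xml` is hypothetical, and the well-formed string is hand-written for the example, not actual HTMLTidy output.

```python
import xml.etree.ElementTree as ET


def parses_as_xml(markup):
    """Return True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# Typical tag soup: unclosed <p>, bare <br>.
tag_soup = "<p>one<br>two"

# Hand-written stand-in for what a cleanup pass might produce (assumption).
xhtml = "<p>one<br/>two</p>"

print(parses_as_xml(tag_soup))
print(parses_as_xml(xhtml))
```

A strict parser rejects the first string outright, which is why a repair step such as HTMLTidy or TagSoup has to run before it.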