All the Perl that's Practical to Extract and Report

  • Perl and XML (Score:5, Insightful)

    Your assertion does not accurately summarize my experiences with Perl and XML.

    First, lots of Perl programmers have embraced XML. There was a time when the only options for parsing XML were XML::Parser and a few half-finished attempts at doing something different. Today, there are many polished alternatives for processing XML, including the interchangeable PerlSAX framework, which mimics SAX in Java. In fact, some ideas crop up first in Perl (or rather in Barrie Slaymaker's head) before they ar

    • It's a pain to screen scrape an HTML page with Perl, but it's more of a pain to do it in Java.

      Matt Sergeant [perl.org], AxKit [axkit.org]'s father, cooked up a neat approach to this: use libxml2 (via XML::LibXML [cpan.org]) to parse the HTML in html and recover modes, then apply normal XML tools to it. I haven't tried it, but I'd like you to be able to do that and then use XML::Filter::Dispatcher [cpan.org] to pluck strings out of the resulting XML stream using rules like:

          'string( foo/p )' => sub { print "foo/p contains '", xvalu
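
      A minimal sketch of the parse-then-query idea described above, assuming a reasonably recent XML::LibXML (one that provides load_html). The sample markup and the parse_tag_soup helper are made up for illustration, and the query uses plain XPath on the DOM rather than XML::Filter::Dispatcher rules:

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical sample page: the <p> and <br> tags are never closed,
# which a strict XML parser would reject outright.
my $html = <<'HTML';
<html><body>
<div id="foo"><p>first paragraph<br>second line
<p>another paragraph
</div>
</body></html>
HTML

# Parse with libxml2's HTML parser. recover => 2 asks it to repair
# what it can and to stay quiet about the problems it finds.
sub parse_tag_soup {
    my ($markup) = @_;
    return XML::LibXML->load_html(
        string  => $markup,
        recover => 2,
    );
}

my $doc = parse_tag_soup($html);

# The result is an ordinary DOM, so normal XML tooling applies.
my $first = $doc->findvalue('string(//div[@id="foo"]/p[1])');
print "foo/p contains '$first'\n";
```

      The point is only that the messy HTML becomes a regular document object once libxml2 has cleaned it up; anything that consumes XML (XPath, XSLT, SAX filters) can take over from there.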

      • by ziggy (25) on 2003.02.05 18:59 (#16749) Journal
        Matt Sergeant, AxKit's father, cooked up a neat approach to this: use libxml2 (via XML::LibXML) to parse the HTML in html and recover modes, then apply normal XML tools to it.
        Matt's mentioned this on more than one occasion. I always thought that libxslt/xsltproc was "broken" in its support for parsing HTML. I don't know how I came to that conclusion, but it must have been based on an early release of libxslt.

        Anyway, later that day, on Matt's urging, I wrote a quick little XSLT stylesheet to grep out the important bits of a document and massaged it with xsltproc. Sure enough, it worked exactly like it was supposed to, exactly how it was documented. (I can't believe I held off on that for so very long...)

        I forget what project that was, or where the code is, or what exactly I was munging at the time. I do remember that I iteratively developed the stylesheet to emit a simple text format (a bunch of lines or something). The last step was embedding the stylesheet in the __DATA__ section of a Perl script and gluing/automating the process with some Perly bits.

        In a bizarre kind of way, it was sort of fun!
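
        A rough sketch of the "stylesheet in __DATA__, Perl as glue" arrangement described above, assuming xsltproc is on the PATH. The original script is lost, so everything here is illustrative: the run_stylesheet helper, the usage line, and the stylesheet itself (which just lists link targets, one per line) are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write the stylesheet text to a temp file and run xsltproc over an
# HTML file. --html makes xsltproc read the input with the HTML parser.
sub run_stylesheet {
    my ($xsl_text, $html_path) = @_;

    my ($fh, $xsl_path) = tempfile(SUFFIX => '.xsl', UNLINK => 1);
    print {$fh} $xsl_text;
    close $fh or die "cannot write $xsl_path: $!";

    open my $out, '-|', 'xsltproc', '--html', $xsl_path, $html_path
        or die "cannot run xsltproc: $!";
    chomp(my @lines = <$out>);
    close $out or die "xsltproc failed (exit status $?)";
    return @lines;
}

# Script mode: slurp the stylesheet from the __DATA__ section below.
unless (caller) {
    my $html_path = shift @ARGV or die "usage: $0 page.html\n";
    my $xsl_text  = do { local $/; <DATA> };
    print "$_\n" for run_stylesheet($xsl_text, $html_path);
}

__DATA__
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- Hypothetical rule: emit each link target on its own line. -->
  <xsl:template match="/">
    <xsl:for-each select="//a[@href]">
      <xsl:value-of select="concat(@href, '&#10;')"/>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

        Keeping the stylesheet in the same file means it can be developed iteratively with xsltproc on the command line and then pasted in unchanged once it emits the right lines.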