  • Can you find the bug?

    Er... which one? :-)

    The one that fails to encode the left angle bracket in what is (presumably) character data, or the one that assumes that &nbsp; is the only built-in character entity that will be found in an HTML document?
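To make the two complaints concrete, here is a minimal sketch in Python (not the Perl from the original post, which is not shown in this thread). It assumes the standard-library `html` module is an acceptable stand-in for "the real rules"; the helper names `decode_entities` and `encode_character_data` are hypothetical and exist only for illustration.

```python
import html


def decode_entities(text):
    """Decode every named and numeric character reference, not just &nbsp;."""
    # html.unescape knows the full HTML5 named-entity table and also
    # handles decimal (&#169;) and hexadecimal (&#xA9;) references.
    return html.unescape(text)


def encode_character_data(text):
    """Escape the characters that are unsafe in character data."""
    # quote=False limits the escaping to &, < and >, which is what
    # character data (as opposed to an attribute value) needs.
    return html.escape(text, quote=False)


if __name__ == "__main__":
    # Hypothetical scraped fragment containing several different entities.
    scraped = "Fish &amp; chips&nbsp;&lt; &eacute;clair &#169; 2003"
    print(repr(decode_entities(scraped)))
    print(encode_character_data("a < b & c > d"))
```

A scraper that only substitutes `&nbsp;` would pass `&eacute;` and `&#169;` through untouched, and one that writes decoded text back out without the second step would emit a bare `<` into its output.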

    See here [] for the real rules.

    If you intimately know and control the documents being processed, your scraper is naive but workable. I can only hope, however, that you aren't going to offer this as a generic solution. It is not.

    Given that there are mature tools available that would convert the dirtiest of HTML into XML and let you operate on *that* to do your extraction, I have to wonder why you'd go after a solution like this in the first place. "Oh, I'll just whack the string" solutions may be fun exercises, but they can only lead markup-n00bs astray and should *not* be used as examples.
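As a rough illustration of the "parse first, then extract" approach, here is a sketch in Python. It uses the standard library's `html.parser.HTMLParser` as a stand-in for the HTML-to-XML converters mentioned above; it is not one of those tools and it does not produce XML. The `CellExtractor` class, the `extract_cells` helper, and the choice of `<td>` as the target element are all assumptions made for this example, and nested tables are deliberately not handled.

```python
from html.parser import HTMLParser


class CellExtractor(HTMLParser):
    """Collect the text of each <td> cell, tolerating sloppy markup."""

    def __init__(self):
        # convert_charrefs=True makes the parser decode entities for us.
        super().__init__(convert_charrefs=True)
        self.cells = []
        self._buf = None  # None means "not currently inside a cell"

    def _flush(self):
        # Finish the current cell, if there is one.
        if self._buf is not None:
            self.cells.append("".join(self._buf).strip())
            self._buf = None

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased. A new <td> implicitly closes an
        # unclosed previous one (assumption: no nested tables).
        if tag == "td":
            self._flush()
            self._buf = []

    def handle_endtag(self, tag):
        if tag == "td":
            self._flush()

    def handle_data(self, data):
        # Keep text only while inside a cell.
        if self._buf is not None:
            self._buf.append(data)


def extract_cells(markup):
    parser = CellExtractor()
    parser.feed(markup)
    parser.close()
    return parser.cells


if __name__ == "__main__":
    dirty = "<table><tr><TD class=x>Fish &amp; chips<td>1 &lt; 2</td></tr></table>"
    print(extract_cells(dirty))
```

The point of the design is that tag case, unquoted attributes, missing end tags, and entity decoding are handled by the parser, so the extraction logic never has to match raw strings.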

    BAD GNAT, NO COOKIE!!! ;->


    • Blah blah :-) Yes, I could be more rigorous with entities. It works for the specific documents I was scraping. The bug I was referring to is a Perl bug, not a design bug.

      If I had to convert the HTML to XML and work on that, I'd slit my wrists. For all the haughty condescension about "naive but workable", the key part is "workable". It was easy to write and worked. This isn't a generic solution to extracting information, but it's a very nice specific solution.