Journal of gnat (29)

Tuesday April 29, 2003
07:38 PM

Spot the Bug

[ #11914 ]
I pasted my screenscraping code in a comment. There's a subtle bug in the way it deals with HTML, though. It's nothing to do with LWP, fetching the pages, or the use of HTML::TableContentParser. The bug leads to invalid XML. Can you find the bug?


  • Guessing (Score:3, Insightful)

    by Ovid (2709) on 2003.04.29 20:00 (#19595) Homepage Journal

    Well, in clean(), we see this:

    $text =~ s{<.*?>}{}g;

    Not being familiar with HTML::TableContentParser, I can only guess, but you appear to be using this to strip tags. If a tag has an attribute whose value contains a greater-than sign (">"), then this breaks.
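To make the failure concrete, here is a minimal sketch. The input string is made up for illustration and is not taken from the scraper's real documents; only the substitution itself comes from the post.

```perl
use strict;
use warnings;

# Hypothetical input: an attribute value that contains ">".
my $text = q{<img alt="a > b" src="x.png">caption};

# The same non-greedy tag stripper quoted above.
(my $stripped = $text) =~ s{<.*?>}{}g;

# The match ends at the first ">", which is inside the alt value,
# so the rest of the tag is left behind in the output.
print "$stripped\n";    # prints: ' b" src="x.png">caption' (without the quotes)
```

The non-greedy match stops at the first ">" it sees, so the remainder of the tag leaks into the text.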

    • Re:Guessing (Score:3, Interesting)

      You're right, this isn't perfect. However, this wasn't the bug that was giving me invalid output. I had no HTML where <.*?> was insufficient.

      The problem was that I had a certain character unescaped in the HTML, and my solution to this was naive to say the least...


  • Can you find the bug?

    Er... which one? :-)

    The one that fails to encode the left angle bracket in what is (presumably) character data, or the one that assumes that &nbsp; is the only built-in character entity that will be found in an HTML document?

    See here for the real rules.

    If you intimately know and control the documents being processed, your scraper is naive but workable. I can only hope, however, that you aren't going to offer this as a generic solution. It is not.
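As a rough illustration of the first point, escaping character data for XML might look something like the sketch below. The helper name `escape_xml_text` is hypothetical and is not part of the posted scraper. It assumes the input is plain text with no entities in it already, so it does nothing about the second point: an existing `&nbsp;` would be double-escaped.

```perl
use strict;
use warnings;

# Hypothetical helper: escape the characters XML reserves in character data.
# Assumes $text contains no existing entities.
sub escape_xml_text {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;    # ampersand first, or the next two get re-escaped
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

my $escaped = escape_xml_text('if a < b & b > c');
print "$escaped\n";    # prints: if a &lt; b &amp; b &gt; c
```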

    Given that there are ma

    • Blah blah :-) Yes, I could be more rigorous with entities. It works for the specific documents I was scraping. The bug I was referring to is a Perl bug, not a design bug.

      If I had to convert the HTML to XML and work on that, I'd slit my wrists. For all the haughty condescension about "naive but workable", the key part is "workable". It was easy to write and worked. This isn't a generic solution to extracting information, but it's a very nice specific solution.


  • The bug is in the clean subroutine. I say


    The expectation was that \b matches at word boundaries, so that HTML & CSS would become HTML &amp; CSS. Foolish me. There is no word boundary between a space and an ampersand. I gave up trying to be smart and took out the \b's to make it work. I could have tried capturing and replacing the spaces, but until it breaks and I need to be smart again, I'll continue with my dumb approach.
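The original substitution isn't shown here, so the patterns below are only a guess at its shape. A small sketch of why \b fails next to a space, assuming something like `s/\b&\b/&amp;/g`:

```perl
use strict;
use warnings;

# Assumed pattern, not the actual code from clean().
my $spaced = 'HTML & CSS';
my $tight  = 'AT&T';

(my $spaced_b = $spaced) =~ s/\b&\b/&amp;/g;    # \b on both sides
(my $tight_b  = $tight)  =~ s/\b&\b/&amp;/g;
(my $spaced_n = $spaced) =~ s/&/&amp;/g;        # \b removed

# \b needs a word character on exactly one side. Space and "&" are both
# non-word characters, so there is no boundary between them.
print "$spaced_b\n";    # prints: HTML & CSS      (unchanged)
print "$tight_b\n";     # prints: AT&amp;T        (letters on both sides)
print "$spaced_n\n";    # prints: HTML &amp; CSS
```

With the \b's gone, every "&" is escaped. That is fine for input with no existing entities, but it would double-escape anything that is already encoded.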

    I need to pay more attention to my own advic

    • In real-world text, \b normally works fine for me for matching word boundaries. Of course, I don't think of "&" as a word when I'm doing text searches. I really can't remember having much trouble with it -- it's the same as using most search engines.

      Also, I've occasionally used \b in one-off (or even one-liner) HTML manipulations with things like s{</?(font|b|i)\b[^>]*>}{}gi. The \b ensures that I'm getting the whole HTML element name (not matching <br> when looking for <b>, for