I pasted my screenscraping code in a comment. There's a subtle bug in the way it deals with HTML, though. It's nothing to do with LWP, fetching the pages, or the use of HTML::TableContentParser. The bug leads to invalid XML. Can you find the bug?
Blah blah:-) Yes, I could be more rigorous with entities. It works for the specific documents I was scraping. The bug I was referring to is a Perl bug, not a design bug.
Guessing (Score:3, Insightful)
Well, in clean(), we see this:
    $text =~ s{<.*?>}{}g;

Not being familiar with HTML::TableContentParser, I can only guess, but you appear to be expecting simple tags. If a tag has attributes and one of them has a value containing a greater-than sign (">"), then this breaks.
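To make that guess concrete, here is a minimal sketch. The sample input and the sub names are made up for illustration, and the second pattern is only one possible way to tolerate quoted attribute values. It is not taken from the original scraper and is no substitute for a real HTML parser.

```perl
use strict;
use warnings;

# Same substitution as in clean(): the non-greedy match ends at the first ">".
sub strip_tags_naive {
    my ($text) = @_;
    $text =~ s{<.*?>}{}g;
    return $text;
}

# Hypothetical alternative: treat quoted attribute values as single units,
# so a ">" inside quotes does not end the tag.
sub strip_tags_quoted {
    my ($text) = @_;
    $text =~ s{<(?:[^>"']|"[^"]*"|'[^']*')*>}{}g;
    return $text;
}

# Made-up input: the title attribute contains a literal ">".
my $html = '<a title="x>y">link</a>';

print strip_tags_naive($html), "\n";    # prints: y">link
print strip_tags_quoted($html), "\n";   # prints: link
```

The naive version treats the ">" inside the attribute value as the end of the tag, so the rest of the attribute leaks into the text.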
Re:Guessing (Score:3, Interesting)
The problem was that I had a certain character unescaped in the HTML, and my solution to this was naive to say the least...
--Nat
Re: Spot the Bug (Score:1)
Er... which one? :-)
The one that fails to encode the left angle bracket in what is (presumably) character data, or the one that assumes that is the only built-in character entity that will be found in an HTML document?
See here [w3.org] for the real rules.
If you intimately know and control the documents being processed, your scraper is naive but workable. I can only hope, however, that you aren't going to offer this as a generic solution. It is not.
Given that there are ma
Re: Spot the Bug (Score:2)
If I had to convert the HTML to XML and work on that, I'd slit my wrists. For all the haughty condescension about "naive but workable", the key part is "workable". It was easy to write and worked. This isn't a generic solution to extracting information, but it's a very nice specific solution.
--Nat
Solution (Score:2)
The expectation was that \b matches at word boundaries, so that HTML & CSS would become HTML &amp; CSS. Foolish me. There is no word boundary between a space and an ampersand. I gave up trying to be smart and took out the \b's to make it work. I could have tried capturing and replacing the spaces, but until it breaks and I need to be smart again, I'll continue with my dumb approach.
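A small sketch of the behaviour described above. The exact substitution from the scraper isn't quoted in this thread, so s/\b&\b/&amp;/g is an assumed form, and the sub names are invented for the example.

```perl
use strict;
use warnings;

# Assumed form of the original attempt: escape "&" only between word boundaries.
sub escape_with_boundary {
    my ($text) = @_;
    $text =~ s/\b&\b/&amp;/g;
    return $text;
}

# The "dumb" version with the \b's taken out.
sub escape_plain {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    return $text;
}

# \b needs a word character on exactly one side. Space and "&" are both
# non-word characters, so there is no boundary around a free-standing "&".
print escape_with_boundary('HTML & CSS'), "\n";   # prints: HTML & CSS
print escape_with_boundary('AT&T'), "\n";         # prints: AT&amp;T
print escape_plain('HTML & CSS'), "\n";           # prints: HTML &amp; CSS
```

Note that the plain version would also re-escape an ampersand that already starts an entity, which is fine only if the input is known not to contain any.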
I need to pay more attention to my own advic
Re:Solution (Score:2)
In real-world text, \b normally works fine for me for matching word boundaries. Of course, I don't think of "&" as a word when I'm doing text searches. I really can't remember having much trouble with it -- it's the same as using most search engines.
Also, I've occasionally used \b in one-off (or even one-liner) HTML manipulations with things like s{</?(font|b|i)\b[^>]*>}{}gi. The \b ensures that I'm getting the whole HTML element name (not matching <br> when looking for <b>, for
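As a quick illustration of what the \b buys in that pattern, here is a sketch comparing it with the same substitution minus the \b. The sample markup and sub names are made up.

```perl
use strict;
use warnings;

# The substitution from the comment above: \b forces the element name to end
# right after "font", "b", or "i".
sub strip_with_boundary {
    my ($html) = @_;
    $html =~ s{</?(font|b|i)\b[^>]*>}{}gi;
    return $html;
}

# Same pattern without \b, for comparison: "<br>" now matches as "<b" + "r>".
sub strip_without_boundary {
    my ($html) = @_;
    $html =~ s{</?(font|b|i)[^>]*>}{}gi;
    return $html;
}

my $html = '<b>bold</b><br><font size="2">text</font>';

print strip_with_boundary($html), "\n";      # prints: bold<br>text
print strip_without_boundary($html), "\n";   # prints: boldtext
```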