
All the Perl that's Practical to Extract and Report

The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • not well-formed XML

    I thought all XML was well-formed by definition: anything that isn't well-formed simply isn't XML. The intent was to avoid the tag soup and Do-What-I-Think-You-Meant heuristics that got us the HTML we have today.

    Hence it sounds like even this so-called "RDF" that they are producing is fundamentally broken, if RDF is XML, and XML is well-formed. Not that this helps you, of course :-(

    • The RDF is well-formed; it's the Web site which is not. The RDF was very confusing, though, and I simply don't know it well enough to manually use an XML parser to get all of the data I need.

      • Extracting information from RDF/XML with an XML parser is a fool’s errand. RDF is a graph model, and RDF/XML is merely one (fairly TMTOWTDI-heavy) representation of it. It is possible to design XML documents so that they can also be parsed as RDF, but if the data was modelled in RDF with no consideration given to the XML parsing case, then trying to parse its RDF/XML representation is likely to produce code more analogous to a regex-based HTML scraper than a parser.
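
        To make the TMTOWTDI point concrete, here is a small sketch, assuming XML::LibXML is installed. The two documents and the example.org URI are made up for illustration. Both serialise the same single triple, once as a property element and once as a property attribute, yet an XPath written against the first form finds nothing in the second.

        ```perl
        use strict;
        use warnings;
        use XML::LibXML;

        my $RDF = 'http://www.w3.org/1999/02/22-rdf-syntax-ns#';
        my $DC  = 'http://purl.org/dc/elements/1.1/';

        # Hypothetical data: one triple, written in two equally valid ways.
        my %variant = (
            element   => qq{<rdf:RDF xmlns:rdf="$RDF" xmlns:dc="$DC">}
                       . q{<rdf:Description rdf:about="http://example.org/doc">}
                       . q{<dc:title>Hello</dc:title>}
                       . q{</rdf:Description></rdf:RDF>},
            attribute => qq{<rdf:RDF xmlns:rdf="$RDF" xmlns:dc="$DC">}
                       . q{<rdf:Description rdf:about="http://example.org/doc" dc:title="Hello"/>}
                       . q{</rdf:RDF>},
        );

        # Count how many dc:title *elements* each serialisation contains.
        my %hits;
        for my $name (sort keys %variant) {
            my $doc = XML::LibXML->load_xml(string => $variant{$name});
            my $xpc = XML::LibXML::XPathContext->new($doc);
            $xpc->registerNs(dc => $DC);
            my @nodes = $xpc->findnodes('//dc:title');
            $hits{$name} = scalar @nodes;
            print "$name form: $hits{$name} match(es) for //dc:title\n";
        }
        ```

        The element form yields one match and the attribute form none, even though an RDF parser would report the same triple for both. Every extra abbreviation RDF/XML allows is another special case the XML-level code would have to grow.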

  • I've done some work with Web::Scraper, and I found that I mostly give it XPath expressions, which it handles fairly well, even with tag soup.

    I have a talk on Web::Scraper [datenzoo.de] online, but it's in German. The hilarious babelfish translation [66.196.80.202] might provide some shallow entertainment to you though.
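
    For readers who skip the talk, a minimal sketch of what feeding XPath to Web::Scraper can look like. The markup and the `urls` key are invented for the example, and it assumes Web::Scraper is installed.

    ```perl
    use strict;
    use warnings;
    use Web::Scraper;

    # Hypothetical tag soup: the list items and the list itself are never closed.
    my $html = q{<html><body><ul>}
             . q{<li><a href="/one">First</a>}
             . q{<li><a href="/two">Second</a>}
             . q{</body></html>};

    # A selector that starts with "/" is treated as XPath rather than CSS.
    my $links = scraper {
        process '//li/a', 'urls[]' => '@href';
    };

    # scrape() also accepts a URI object; a plain string is parsed as content.
    my $result = $links->scrape($html);
    my @urls   = @{ $result->{urls} || [] };
    print "$_\n" for @urls;
    ```

    The `[]` suffix on the key asks for every match to be collected into an array reference rather than only the first one.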

  • The XML::LibXML module has parse_html_string, parse_html_file and parse_html_fh methods that can be used to parse any old crappy HTML. They do tend to spew warnings to STDERR whether you want them or not, but you can localise a redirection of STDERR if you don't want them.
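
    A rough sketch of the localised redirection, assuming XML::LibXML is installed and that its parse warnings are emitted through Perl's own STDERR handle (this may vary between versions). The broken markup is made up for the example.

    ```perl
    use strict;
    use warnings;
    use File::Spec;
    use XML::LibXML;

    # Hypothetical broken markup: unclosed <p> elements and an unknown tag.
    my $html = '<html><body><p>One<p>Two <blink>three</body></html>';

    # recover => 1 tells the parser to keep going after errors.
    my $parser = XML::LibXML->new(recover => 1);

    my $doc = do {
        # STDERR is replaced only until the end of this block.
        local *STDERR;
        open STDERR, '>', File::Spec->devnull
            or die "Cannot redirect STDERR: $!";
        $parser->parse_html_string($html);
    };

    my @paragraphs = $doc->findnodes('//p');
    print $_->textContent, "\n" for @paragraphs;
    ```

    Newer releases of XML::LibXML also document `recover => 2` as a silent recovery mode, which may make the redirection unnecessary.
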
  • With HTML::TreeBuilder::XPath you can do xpath-like searches on HTML documents.
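
    Something along these lines, as a sketch only. It assumes HTML::TreeBuilder::XPath is installed; the markup and the `note` class are invented for illustration.

    ```perl
    use strict;
    use warnings;
    use HTML::TreeBuilder::XPath;

    # Hypothetical tag soup: none of the <p> elements are closed.
    my $html = '<div><p class="note">First<p>Second<p class="note">Third</div>';

    my $tree = HTML::TreeBuilder::XPath->new;
    $tree->parse($html);
    $tree->eof;    # signal end of input so the tree is finalised

    # findnodes() returns HTML::Element-style nodes, so as_text() is available.
    my @notes = map { $_->as_text } $tree->findnodes('//p[@class="note"]');
    print "$_\n" for @notes;

    $tree->delete;    # free the tree explicitly, as HTML::TreeBuilder recommends
    ```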