NOTE: use Perl; is on undef hiatus. You can read content, but you can't post it. More info will be forthcoming forthcomingly.

All the Perl that's Practical to Extract and Report


Ovid (2709)
http://publius-ovidius.livejournal.com/
AOL IM: ovidperl

Stuff with the Perl Foundation. A couple of patches in the Perl core. A few CPAN modules. That about sums it up.

Journal of Ovid (2709)

Thursday November 20, 2008
11:46 AM

Oh god, please, no.

[ #37912 ]

Struggling all day with Gutenberg. Someone (not naming them as I don't have permission) sent me code to let me use Redland for my RDF parsing and it looks lovely. Too bad Redland doesn't compile for anyone. Didn't compile for me, either.

I put this aside for a bit and tried parsing result pages.

Tried to use the Web::Scraper module to at least pull results from Web pages, but I'm too stupid to figure out its syntax. Learning a new API and CSS selectors while battling strange "don't know what to do with undef" errors proved too much. Embarrassing.

I thought to use HTML::TableParser for some stuff, but that doesn't seem to let me at the attributes I need.

I thought XPath would be good, but the pages aren't well-formed XML. Someone mentioned to me that there might be an XPath module which might have an option which might let you parse malformed XML. I didn't follow up on that.

I finally switched to my HTML::TokeParser::Simple module for this. It's not a good fit for this problem. No, scratch that. It's a bad fit for this problem, but it worked. Then I turned back to search. For that, I used WWW::Mechanize. Notice anything, um, crap about these damned results?

sub search {
    my $self = shift;
    my $mech = WWW::Mechanize->new(
        agent     => 'App::Gutenberg (perl)',
        autocheck => 1,
    );

    $mech->get(App::Gutenberg->search_url);

    $mech->submit_form(
        form_number => 1,
        fields      => {
            'author' => ($self->author || ''),
            'title'  => ($self->title  || ''),
        }
    );

    my $uri = $mech->uri;
    if ( $uri =~ /#([[:word:]]+)\z/ ) {
        # you have got to
    }
    else {
        # be kidding me
    }
}

If that URL matches, you're indexing into a list of <li> elements. Otherwise, you're parsing a table. Either way, it's a right pain to get the data you want. Oh, and it's subtly different sets of data and the criteria for why it would be one type of result or another is unclear.
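That dispatch can be sketched as a tiny helper. The function name and URLs here are made up for illustration: a fragment at the end of the post-submit URI means you got the <li> list, anything else means you got the table.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: given the post-submit URI (a string or a URI
# object -- it stringifies), decide which scraper to dispatch to.
# A fragment at the end means the results are an <li> list;
# otherwise they come back as a table.
sub result_format {
    my ($uri) = @_;
    return "$uri" =~ /#([[:word:]]+)\z/ ? 'list' : 'table';
}

# Made-up URLs for illustration:
print result_format('http://www.gutenberg.org/results#a123'), "\n";
print result_format('http://www.gutenberg.org/results?title=moby'), "\n";
```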

This is why I want to see a REST API for just about everything these days. It's simple. It's straightforward. It doesn't make me cry. Now I know why you don't see Perl command-line clients for Gutenberg. Everything I'm writing is so damned fragile it will break if you look at it funny. *sniff*

Update: it looks like any search with an author will return a list, but all other searches (only tested the basic form) return tables.

The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • not well-formed XML

    I thought there was only well-formed XML. Anything that is not well-formed is simply not XML. The intent being to avoid the tag soup and Do-What-I-Think-You-Meant heuristics that got us to the HTML we have today.

    Hence it sounds like even this so-called "RDF" that they are producing is fundamentally broken, if RDF is XML, and XML is well-formed. Not that this helps you, of course :-(

    • The RDF is well-formed; it's the Web site which is not. The RDF was very confusing, though, and I simply don't know it well enough to manually use an XML parser to get all of the data I need.

      • Extracting information from RDF/XML with an XML parser is a fool’s errand. RDF is a graph model, and RDF/XML is merely one (fairly TMTOWTDI-heavy) representation of it. It is possible to design XML documents so that they can also be parsed as RDF, but if the data was modelled in RDF with no consideration given to the XML parsing case, then trying to parse its RDF/XML representation is likely to produce code more analogous to a regex-based HTML scraper than a parser.
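A concrete illustration of that TMTOWTDI problem, with an invented triple: both of the following RDF/XML documents state exactly the same thing, yet an XPath expression written against the first (say, `//dc:title/text()`) finds nothing in the second.

```xml
<!-- Both documents encode the same single triple:
     <http://example.org/book> dc:title "Moby Dick" -->

<!-- Serialization 1: title as a nested property element -->
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://example.org/book">
    <dc:title>Moby Dick</dc:title>
  </rdf:Description>
</rdf:RDF>

<!-- Serialization 2: title as an XML attribute -->
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://example.org/book"
                   dc:title="Moby Dick"/>
</rdf:RDF>
```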

  • I've done some work with Web::Scraper, and I found that I mostly give it XPath syntax, which it handles fairly well, even with tagsoup.

    I have a talk on Web::Scraper [datenzoo.de] online, but it's in German. The hilarious babelfish translation [66.196.80.202] might provide some shallow entertainment to you though.
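For what it's worth, the XPath style the commenter describes looks roughly like this. The HTML snippet and the `books` field name are invented for the sketch, not taken from the real Gutenberg pages:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Web::Scraper;

# Stand-in HTML for a results page; not the real Gutenberg markup.
my $html = <<'END_HTML';
<ul><li><a href="/etext/1342">Pride and Prejudice</a></li>
<li><a href="/etext/2701">Moby Dick</a></li></ul>
END_HTML

my $results = scraper {
    # One record per link; an expression starting with // is treated
    # as XPath, and it copes with tag soup because Web::Scraper
    # parses with an HTML tree builder underneath.
    process '//li/a', 'books[]' => {
        title => 'TEXT',
        href  => '@href',
    };
};

my $data = $results->scrape($html);
print "$_->{title} => $_->{href}\n" for @{ $data->{books} };
```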

  • The XML::LibXML module has a parse_html_string method that can be used to parse any old crappy HTML. It does tend to spew warnings to STDERR whether you want them or not, but you can localise a redirection of STDERR if you don't want them.
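A sketch of that approach — the sloppy HTML below is invented, and the point is that libxml2's HTML parser fills in the implied tags instead of dying:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

# Deliberately sloppy HTML: unclosed <td> and <tr> tags.
my $html = '<table><tr><td>Frankenstein<td>Shelley</table>';

my $parser = XML::LibXML->new;
$parser->recover(1);    # keep going on malformed input
my $doc = $parser->parse_html_string($html);

# Once parsed you have a real DOM, so real XPath works:
my @cells = map { $_->textContent } $doc->findnodes('//td');
print join(' | ', @cells), "\n";
```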
  • With HTML::TreeBuilder::XPath you can run XPath queries against HTML documents.
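That suggestion in sketch form; the snippet below is invented, not real Gutenberg markup:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Build a tree from messy HTML (the <li> is never closed), then
# query it with XPath.
my $tree = HTML::TreeBuilder::XPath->new_from_content(
    '<ul><li><a href="/etext/84">Frankenstein</a></ul>'
);

my ($href) = $tree->findvalues('//li/a/@href');
my ($name) = $tree->findvalues('//li/a');
print "$name => $href\n";
$tree->delete;    # HTML::TreeBuilder trees want explicit cleanup
```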