Journal of miyagawa (1653)

Friday August 10, 2007
12:25 AM

HTML Tree (DOM) + XPath = Element. The other way round?

[ #34070 ]

Modules like HTML::TreeBuilder::XPath and HTML::Selector::XPath are very useful for extracting content from an HTML DOM tree using XPath expressions or CSS selectors. These modules do the following:

HTML DOM Tree + XPath expression => The element you want
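
As a minimal sketch of this forward direction, here is the same idea using only Python's standard library rather than the Perl modules named above. The sample document, its ids, and its class names are made up for illustration, and well-formed XHTML stands in for a parsed HTML tree.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; well-formed XHTML stands in for an HTML DOM tree.
DOC = """<html><body>
  <div id="nav"><a href="/">Home</a></div>
  <div id="main"><p class="lead">Hello</p><p>World</p></div>
</body></html>"""

root = ET.fromstring(DOC)

# Tree + XPath expression => the element you want.
# ElementTree supports only a subset of XPath, and paths are relative to root.
lead = root.find("./body/div[@id='main']/p[@class='lead']")
second_p = root.find("./body/div[2]/p[2]")

print(lead.text)
print(second_p.text)
```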

Is there a way to do this the other way round? I mean,

HTML DOM Tree + The element you want => XPath expression
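
A rough sketch of what this reverse direction could look like, again in Python's standard library purely for illustration. `positional_xpath` is a hypothetical helper name, not an existing module's API, and it produces only the simplest, purely positional form of expression.

```python
import xml.etree.ElementTree as ET

def positional_xpath(root, target):
    """Return an absolute, purely positional XPath for target within root."""
    # ElementTree nodes have no parent pointer, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    steps = []
    node = target
    while node is not root:
        parent = parent_of[node]  # raises KeyError if target is not under root
        same_tag = [sib for sib in parent if sib.tag == node.tag]
        step = node.tag
        if len(same_tag) > 1:
            # XPath positions are 1-based and count same-named siblings only.
            index = next(i for i, sib in enumerate(same_tag, 1) if sib is node)
            step += "[%d]" % index
        steps.append(step)
        node = parent
    steps.append(root.tag)
    return "/" + "/".join(reversed(steps))

# Hypothetical sample document.
DOC = ("<html><body><div/><div><table>"
       "<tr><td>a</td></tr>"
       "<tr><td>b</td><td>c</td></tr>"
       "</table></div></body></html>")
root = ET.fromstring(DOC)
cell = root.findall(".//td")[2]   # the cell containing "c"
path = positional_xpath(root, cell)
print(path)
```

Because the path is built from the parsed tree itself, it contains no step that the parser did not produce, but it is also as brittle as any index-only path.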

I know a Mozilla extension lets you do this through a GUI, but it's well known that the generated XPath is kind of bogus: it adds extra tbody elements and the like, which makes it useless when you don't use the Gecko engine.

The module would share its concept with Template::Extract, which creates TT templates from stash variables and the generated output.

If anyone knows of prior work on this, let me know. Otherwise I'll start writing a module for it, to make using Web::Scraper much easier. It'd be a nice addition to my YAPC::EU talk.

And yes, all the problems with my flight and hotel in Vienna seem to be sorted out and I'll be there. Yay!
 

  • Are you asking about XML::LibXML::Node [cpan.org]’s nodePath method?

    • More or less what I was going to suggest.

      Keep in mind that 'nodePath' will return something like:

      /html/body/div[3]/table/tr[2]/td[5]

      Which, while correct, might not be the most flexible specification... maybe you really wanted:

      /html/body/div[h2='The table']/table/tr[td[1]='this row']/td[position() = count(../../tr[1]/td[.='this column']/preceding-sibling::td) + 1]
      • Exactly. That's what I don't like about the Mozilla extension approach, either.

        I might want the module to generate several candidate XPath expressions so that the user can pick the one that makes the most reliable scraper.
        • You’ll run into combinatorial explosion for even a relatively short path. There are a great many ways to address a single element.

          I guess what you want, given your comparison with Template::Extract, is a way to accept multiple nodes and then ask for the strictest possible XPath expression (including shared attribute values on any ancestor elements, etc.) that matches them all.

          Hmm, that would be cool.

    • Yeah, this is quite similar to what I have in mind, except it's libxml-based (I want one for HTML::Tree for some reason). But it'll definitely be helpful. Thank you!