miyagawa (1653)


Journal of miyagawa (1653)

Wednesday November 28, 2007
03:15 AM lightning talk

[ #34992 ]
So I went down to the meeting and gave two lightning talks, one about Web::Scraper and one about takesako-san's neat IMG tag hackery. The talks went well, and the other talks were interesting too. Photos uploaded to Flickr tagged
  • I am just going to do some scraping work, and W::S (Web::Scraper) works great so far. The docs are lacking though; the examples you posted in past journal entries helped! I have a few questions though:

    1. The example from the doc has:
      process "h3.ens>a",
      where the "ens" seems to be doing wildcard matching: any class name that contains "ens".
    2. My HTML page contains UTF-8 characters such as è, which made HTML::Parser complain:
      Parsing of undecoded UTF-8 will give garbage when decoding entities
      HTML::Parser mentioned encoding the data.
    • 1. If you want wildcard matching, you can change the selector expression to something like ".ens>a".
      2. Web::Scraper does whatever it can to decode UTF-8 bytes back to Unicode characters, as long as you pass a URI object and the HTML page has a correct Content-Type header. Otherwise you need to fetch the page into a variable and call Encode::decode to get the Unicode characters back.
      3. The result keyword specifies which stash variable you want to get back as the result. You can omit it if you want th
      • Does .ens>a match any class name containing the string 'ens'? What is the syntax for exact matching on a class name then?
        • No, ".ens>a" does an exact match on the class name. If you want to match partial class names, you might need to do a[@class=~"ens"] or something like that. Read the CSS Selector spec for details.
          • Should be a[class~="ens"] that is.
          • No, actually, “.ens > a” matches an “a” element inside an element of any name with class “ens”, whereas “a[class~="ens"]” wants to see the class on the “a” element itself. The partial-match version would actually be “*[class~="ens"] > a”.

            • Eh, I didn't look at the original question very well. The point he didn't get was that class="foo bar" is foo + bar and not "foo bar". Anyway.
        • Er, my bad. I thought class="listing first" was one class name. It is 'listing' and 'first'.

          great module, thanks!
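
The two technical points settled in the thread above can be sketched with core Perl only. This is an illustration, not Web::Scraper's implementation: has_class and class_contains are hypothetical helper names, and the sample strings are made up. In the CSS Selectors spec, both ".ens" and [class~="ens"] test whether "ens" is one of the whitespace-separated tokens of the class attribute; a true substring match is what [class*="ens"] (Selectors Level 3) expresses, and whether Web::Scraper's selector engine accepts that form is not confirmed here. The second half shows the Encode::decode step suggested for pages fetched by hand.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper (not part of Web::Scraper): returns 1 when $name is one
# of the whitespace-separated tokens of a class attribute value, else 0.
# This is the test that ".ens" and [class~="ens"] both apply.
sub has_class {
    my ($class_attr, $name) = @_;
    return (grep { $_ eq $name } split ' ', $class_attr) ? 1 : 0;
}

# Hypothetical helper: returns 1 when $fragment occurs anywhere in the
# attribute value, else 0. This is the substring test of [class*="..."].
sub class_contains {
    my ($class_attr, $fragment) = @_;
    return index($class_attr, $fragment) >= 0 ? 1 : 0;
}

# class="listing first" carries two class names, "listing" and "first".
printf "token 'first' in 'listing first': %d\n", has_class("listing first", "first");
printf "token 'ens' in 'lens wide':       %d\n", has_class("lens wide", "ens");
printf "substring 'ens' in 'lens wide':   %d\n", class_contains("lens wide", "ens");

# Made-up sample: raw bytes as fetched, "caf" followed by the two-byte UTF-8
# encoding of e-grave (U+00E8).
my $bytes = "caf\xC3\xA8";

# Turn the UTF-8 bytes into Perl characters before handing them to a parser.
my $text = decode('UTF-8', $bytes);

printf "bytes: %d, characters: %d\n", length($bytes), length($text);
```

The decoded string, rather than the raw bytes, is what the reply suggests passing on once the page has been fetched into a variable.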