
Journal of miyagawa (1653)

Monday November 26, 2007
05:26 PM

Web::Scraper talk in SF.pm lightning talks 11/27

I'm gonna give a brief five-minute talk about Web::Scraper at the SF.pm meeting tomorrow night (11/27, 7pm, SOMA).

It appears that you need to be a member of the SF.pm mailing list to attend the meeting due to the venue policy, but if you wanna join, let me know and I can talk to the organizer!

Sunday November 25, 2007
06:37 PM

Web::Scraper (HTML::TreeBuilder::XPath) slowdown on Fedora

Today I had an interesting report from a Web::Scraper user, saying that he has a script that runs really quickly (less than 1 second) on a MacBook but very slowly (50 seconds) on an AMD dual-CPU machine. Here's the Devel::DProf report:


Total Elapsed Time = 47.32165 Seconds
    User+System Time = 31.07165 Seconds
Exclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c Name
  51.6 16.03 16.033 6922 0.0023 0.0023 XML::XPathEngine::NodeSet::new
  13.5 4.208 4.208 1777 0.0024 0.0024 XML::XPathEngine::Boolean::True
  13.0 4.048 4.048 1723 0.0023 0.0023 XML::XPathEngine::Literal::new
  11.3 3.518 3.518 1666 0.0021 0.0021 XML::XPathEngine::Boolean::False

We initially thought it was due to some XS library issue with dual CPUs, but it turned out he was using the perl that comes with Fedora, and the rpm version he used was 5.8.8-10.

As addressed in the RH/Fedora bugzilla, the perl 5.8.8 rpm prior to 5.8.8-22 has a nasty patch that makes every new() (or bless) call in classes with overloaded operators really slow. HTML::TreeBuilder::XPath (and hence Web::Scraper) creates a lot of Node objects for HTML pages, and XML::XPathEngine::NodeSet definitely uses overloading.
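
If you want to check whether your perl is affected, a tiny benchmark like this should show it (a minimal sketch using only core modules; the class names are made up):

#!/usr/bin/perl
# Compare bless() speed in a plain class vs. a class with operator
# overloading. On an affected Fedora perl the "overloaded" rate drops
# dramatically; on a healthy perl the two are comparable.
use strict;
use Benchmark qw( cmpthese );
 
{
    package Plain;
    sub new { bless {}, shift }
}
 
{
    package Overloaded;
    use overload '""' => sub { 'x' }, fallback => 1;
    sub new { bless {}, shift }
}
 
cmpthese(-2, {
    plain      => sub { Plain->new },
    overloaded => sub { Overloaded->new },
});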

So this is really due to the Fedora perl patch. If you run into the same issue on Fedora, check your rpm version and upgrade to the latest, or build your own perl, which is always a good thing.

Monday November 19, 2007
01:23 PM

Web::Scraper recipe: download subtitles from wikisubtitles

This extracts subtitle links from WikiSubtitles' Ajax episode listing.

#!/usr/bin/perl
use strict;
use Web::Scraper;
use URI;
 
my $uri = URI->new("http://wikisubtitles.net/ajax_loadShow.php?show=65&season=3");
 
# Find the "idioma" cells that say "English (US)" and collect the href
# of every link in the same table row
my $scraper = scraper {
    process '//td[@class="idioma"][text()=~"English \(US\)"]/..//a', 'links[]' => '@href';
};
my $result = $scraper->scrape($uri);
print "$_\n" for @{ $result->{links} || [] };

You can paste the URLs into Speeddownload and you're all set!

Saturday November 10, 2007
11:48 PM

Web::Scraper now has nth-child(N) support in CSS selector

Thanks to tokuhirom, HTML::Selector::XPath has added support for the nth-child(N) CSS selector, so Web::Scraper can make use of it as well.
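
For instance, something like this should now work (a minimal sketch; the selector and markup are made up for illustration):

use HTML::Selector::XPath qw( selector_to_xpath );
use Web::Scraper;
 
# See what XPath the CSS selector compiles down to
print selector_to_xpath('tr td:nth-child(2)'), "\n";
 
# Or use it directly in a scraper: grab the second cell of every row
my $scraper = scraper {
    process 'tr td:nth-child(2)', 'cells[]' => 'TEXT';
};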

The new release 0.03 is going to CPAN mirrors shortly.

Wednesday November 07, 2007
01:49 AM

Tagging CPAN changes

Question: Is it possible to annotate/tag each CPAN module update so that we can figure out whether the update contains a "security fix", a "minor bug fix", a "major API change" etc.?

Context: At work we have a repository of third-party CPAN modules that we use on Vox or TypePad. Once a module is added to the list, we manually follow its changes to figure out whether we need to upgrade (e.g. fixes for major bugs, security issues, memory leaks) or not to upgrade (e.g. backward-incompatible API changes).

It generally works well, but sometimes we upgrade a module without knowing that it might break our code. In that case we take a look at how hard it would be to update our code to follow the module change, and if it's not easy, we simply revert the upgrade.

So I think it would be nice if we could automatically, or even semi-automatically, know, given module XXX-YYY versions M to N, what kind of changes the upgrade contains, without manually reading Changes and diffing the source code. Note that I'm not saying these audit processes are worthless, but knowing the scale of change an upgrade introduces would make the work a bit easier.

Here are two possible solutions:

1) Have a rough standard for indicating these "minor bug fix", "security fix" or "major API change" kinds of things in the Changes file.

I know CPAN is not a place where we can force all module authors to follow one giant "standard", but we already have some standardization of CPAN module versioning: if a release is a developer release that "normal" users shouldn't upgrade to, we add "_" to the version number so the CPAN ecosystem will ignore it. Could we introduce more things like this, to tag each module update?

I realize that it's not easy, because most authors write the Changes file in a free text format. Some authors use more structured formats like YAML, POD or N3/RDF(!), but I don't like doing that myself. Hm, maybe YAML is acceptable.
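
For instance, a tagged entry in a YAML Changes file might look like this (a purely hypothetical format, invented here for illustration):

---
0.15:
  date: 2007-11-26
  tags:
    - security fix
  changes:
    - Escape user input properly in the HTML output
0.14:
  date: 2007-11-01
  tags:
    - major API change
  changes:
    - Renamed scrape_all() to scrape_list()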

Anyway, if that doesn't sound realistic, I have another solution in mind: 2) a Wiki/del.icio.us-like website where anyone can tag any module release. That might sound like a more Web 2.0 way to accomplish the original purpose :)

We'd probably want to integrate user authentication with PAUSE/BitCard so that we can say "this release is tagged 'minor bug fix' by the author."

Thoughts?

Thursday October 25, 2007
02:18 PM

Web::Scraper hacks #3: Read your browser's cookies

Some websites require you to log in with your credentials to view content. That's easily scriptable with WWW::Mechanize, but if you visit the site frequently with your browser, why not reuse the browser's cookies, so that you don't need to script the login process?

Web::Scraper allows you to call methods on, or entirely swap out, its UserAgent object when it scrapes the website. Here's how to do so:

use Web::Scraper;
use HTTP::Cookies::Guess;
use URI;
 
my $cookie_jar = HTTP::Cookies::Guess->create(file => "/home/miyagawa/.mozilla/cookies.txt");
my $uri = URI->new("http://example.com/members/");  # placeholder: any page behind the login
my $s = scraper { };
$s->user_agent->cookie_jar($cookie_jar);
$s->scrape($uri);

This snippet uses HTTP::Cookies::Guess, which provides a common API for reading browsers' cookie files (the module supports IE, Firefox, Safari and w3m), and sets the result as the cookie jar on the UserAgent object.

If you'd like to change the behavior globally, you can also do:

$Web::Scraper::UserAgent->cookie_jar($cookie_jar);

Either way, you can avoid hard-coding your username and password in the scraping script, which is a huge win.

Wednesday October 17, 2007
08:38 PM

Better CPAN RSS feed

search.cpan.org has an RSS feed for recently uploaded modules, but there's one minor problem: the feed doesn't carry rich metadata.

Daisuke Murase (a.k.a. typester on CPAN and IRC) created a site called CPAN Recent Changes a while ago, and it's been really useful for people tracking activity on CPAN.

The feature the site provides is very simple: "a better recent change log for CPAN". The site tracks recently uploaded modules from search.cpan.org, grabs the Changes file and diffs it against the previous version, so you can see what changed in the release (unless the author was too lazy to update the Changes file). The site of course publishes an RSS feed of recent uploads to CPAN, with the changes in the summary field, so you can keep an eye on it without clicking through to see what changed.

You can also follow changes in different views, such as modules under a specific namespace (e.g. Catalyst or DBIx-Class) or modules uploaded by specific authors (e.g. me or Ingy), and they all come with RSS feeds too.

Saturday October 13, 2007
02:59 PM

YAPC::Hawaii

I've been dreaming (with a couple of folks like clkao) about having a YAPC in Hawaii. Hawaii is a great place for everyone to come to, from the West Coast and Midwest of the USA, East Asia (Japan, Taiwan) and Oceania (Australia, NZ). It'd be a great place for attendees to bring their wives and GFs. The conference would begin early in the morning and finish around 3pm so we could get to the beach.

Since Hawaii is not part of the North American continent, it shouldn't be YAPC::NA. YAPC::Pacific.

I don't know where to start. Are there any Perl mongers in Hawaii? Hawaii.pm.org exists, but the site seems way outdated.

Friday October 12, 2007
02:15 PM

Regexp::Debug?

Lazyweb,

Is there a module to debug your regular expressions, comparing the target string and an input regular expression byte by byte? It'd be useful when you have existing code that pattern-matches against a big chunk of text and you don't know why it doesn't match.

use Regexp::Debug;
 
my $string = "abcdefg";
my $regexp = qr/abcefg/; # Notice 'd' is missing
 
my $result = Regexp::Debug->compare($string, $regexp);
 
# $result would be an object or a string to
# indicate that the regexp stopped matching at 'abc'

This is something we do regularly when a regular-expression-based screen scraping tool (BAD! Use Web::Scraper instead!) stops working. I open up two terminal screens, one with the HTML output and one with the regular expression. In the worst case I split the regular expression up in binary-search fashion to find where it breaks.
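
In the absence of such a module, here's a crude sketch of the idea for simple concatenation-only patterns (splitting arbitrary regexes safely is much harder; the naive per-character split below only works when each character is a standalone atom):

#!/usr/bin/perl
use strict;
 
# Try successively longer prefixes of the pattern against the string
# and report where matching stops.
sub find_match_break {
    my ($string, $pattern) = @_;
    my @atoms = split //, $pattern;
    my $last_ok = '';
    for my $i (1 .. @atoms) {
        my $prefix = join '', @atoms[0 .. $i - 1];
        if ($string =~ /^$prefix/) {
            $last_ok = $prefix;
        } else {
            return "stopped matching after '$last_ok' (failed atom '$atoms[$i - 1]')";
        }
    }
    return 'full match';
}
 
print find_match_break('abcdefg', 'abcefg'), "\n";
# => stopped matching after 'abc' (failed atom 'e')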

Thursday October 04, 2007
03:20 AM

Web::Scraper with filters, and thought about Text filters

A developer release of Web::Scraper has been pushed to CPAN, with "filters" support. Let me explain briefly why this filters stuff is useful.

Since an early version, Web::Scraper has had a callback mechanism, which is pretty neat: you can extract "data" out of HTML, not just strings.

For instance, if you have HTML like

<span class="entry-date">2007-10-04T01:09:44-0800</span>

you can get the DateTime object that the string represents, like:

  process ".entry-date", "date" => sub {
    DateTime::Format::W3CDTF->parse_string(shift->as_text);
  };

and with 'filters' you can make this reusable and stackable, like:

package Web::Scraper::Filter::W3CDTFDate;
use base qw( Web::Scraper::Filter );
use DateTime::Format::W3CDTF;
 
# the value to filter is passed as the second argument
sub filter {
    DateTime::Format::W3CDTF->parse_datetime($_[1]);
}
1;

and then:

  process ".entry-date", date => [ 'TEXT', 'W3CDTFDate' ];

If the .entry-date text contains extraneous spaces, you can do:

  process ".entry-date", date => [ 'TEXT', sub { s/^ *| *$//g }, 'W3CDTFDate' ];

This shows how powerful the Web::Scraper filter mechanism can be. It's stackable, extensible, reusable (by making a filter a module) and also scriptable with inline callbacks.

So the next step would be to add a bunch of Web::Scraper::Filter::* modules. I think I'll create a separate distribution, Web::Scraper::Filters, and give everyone commit access so you can add and share your own text filters.

However, I have another, more ideal solution in mind.

The problem is that there are already lots of text filters on CPAN: URI::Escape, HTML::Entities, MIME::Base64, Crypt::CBC, LOLCatz, Kwiki::Formatter... to name a few.

And there are also text processing frameworks that have filter mechanisms: Template-Toolkit, Web::Scraper, Plagger, Kwiki, Test::Base... to name a few. Obviously the number of combinations of text filters and text processing systems explodes.

For instance, TT has a gazillion Template::Filter plugins on CPAN that are only useful with TT. If you want to use one of those text filters in another text processing system (e.g. Web::Scraper, Kwiki, Plagger), you need to port it, or in other words write an adapter interface, for each individual text filter engine.

Doesn't this suck?

I want a common text filter API that takes its input as a string and returns its output as a string. Complex filters like a wiki-to-text engine might also want a configuration option.

use Text::Filter::Common;
my $filter = Text::Filter::Common->new($name, $config);
my $output = $filter->filter($input, $option);

So Text::Filter::Common would be a factory module, and each text filter would be a subclass of Text::Filter::Common::Base or something, implementing a filter function that probably uses $self->config to configure the filter object.
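
A filter under this proposed API might look like this (a minimal sketch; every Text::Filter::Common::* name here is hypothetical):

package Text::Filter::Common::URIEscape;
use base qw( Text::Filter::Common::Base );  # hypothetical base class
use URI::Escape qw( uri_escape );
 
sub filter {
    my ($self, $input, $option) = @_;
    # $self->config would hold per-filter configuration, e.g. a custom
    # set of unsafe characters to escape
    my $unsafe = $self->config->{unsafe};
    return defined $unsafe ? uri_escape($input, $unsafe) : uri_escape($input);
}
1;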

Then we could write adapter interfaces for existing text filter mechanisms like Web::Scraper or Template::Toolkit, and avoid the duplicated effort of re-porting the same text filter to a bunch of different modules.
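
For example, an adapter for TT could be as thin as registering the common filter as a static TT filter (again a sketch; Text::Filter::Common is the hypothetical factory above):

use Template;
use Text::Filter::Common;  # hypothetical factory module
 
my $tt = Template->new(
    FILTERS => {
        # expose the common filter to templates as [% text | uri_escape %]
        uri_escape => sub {
            Text::Filter::Common->new('URIEscape')->filter($_[0]);
        },
    },
);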

It looks like the Text::Filter namespace is already taken, and even though it seems close to what I want, it supports both reading and writing, which is more than I need.

Thoughts?