Web::Scraper with filters, and thought about Text filters

Journal written by miyagawa (1653) and posted by brian_d_foy on 2007.10.04 16:37
A developer release of Web::Scraper has been pushed to CPAN, with "filters" support. Let me explain a bit about why filters are useful.

Since an early version, Web::Scraper has had a callback mechanism, which is pretty neat: you can extract "data" out of HTML, not just strings.

For instance, if you have this HTML:

<span class="entry-date">2007-10-04T01:09:44-08:00</span>
you can get the DateTime object that the string represents, like:

  process ".entry-date", "date" => sub {
    # The callback receives the matched element; parse its text into a DateTime.
    DateTime::Format::W3CDTF->new->parse_datetime(shift->as_text);
  };
and with 'filters' you can make this reusable and stackable, like:

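package Web::Scraper::Filter::W3CDTFDate;
use strict;
use warnings;
use base qw( Web::Scraper::Filter );
use DateTime::Format::W3CDTF;

# Called as $self->filter($value); returns a DateTime object.
sub filter {
    my ($self, $value) = @_;
    return DateTime::Format::W3CDTF->new->parse_datetime($value);
}

1;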
and then:

  process ".entry-date", date => [ 'TEXT', 'W3CDTFDate' ];
If the .entry-date text contains erroneous spaces, you can do:

  process ".entry-date", date => [ 'TEXT', sub { s/^ *| *$//g }, 'W3CDTFDate' ];
This shows how powerful the Web::Scraper filter mechanism can be. It's stackable, extensible, reusable (by making it a module) and also scriptable with inline callbacks.
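
To make the stacking concrete, here is a toy model of a filter chain in plain Perl. It is not Web::Scraper's actual implementation: %FILTERS, apply_filters and the filter names are made up for illustration. It only shows the idea that each entry, whether a named filter or an inline callback, receives the output of the previous one.

```perl
use strict;
use warnings;

# Hypothetical registry of named filters; the names are illustrative only.
my %FILTERS = (
    Trim      => sub { my $t = shift; $t =~ s/^\s+|\s+$//g; return $t },
    Uppercase => sub { return uc shift },
);

# Apply each filter in order, feeding the output of one into the next.
# An entry is either a name looked up in %FILTERS or an inline code ref.
sub apply_filters {
    my ($value, @filters) = @_;
    for my $filter (@filters) {
        my $code = (ref($filter) eq 'CODE' ? $filter : $FILTERS{$filter})
            or die "Unknown filter: $filter";
        $value = $code->($value);
    }
    return $value;
}

my $out = apply_filters("  hello world \n", 'Trim', sub { "$_[0]!" }, 'Uppercase');
print "$out\n";   # prints "HELLO WORLD!"
```

Because every step has the same "value in, value out" shape, named filters and inline callbacks can be mixed freely in one list.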

So the next step would be to add a bunch of Web::Scraper::Filter::* modules. I think I'll create a separate distribution, Web::Scraper::Filters, and give everyone commit access so you can add your own text filters and share them.

However, I have another, more ideal solution in mind.

The problem is that there are already lots of text filters on CPAN: URI::Escape, HTML::Entities, MIME::Base64, Crypt::CBC, LOLCatz, Kwiki::Formatter, to name a few.

And there are also text processing frameworks that have a filter mechanism: Template-Toolkit, Web::Scraper, Plagger, Kwiki, Test::Base, to name a few. Obviously the number of combinations of text filters and text processing systems explodes: every filter needs its own adapter for every framework.

For instance, TT has a gazillion Template::Filter plugins on CPAN that are only useful for TT. If you want to use one of those text filters in another text processing system (e.g. Web::Scraper, Kwiki, Plagger), you need to port it or, in other words, write an adapter interface for each individual text filter engine.

Doesn't this suck?

I want a common text filter API that takes its input as a string and returns its output as a string. For complex filters like a wiki-to-text engine, it would probably be better to have a configuration option as well.

use Text::Filter::Common;
my $filter = Text::Filter::Common->new($name, $config);
my $output = $filter->filter($input, $option);
So Text::Filter::Common is a factory module, and each text filter is a subclass of Text::Filter::Common::Base or something like that and implements a filter method, which probably uses $self->config to configure the filter object.
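
Here is a minimal sketch of what that factory and base class might look like. None of these modules exist; the package names follow the proposal above, and the Trim filter and its squeeze option are invented purely as an example. A real factory would probably also require() the filter module on demand, which is left out here so the sketch fits in one file.

```perl
use strict;
use warnings;

# Hypothetical base class: stores the config and defines the interface.
package Text::Filter::Common::Base;

sub new {
    my ($class, $config) = @_;
    return bless { config => $config || {} }, $class;
}

sub config { $_[0]{config} }

# Subclasses override this: string in, string out.
sub filter { die ref(shift) . " must implement filter()" }

# Hypothetical example filter: trims whitespace, optionally squeezing runs.
package Text::Filter::Common::Trim;
our @ISA = ('Text::Filter::Common::Base');

sub filter {
    my ($self, $input, $option) = @_;
    (my $output = $input) =~ s/^\s+|\s+$//g;
    $output =~ s/\s+/ /g if $self->config->{squeeze};
    return $output;
}

# The factory: maps a short name to a filter class and instantiates it.
package Text::Filter::Common;

sub new {
    my ($class, $name, $config) = @_;
    my $filter_class = "Text::Filter::Common::$name";
    $filter_class->can('filter')
        or die "No such filter: $name";
    return $filter_class->new($config);
}

package main;

my $filter = Text::Filter::Common->new('Trim', { squeeze => 1 });
my $output = $filter->filter("  foo   bar  ");
print "[$output]\n";   # prints "[foo bar]"
```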

Then we can write an adapter interface for existing text filter mechanisms like Web::Scraper or Template::Toolkit, and avoid the duplicated effort of re-porting each text filter to a bunch of different modules.
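
As a rough sketch of the adapter idea: the simplest adapter just wraps a filter object in a plain code reference, which is the shape that inline callbacks (and, as far as I know, TT's static filters) accept. My::Filter::Rot13 and as_coderef are hypothetical names used only for this example; a real adapter for a specific framework would likely need more glue.

```perl
use strict;
use warnings;

# Stand-in for a filter object that follows the proposed common API
# (hypothetical; any object with a filter($input) method would do).
package My::Filter::Rot13;

sub new { return bless {}, shift }

sub filter {
    my ($self, $input) = @_;
    (my $output = $input) =~ tr/A-Za-z/N-ZA-Mn-za-m/;
    return $output;
}

package main;

# Adapter: wrap a common filter object in a plain "text in, text out"
# code ref, a shape that host frameworks can accept as a callback.
sub as_coderef {
    my ($filter) = @_;
    return sub { return $filter->filter($_[0]) };
}

my $rot13 = as_coderef(My::Filter::Rot13->new);
print $rot13->("Hello"), "\n";   # prints "Uryyb"
```

With this shape, one adapter per framework is enough, instead of one port per filter per framework.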

It looks like the Text::Filter namespace is taken. It seems close to what I want, but it supports both reading and writing, which is more than I need.

Thoughts?
  • Text::Pipe?

    (Score:1)
    by hanekomu (8123) on 2007.10.04 4:48 (#58203)
    ( http://hanekomu.vox.com/ | Last Journal: 2007.09.16 9:19 )
    I like the "more ideal" solution of having separate text filters. Since Text::Filter is taken, how about Text::Pipe? After all, the factory method should be able to give you not just one filter, but several filters, piped together.

    And I wouldn't put the factory in a ::Common module; just call it Text::Pipe::Factory. It generates "pipe segments" that are Text::Pipe::* objects, all of which are derived from Text::Pipe::Base.

    Several pipe segments, piped together, could themselves be pipe segments.

    Text::Pipe::* objects could have '|' overloaded so you can combine them in a TT-like syntax.

    Then again, maybe you don't want to limit yourself to piping text; how about arbitrary data structures? E.g., one pipe segment could take an array and reduce(). But maybe that's going too far. (I've had such an idea many years ago but didn't follow up on it.) Piping text is fine.

    Your example regex that deals with erroneous spaces would itself be a pipe segment, something like Text::Pipe::Trim.
  • Kjetil had this same issue several years ago with his Wiki software. His solution was http://search.cpan.org/~kjetilk/Formatter-0.95/
  • I've been thinking similar thoughts while working on the formatter chain of mojomojo. I think there might be two distinct types of formatters tho: formatters that can work on streams, and formatters that work on distinct pieces of content.

    I also think there might be some benefit to providing some more pluggable basic formatters, like html or other text markup, where other formatters can hook into the appropriate place. I guess Web::Scraper is already like that in a way.
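
The pipe-segment idea from the first comment, with '|' overloaded so that segments compose into larger segments, could be sketched roughly like this. My::Pipe is a made-up class for illustration and does not describe the API of any existing module.

```perl
use strict;
use warnings;

# Hypothetical pipe segment: wraps a "text in, text out" code ref.
package My::Pipe;

use overload '|' => \&_compose;

sub new {
    my ($class, $code) = @_;
    return bless { code => $code }, $class;
}

sub filter {
    my ($self, $input) = @_;
    return $self->{code}->($input);
}

# $left | $right yields a new segment that runs $left, then $right.
sub _compose {
    my ($left, $right, $swapped) = @_;
    ($left, $right) = ($right, $left) if $swapped;
    return My::Pipe->new(sub { $right->filter($left->filter($_[0])) });
}

package main;

my $trim  = My::Pipe->new(sub { my $t = shift; $t =~ s/^\s+|\s+$//g; $t });
my $upper = My::Pipe->new(sub { uc shift });

my $pipe = $trim | $upper;                 # a pipeline is itself a segment
print $pipe->filter("  lolcat "), "\n";    # prints "LOLCAT"
```

Since the result of '|' is another segment, pipelines can be nested or extended without any special casing.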