All the Perl that's Practical to Extract and Report

use Perl

miyagawa
http://bulknews.vox.com/
AOL IM: bulknews

Journal of miyagawa (1653)

Thursday October 04, 2007
03:20 AM

Web::Scraper with filters, and thought about Text filters

[ #34607 ]

A developer release of Web::Scraper has been pushed to CPAN, with "filters" support. Let me explain a bit about why these filters are useful.

Since an early version, Web::Scraper has had a pretty neat callback mechanism, so you can extract "data" out of HTML, not just strings.

For instance, if you have HTML like

<span class="entry-date">2007-10-04T01:09:44-08:00</span>

you can get the DateTime object that the string represents, like:

  process ".entry-date", "date" => sub {
    # shift is the matched element; as_text gives its text content
    DateTime::Format::W3CDTF->new->parse_datetime(shift->as_text);
  };

and with 'filters' you can make this reusable and stackable, like:

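package Web::Scraper::Filter::W3CDTFDate;
use strict;
use warnings;
use base qw( Web::Scraper::Filter );
use DateTime::Format::W3CDTF;
 
# Called as a method: $_[0] is the filter object, $_[1] is the extracted text.
sub filter {
    my ($self, $value) = @_;
    return DateTime::Format::W3CDTF->new->parse_datetime($value);
}
1;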

and then:

  process ".entry-date", date => [ 'TEXT', 'W3CDTFDate' ];

If the .entry-date text contains stray leading or trailing spaces, you can do:

  process ".entry-date", date => [ 'TEXT', sub { s/^ *| *$//g }, 'W3CDTFDate' ];

This shows how powerful the Web::Scraper filter mechanism can be. It's stackable, extensible, reusable (by making a filter a module) and also scriptable with inline callbacks.
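To make "stackable" concrete, here is a minimal sketch in plain Perl of what running a value through a filter list amounts to. It is a simplification, not Web::Scraper's own code: apply_filters and the two toy filters are names made up for illustration, and every filter here is assumed to take a value and return a new one.

```perl
use strict;
use warnings;

# Hypothetical helper, not Web::Scraper's own code: feed a value through a
# list of filters, each a code ref that takes a value and returns a new one.
sub apply_filters {
    my ($value, @filters) = @_;
    $value = $_->($value) for @filters;
    return $value;
}

# Two toy filters standing in for an inline callback and a named module.
my $trim = sub { my $s = shift; $s =~ s/^ +| +$//g; return $s };
my $year = sub { return substr($_[0], 0, 4) };

my $result = apply_filters("  2007-10-04T01:09:44-08:00  ", $trim, $year);
print "$result\n";
```

The output of one filter is simply the input of the next, which is why the order in the list matters: the trim has to run before anything that parses the string.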

So the next step would be to add a bunch of Web::Scraper::Filter::* modules. I think I'll create a separate distribution, Web::Scraper::Filters, and give everyone commit access so you can add your own text filters and share them.

However, I have another, more ideal solution in mind.

The problem is: there are already lots of text filters on CPAN. URI::Escape, HTML::Entities, MIME::Base64, Crypt::CBC, LOLCatz, Kwiki::Formatter... to name a few.

And there are also text processing frameworks that have a filter mechanism: Template-Toolkit, Web::Scraper, Plagger, Kwiki, Test::Base... to name a few. Obviously the number of combinations of text filters and text processing systems blows up: every filter has to be ported to every system.

For instance, TT has a gazillion Template::Filter plugins on CPAN that are only useful for TT. If you want to use one of those text filters in another text processing system (e.g. Web::Scraper, Kwiki, Plagger etc.), you need to port it, or in other words, write an adapter interface for each individual text filter engine.

Doesn't this suck?

I want a common text filter API that takes its input as a string and returns its output as a string as well. For complex filters like a wiki-to-text engine, it would probably be better to have a configuration option.

use Text::Filter::Common;
my $filter = Text::Filter::Common->new($name, $config);
my $output = $filter->filter($input, $option);

So Text::Filter::Common is a factory module, and each text filter is a subclass of Text::Filter::Common::Base or something and implements a filter method, which probably uses $self->config to configure the filter object.
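A rough sketch of how the factory and base class might fit together, using only core Perl. Nothing here exists on CPAN: the Upper and Indent filters, the width option and the name-to-class mapping are all assumptions for illustration, and the per-call $option argument is left out to keep it short.

```perl
use strict;
use warnings;

# Hypothetical base class: holds the config and defines the one-method interface.
package Text::Filter::Common::Base;
sub new {
    my ($class, $config) = @_;
    return bless { config => $config || {} }, $class;
}
sub config { $_[0]{config} }
sub filter { die ref($_[0]) . " must implement filter()" }

# Two toy filters, invented for illustration.
package Text::Filter::Common::Upper;
our @ISA = ('Text::Filter::Common::Base');
sub filter { my ($self, $input) = @_; return uc $input }

package Text::Filter::Common::Indent;
our @ISA = ('Text::Filter::Common::Base');
sub filter {
    my ($self, $input) = @_;
    my $pad = ' ' x ($self->config->{width} || 2);   # made-up config key
    (my $out = $input) =~ s/^/$pad/mg;               # indent every line
    return $out;
}

# The factory: maps a short name to a subclass and instantiates it.
package Text::Filter::Common;
sub new {
    my ($class, $name, $config) = @_;
    my $module = "Text::Filter::Common::$name";
    die "Unknown filter '$name'" unless $module->can('filter');
    return $module->new($config);
}

package main;

my $filter = Text::Filter::Common->new('Indent', { width => 4 });
my $output = $filter->filter("foo\nbar");
print "$output\n";
```

A real version would presumably load the subclass on demand instead of requiring it to be defined already, but the string-in, string-out shape of filter() is the part that matters.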

Then we can write an adapter interface for existing text filter mechanisms like Web::Scraper or Template::Toolkit, and avoid the duplicated effort of re-porting each text filter to a bunch of different modules.
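For frameworks that accept a plain callback, the adapter could be as small as the sketch below. My::Common::Trim is a made-up stand-in for a filter following the common API, as_callback is a hypothetical helper, and I'm assuming the host framework passes the value to the callback as its first argument.

```perl
use strict;
use warnings;

# Stand-in for a filter following the common API (hypothetical class).
package My::Common::Trim;
sub new    { return bless {}, shift }
sub filter { my ($self, $input) = @_; $input =~ s/^\s+|\s+$//g; return $input }

package main;

# Adapter: wrap any object with a filter($input) method in a code ref.
sub as_callback {
    my ($filter) = @_;
    return sub { return $filter->filter($_[0]) };
}

my $callback = as_callback(My::Common::Trim->new);
print "[", $callback->("  hello  "), "]\n";
```

Written once per framework, an adapter like this would make every common filter usable there, instead of porting each filter separately.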

It looks like the Text::Filter namespace is taken. It seems close to what I want, but it supports both reading and writing, and that's more than I need.

Thoughts?

The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • I like the "more ideal" solution of having separate text filters. Since Text::Filter is taken, how about Text::Pipe? After all, the factory method should be able to give you not just one filter, but several filters piped together.

    And I wouldn't put the factory in a ::Common module; just call it Text::Pipe::Factory. It generates "pipe segments" that are Text::Pipe::* objects, all of which are derived from Text::Pipe::Base.

    Several pipe segments, piped together, could themselves be pipe segments.

    Text::Pipe::*
    • I don't care much about names, but I disagree with letting Text::Pipe itself stack several filters. Because all filters share the same single filter interface, you don't need to.

      Creating a stacked pipe is easy by creating a new Pipe stacker object, like:

      use Text::Pipe::Stackable;
      use Text::Pipe;
       
      my $pipe1 = Text::Pipe->new('foo');
      my $pipe2 = Text::Pipe->new('bar');
      my $pipe3 = Text::Pipe->new('baz');
       
      my $stacked_pipe = Text::Pipe::Stackable->new($pipe1, $pipe2, $pipe3);
       

      • Agreed re bike-shed discussion; one more point though:

                my $stacked_pipe = Text::Pipe::Stackable->new($pipe1, $pipe2, $pipe3);

        Yes, that's a better design pattern. In that case, Text::Pipe::Stackable->new() should be able to take both individual segments and Text::Pipe::Stackable objects (for a kind of recursive construction).

        That is, stacked pipes should - to the user - be indistinguishable from individual pipe segments. It's just some black hole that has an i
  • Kjetil had this same issue several years ago with his Wiki software. His solution was http://search.cpan.org/~kjetilk/Formatter-0.95/
  • I've been thinking similar thoughts while working on the formatter chain of MojoMojo. I think there might be two distinct types of formatters, though: formatters that can work on streams, and formatters that work on distinct pieces of content.

    I also think there might be some benefit to providing some more pluggable basic formatters, like HTML or other text markup, where other formatters can hook into the appropriate place. I guess Web::Scraper is already like that in a way.
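The stacked-pipe idea from the Text::Pipe thread above, where a stack is indistinguishable from a single segment, could be sketched roughly like this. Text::Pipe::Segment and its code-ref constructor are hypothetical, invented here so the example runs on its own; only the Text::Pipe::Stackable->new(@pipes) call shape comes from the discussion.

```perl
use strict;
use warnings;

# Hypothetical single segment: wraps a code ref behind a filter() method.
package Text::Pipe::Segment;
sub new    { my ($class, $code) = @_; return bless { code => $code }, $class }
sub filter { my ($self, $input) = @_; return $self->{code}->($input) }

# Hypothetical stacker: holds segments and exposes the same filter() method,
# so a stack can itself be used as a segment of another stack.
package Text::Pipe::Stackable;
sub new {
    my ($class, @segments) = @_;
    return bless { segments => [@segments] }, $class;
}
sub filter {
    my ($self, $input) = @_;
    $input = $_->filter($input) for @{ $self->{segments} };
    return $input;
}

package main;

my $trim  = Text::Pipe::Segment->new(sub { my $s = shift; $s =~ s/^\s+|\s+$//g; $s });
my $upper = Text::Pipe::Segment->new(sub { uc shift });
my $wrap  = Text::Pipe::Segment->new(sub { "<" . shift() . ">" });

my $inner = Text::Pipe::Stackable->new($trim, $upper);
my $outer = Text::Pipe::Stackable->new($inner, $wrap);   # a stack inside a stack
print $outer->filter("  hello  "), "\n";
```

Because the stacker only ever calls filter() on its members, it does not need to know whether a member is a single segment or another stack.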