Journal of miyagawa (1653)

Thursday October 04, 2007
03:20 AM

Web::Scraper with filters, and thought about Text filters

[ #34607 ]

A developer release of Web::Scraper has been pushed to CPAN, with "filters" support. Let me explain why filters are useful.

Since an early version, Web::Scraper has had a pretty neat callback mechanism, so you can extract "data" out of HTML, not just strings.

For instance, if you have HTML like

<span class="entry-date">2007-10-04T01:09:44-08:00</span>

you can get the DateTime object that the string represents, like:

  process ".entry-date", "date" => sub {
    # the callback receives the matched element; parse its text content
    DateTime::Format::W3CDTF->new->parse_datetime(shift->as_text);
  };

and with 'filters' you can make this reusable and stackable, like:

package Web::Scraper::Filter::W3CDTFDate;
use base qw( Web::Scraper::Filter );
use DateTime::Format::W3CDTF;
 
sub filter {
    # $_[0] is the filter object, $_[1] is the incoming text
    DateTime::Format::W3CDTF->new->parse_datetime($_[1]);
}
1;

and then:

  process ".entry-date", date => [ 'TEXT', 'W3CDTFDate' ];

If the .entry-date text contains erroneous leading or trailing spaces, you can strip them first:

  process ".entry-date", date => [ 'TEXT', sub { s/^ *| *$//g }, 'W3CDTFDate' ];

This shows how powerful the Web::Scraper filter mechanism can be. It's stackable, extensible, reusable (by making a filter a module) and also scriptable with inline callbacks.
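
To make the stacking idea concrete, here is a minimal sketch in plain Perl of how a list of filters could be applied in order. This is not Web::Scraper's actual implementation: apply_filters and My::Filter::Upcase are hypothetical names made up for illustration, and unlike the inline callback above, which modifies $_, this sketch assumes every callback returns the new value.

```perl
use strict;
use warnings;
use Scalar::Util qw(blessed);

# Hypothetical filter class: anything with a filter($value) method qualifies.
package My::Filter::Upcase;
sub new    { bless {}, shift }
sub filter { uc $_[1] }

package main;

# Run $value through each filter in order; the output of one step
# becomes the input of the next.
sub apply_filters {
    my ($value, @filters) = @_;
    for my $f (@filters) {
        if (ref $f eq 'CODE') {
            $value = $f->($value);          # inline callback
        }
        elsif (blessed $f && $f->can('filter')) {
            $value = $f->filter($value);    # filter object
        }
        else {
            die "Don't know how to apply filter: $f";
        }
    }
    return $value;
}

# An inline callback that trims surrounding whitespace and returns the result.
my $trim = sub { my $s = shift; $s =~ s/^\s+|\s+$//g; $s };

my $out = apply_filters("  hello  ", $trim, My::Filter::Upcase->new);
print "$out\n";    # prints "HELLO"
```

Because every step has the same shape (a value in, a value out), the order of the list is the only thing that determines the result.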

So the next step would be to add a bunch of Web::Scraper::Filter::* modules. I think I'll create a separate distribution, Web::Scraper::Filters, and give everyone commit access so you can add your own text filters and share them.

However, I have another, more ideal solution in mind.

The problem is that there are already lots of text filters on CPAN: URI::Escape, HTML::Entities, MIME::Base64, Crypt::CBC, LOLCatz, Kwiki::Formatter... to name a few.

There are also text processing frameworks that have a filter mechanism: Template-Toolkit, Web::Scraper, Plagger, Kwiki, Test::Base... to name a few. Obviously, the number of filter-and-framework combinations explodes: every new text filter needs a separate port for each of these text processing systems.

For instance, TT has a gazillion Template::Filter plugins on CPAN that are only useful for TT. If you want to use one of those text filters in another text processing system (e.g. Web::Scraper, Kwiki, Plagger etc.), you need to port it or, in other words, write an adapter for each individual text filter engine.

Doesn't this suck?

I want a common text filter API that takes its input as a string and returns its output as a string, too. For complex filters like a wiki-to-text engine, it would be better to have a configuration option as well.

use Text::Filter::Common;
my $filter = Text::Filter::Common->new($name, $config);
my $output = $filter->filter($input, $option);

So Text::Filter::Common is a factory module, and each text filter is a subclass of Text::Filter::Common::Base or something similar. Each subclass implements a filter method, which would probably read $self->config to configure the filter object.
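
A minimal sketch of how that factory and base class might fit together is below. None of these modules exist; Text::Filter::Common::Repeat and its "times" option are invented purely for illustration, and a real factory would load the subclass from disk rather than require it to be defined already.

```perl
use strict;
use warnings;

# Base class: holds the config and defines the one-method interface.
package Text::Filter::Common::Base;
sub new {
    my ($class, $config) = @_;
    return bless { config => $config || {} }, $class;
}
sub config { $_[0]{config} }
sub filter { die ref($_[0]) . " must implement filter()" }

# A hypothetical concrete filter, defined inline for this sketch.
package Text::Filter::Common::Repeat;
our @ISA = ('Text::Filter::Common::Base');
sub filter {
    my ($self, $input) = @_;
    return $input x ($self->config->{times} || 1);
}

# Factory: maps a short name to a subclass and instantiates it.
package Text::Filter::Common;
sub new {
    my ($class, $name, $config) = @_;
    my $module = "Text::Filter::Common::$name";
    # Assumption: the subclass is already defined in this file.
    $module->can('filter') or die "Unknown filter: $name";
    return $module->new($config);
}

package main;

my $filter = Text::Filter::Common->new('Repeat', { times => 2 });
print $filter->filter("ab"), "\n";    # prints "abab"
```

The point of the design is that callers only ever see a name, a config hash and a filter method, so any framework can drive any filter without knowing its class.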

Then we can write adapter interfaces for existing text filter mechanisms like Web::Scraper or Template::Toolkit, and avoid the duplicated effort of porting each text filter to a bunch of different modules.

It looks like the Text::Filter namespace is taken. It seems close to what I want, but it supports both reading and writing, and that's more than I need.
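
As a rough sketch of what such an adapter could look like, the code below wraps any object that follows the common API (a filter method that takes a string and returns a string) in a plain code ref. My::Common::Reverse and as_callback are hypothetical names for illustration only, and the sketch assumes the host framework passes the text as the first argument to the callback.

```perl
use strict;
use warnings;

# Stand-in for a common-API filter: an object with filter($input).
# (Hypothetical; a real one would come from the proposed factory.)
package My::Common::Reverse;
sub new    { bless {}, shift }
sub filter { scalar reverse $_[1] }

package main;

# Adapter: wrap any common-API filter as a code ref that takes a
# string and returns the filtered string.
sub as_callback {
    my ($filter) = @_;
    return sub { $filter->filter($_[0]) };
}

my $cb = as_callback(My::Common::Reverse->new);
print $cb->("abc"), "\n";    # prints "cba"
```

Since Web::Scraper accepts inline callbacks in its filter list, a code ref like $cb could presumably be dropped in next to 'TEXT' in a process rule. Other frameworks would each need their own small wrapper, but only one per framework instead of one per filter.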

Thoughts?

  • I like the "more ideal" solution of having separate text filters. Since Text::Filter is taken, how about Text::Pipe? After all, the factory method should be able to give you not just one filter, but several filters, piped together.

    And I wouldn't put the factory in a ::Common module; just call it Text::Pipe::Factory. It generates "pipe segments" that are Text::Pipe::* objects, all of which are derived from Text::Pipe::Base.

    Several pipe segments, piped together, could themselves be pipe segments.

    Text::Pipe::*
    • I don't care much about names, but I disagree with letting Text::Pipe itself stack several filters. Because all filters share the same single filter interface, you don't need to.

      Creating a stacked pipe is easy: just create a new pipe stacker object, like:

      use Text::Pipe::Stackable;
      use Text::Pipe;
       
      my $pipe1 = Text::Pipe->new('foo');
      my $pipe2 = Text::Pipe->new('bar');
      my $pipe3 = Text::Pipe->new('baz');
       
      my $stacked_pipe = Text::Pipe::Stackable->new($pipe1, $pipe2, $pipe3);
       

      • Agreed re bike-shed discussion; one more point though:

                my $stacked_pipe = Text::Pipe::Stackable->new($pipe1, $pipe2, $pipe3);

        Yes, that's a better design pattern. In that case, Text::Pipe::Stackable->new() should be able to take individual segments as well as Text::Pipe::Stackable objects (for a kind of recursive construction).

        That is, stacked pipes should - to the user - be indistinguishable from individual pipe segments. It's just some black hole that has an i
  • Kjetil had this same issue several years ago with his wiki software. His solution was http://search.cpan.org/~kjetilk/Formatter-0.95/
  • I've been thinking similar thoughts while working on the formatter chain of mojomojo. I think there might be two distinct types of formatters, tho: formatters that can work on streams, and formatters that work on distinct pieces of content.

    I also think there might be some benefit to providing some more pluggable basic formatters, like html or other text markup, where other formatters can hook into the appropriate place. I guess Web::Scraper is already like that in a way.