A developer release of Web::Scraper has been pushed to CPAN, with "filters" support. Let me explain for a bit why this filters stuff is useful.
Since an early version, Web::Scraper has had a pretty neat callback mechanism, so you can extract "data" out of HTML, not just strings.
For instance, if you have this HTML:
<span class="entry-date">2007-10-04T01:09:44-0800</span>
you can get the DateTime object that the string represents, like:
process ".entry-date", "date" => sub {
    # parse_datetime is the parsing method DateTime::Format::W3CDTF provides
    DateTime::Format::W3CDTF->new->parse_datetime(shift->as_text);
};
and with 'filters' you can make this reusable and stackable, like:
package Web::Scraper::Filter::W3CDTFDate;
use base qw( Web::Scraper::Filter );
use DateTime::Format::W3CDTF;
sub filter {
    # $_[1] is the incoming text value handed down the filter stack
    DateTime::Format::W3CDTF->new->parse_datetime($_[1]);
}
1;
and then:
process ".entry-date", date => [ 'TEXT', 'W3CDTFDate' ];
If the .entry-date text has leading or trailing spaces, you can strip them first with an inline callback:
process ".entry-date", date => [ 'TEXT', sub { s/^ *| *$//g }, 'W3CDTFDate' ];
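To make the stacking idea concrete, here is a minimal standalone sketch of how a filter list could be applied in order. It is not Web::Scraper's actual implementation: the apply_filters function and the My::Filter::Upper class are hypothetical names made up for this illustration, and the convention that inline callbacks edit $_ in place is an assumption based on the s/// example above.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a named filter module; not part of Web::Scraper.
package My::Filter::Upper {
    sub new    { return bless {}, shift }
    sub filter { my ($self, $value) = @_; return uc $value }
}

# Apply each filter in order, feeding the output of one into the next.
# A filter is either a code reference or the short name of a class
# under My::Filter:: (both conventions are assumptions of this sketch).
sub apply_filters {
    my ($value, @filters) = @_;
    for my $filter (@filters) {
        if (ref $filter eq 'CODE') {
            # Inline callback: it edits $_ in place, as s/// does by default.
            local $_ = $value;
            $filter->();
            $value = $_;
        }
        else {
            my $class = "My::Filter::$filter";
            $value = $class->new->filter($value);
        }
    }
    return $value;
}

my $out = apply_filters("  hello  ", sub { s/^ *| *$//g }, 'Upper');
print "[$out]\n";   # prints "[HELLO]"
```

The point of the sketch is only that named filters and inline callbacks can share one list, because each step takes a value and hands a value to the next step.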
This shows how powerful the Web::Scraper filter mechanism can be. It's stackable, extensible, reusable (by making a filter a module) and also scriptable with inline callbacks.
So the next step would be to add a bunch of Web::Scraper::Filter::* modules. I think I'll create a separate distribution, Web::Scraper::Filters, and give everyone commit access so you can add your own text filters and share them.
However, I have another, more ideal solution in mind.
The problem is: there are already lots of text filters on CPAN. URI::Escape, HTML::Entities, MIME::Base64, Crypt::CBC, LOLCatz, Kwiki::Formatter... to name a few.
And there are also text processing frameworks that have a filter mechanism: Template-Toolkit, Web::Scraper, Plagger, Kwiki, Test::Base.
For instance, TT has a gazillion Template::Filter plugins on CPAN that are only useful for TT. If you want to use one of those text filters in another text processing system (e.g. Web::Scraper, Kwiki, Plagger etc.), you need to port it, or in other words, write an adapter interface for each individual text filter engine.
Doesn't this suck?
I want a common text filter API that takes a string as input and returns a string as output. For complex filters like a wiki-to-text engine, it would probably be better to have a configuration option as well.
use Text::Filter::Common;
my $filter = Text::Filter::Common->new($name, $config);
my $output = $filter->filter($input, $option);
So Text::Filter::Common is a factory module, and each text filter is a subclass of Text::Filter::Common::Base or something like that, implementing a filter method that probably reads $self->config to configure the filter object.
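A rough sketch of what that factory and base class could look like, under the assumption that every filter lives in a package under Text::Filter::Common::. None of these modules exist; the Trim filter and its collapse option are invented purely for illustration, and a real factory would presumably require the module file rather than rely on packages defined in the same script.

```perl
use strict;
use warnings;

# Base class: stores the configuration and defines the filter() contract.
package Text::Filter::Common::Base {
    sub new {
        my ($class, $config) = @_;
        return bless { config => $config || {} }, $class;
    }
    sub config { return $_[0]{config} }
    sub filter { die ref($_[0]) . " must implement filter()\n" }
}

# Hypothetical example filter: trims whitespace, and collapses internal
# runs of whitespace when the 'collapse' config key is true.
package Text::Filter::Common::Trim {
    use parent -norequire, 'Text::Filter::Common::Base';
    sub filter {
        my ($self, $input) = @_;
        (my $output = $input) =~ s/^\s+|\s+$//g;
        $output =~ s/\s+/ /g if $self->config->{collapse};
        return $output;
    }
}

# Factory: maps a short name to a class under Text::Filter::Common::.
package Text::Filter::Common {
    sub new {
        my ($class, $name, $config) = @_;
        my $impl = "Text::Filter::Common::$name";
        die "Unknown filter: $name\n" unless $impl->can('filter');
        return $impl->new($config);
    }
}

my $filter = Text::Filter::Common->new('Trim', { collapse => 1 });
print $filter->filter("  foo   bar  "), "\n";   # prints "foo bar"
```

Keeping the contract to "string in, string out" is what would let the same filter object be reused from different frameworks.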
Then we can write an adapter interface for existing text filter mechanisms like Web::Scraper or Template::Toolkit, and avoid the duplicated effort of re-porting one text filter to a bunch of different modules.
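One possible shape for such an adapter, sketched under the assumption that the host framework accepts a plain code reference as a filter. The as_callback function and the My::Common::Rot13 class are hypothetical; the callback reads its input from the first argument if one is given and from $_ otherwise, so it tolerates either calling convention.

```perl
use strict;
use warnings;

# Hypothetical common filter: any object with a filter($input) method.
package My::Common::Rot13 {
    sub new { return bless {}, shift }
    sub filter {
        my ($self, $input) = @_;
        (my $output = $input) =~ tr/A-Za-z/N-ZA-Mn-za-m/;
        return $output;
    }
}

# Adapter: wrap a common filter object in a code reference that takes
# a string (as an argument, or in $_) and returns the filtered string.
sub as_callback {
    my ($filter) = @_;
    return sub {
        my $input = @_ ? $_[0] : $_;
        return $filter->filter($input);
    };
}

my $rot13 = as_callback(My::Common::Rot13->new);
print $rot13->("Hello"), "\n";   # prints "Uryyb"
```

A code reference like this could go wherever a framework takes inline callbacks, so each framework would need one small adapter instead of one port per filter.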
It looks like the Text::Filter namespace is already taken. It seems close to what I want, but it supports both reading and writing, and that's more than I need.
Thoughts?
Text::Pipe? (Score:1)
And I wouldn't put the factory in a
Several pipe segments, piped together, could themselves be pipe segments.
Text::Pipe::*
Re: (Score:2)
filter interface, you don't need to. Creating a stacked pipe is easy by creating a new Pipe stacker object, like:
my $stacked_pipe = Text::Pipe::Stackable->new($pipe1, $pipe2, $pipe3);
Re: (Score:1)
Yes, that's a better design pattern. In that case, Text::Pipe::Stackable->new() should be able to take both individual segments as well as Text::Pipe::Stackable objects as well (for a kind of recursive construction).
That is, stacked pipes should - to the user - be indistinguishable from individual pipe segments. It's just some black hole that has an i
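The recursive construction described in this comment can be sketched as follows. The My::Pipe::Segment and My::Pipe::Stackable classes are hypothetical stand-ins, not the API of any existing module; the only assumption is that everything in a stack, including another stack, offers the same filter method.

```perl
use strict;
use warnings;

# Hypothetical single segment wrapping a code reference.
package My::Pipe::Segment {
    sub new    { my ($class, $code) = @_; return bless { code => $code }, $class }
    sub filter { my ($self, $input) = @_; return $self->{code}->($input) }
}

# Hypothetical stack: holds anything with a filter() method, including
# other stacks, and exposes the same filter() method itself.
package My::Pipe::Stackable {
    sub new {
        my ($class, @pipes) = @_;
        return bless { pipes => [@pipes] }, $class;
    }
    sub filter {
        my ($self, $input) = @_;
        $input = $_->filter($input) for @{ $self->{pipes} };
        return $input;
    }
}

my $trim  = My::Pipe::Segment->new(sub { my $s = $_[0]; $s =~ s/^\s+|\s+$//g; return $s });
my $upper = My::Pipe::Segment->new(sub { return uc $_[0] });
my $bang  = My::Pipe::Segment->new(sub { return $_[0] . "!" });

# A stack used as one segment of another stack.
my $inner = My::Pipe::Stackable->new($trim, $upper);
my $outer = My::Pipe::Stackable->new($inner, $bang);
print $outer->filter("  hi  "), "\n";   # prints "HI!"
```

Because a stack and a segment answer to the same method, the caller never needs to know which one it was handed.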
Re: (Score:2)
You mean like Formatter (Score:1)
I think it's a good idea (Score:1)
I also think there might be some benefit to providing some more pluggable basic formatters, like HTML or other text markup, where other formatters can hook into the appropriate place. I guess Web::Scraper is already like that in a way.