use Perl
Web::Scraper with filters, and thoughts about Text filters
A developer release of Web::Scraper has been pushed to CPAN with "filters" support. Let me explain for a bit how this filters stuff is useful.
Since an early version, Web::Scraper has had a callback mechanism, which is pretty neat: you can extract "data" out of HTML, not just strings.
For instance, if you have HTML like
    <span class="entry-date">2007-10-04T01:09:44-08:00</span>
you can get the DateTime object that the string represents, like:
    process ".entry-date", "date" => sub {
        DateTime::Format::W3CDTF->new->parse_datetime(shift->as_text);
    };
and with 'filters' you can make this reusable and stackable, like:
    package Web::Scraper::Filter::W3CDTFDate;
    use base qw( Web::Scraper::Filter );
    use DateTime::Format::W3CDTF;
    sub filter {
        my ($self, $value) = @_;
        DateTime::Format::W3CDTF->new->parse_datetime($value);
    }
    1;
and then:
    process ".entry-date", date => [ 'TEXT', 'W3CDTFDate' ];
If the .entry-date text contains erroneous spaces, you can do:
    process ".entry-date", date => [ 'TEXT', sub { s/^ *| *$//g }, 'W3CDTFDate' ];
This shows how powerful the Web::Scraper filter mechanism could be. It's stackable, extensible, reusable (by making it a module) and also scriptable with inline callbacks.
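To make the "stackable" part concrete, here is a minimal sketch in plain Perl of running a value through a stack of filters, where each filter is either a name or an inline callback. It is not Web::Scraper's actual implementation: apply_filters, the %named_filter registry and the YMD filter are hypothetical names made up for illustration, and unlike Web::Scraper's inline callbacks, the callbacks here must return the new value explicitly.

```perl
use strict;
use warnings;

# Hypothetical registry of named filters. Web::Scraper resolves names to
# Web::Scraper::Filter::* classes instead; a hash keeps the sketch short.
my %named_filter = (
    # Turns "2007-10-04" into a hash of numbers: data, not just a string.
    YMD => sub {
        my ($y, $m, $d) = $_[0] =~ /^(\d{4})-(\d{2})-(\d{2})$/
            or die "Not a date: $_[0]";
        return { year => $y + 0, month => $m + 0, day => $d + 0 };
    },
);

# Run a value through the filters left to right; the output of one
# filter is the input of the next.
sub apply_filters {
    my ($value, @filters) = @_;
    for my $filter (@filters) {
        my $code = ref($filter) eq 'CODE' ? $filter : $named_filter{$filter}
            or die "Unknown filter: $filter";
        $value = $code->($value);
    }
    return $value;
}

# An inline callback stacked in front of a named filter.
my $trim = sub { my $s = shift; $s =~ s/^\s+|\s+$//g; $s };
my $out  = apply_filters("  2007-10-04  ", $trim, 'YMD');
```

Because every filter only sees the previous filter's output, the trim step can be dropped in front of any other filter without that filter knowing about it.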
So the next step would be to add a bunch of Web::Scraper::Filter::* modules. I think I'll create a separate distribution, Web::Scraper::Filters, and give everyone commit access so you can add your own text filters and share them.
However I have another, more ideal solution in my mind.
The problem is: there are already lots of text filters on CPAN. URI::Escape, HTML::Entities, MIME::Base64, Crypt::CBC, LOLCatz, Kwiki::Formatter... to name a few.
And there are also text processing frameworks that have a filter mechanism: Template-Toolkit, Web::Scraper, Plagger, Kwiki, Test::Base... to name a few. Obviously the number of combinations of text filter engines and text processing systems explodes.
For instance, TT has a gazillion Template::Filter plugins on CPAN that are only useful for TT. If you want to use one of those text filters in another text processing system (e.g. Web::Scraper, Kwiki, Plagger etc.), you need to port it, or in other words, write an adapter interface for each individual text filter engine.
Doesn't this suck?
I want a common Text filter API that takes its input as a string and returns its output as a string too. For complex filters like a wiki-to-text engine, it should probably accept a configuration option as well.
    use Text::Filter::Common;
    my $filter = Text::Filter::Common->new($name, $config);
    my $output = $filter->filter($input, $option);
So Text::Filter::Common is a factory module, and each text filter is a subclass of Text::Filter::Common::Base or something and implements a filter function that probably takes $self->config to configure the filter object.
Then we can write an adapter interface for existing text filter mechanisms like Web::Scraper or Template-Toolkit, and avoid the duplicated effort of re-porting one text filter to a bunch of different modules.
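As a rough sketch of how that factory and base class could fit together, assuming nothing beyond core Perl: Text::Filter::Common and Text::Filter::Common::Base are the names proposed above, but neither exists as a real module here, and the Trim filter, its collapse option and the can() lookup are hypothetical details added for illustration. The $option argument from the proposed filter() call is left out.

```perl
use strict;
use warnings;

# Base class: stores the config and defines the filter() contract.
package Text::Filter::Common::Base;
sub new {
    my ($class, $config) = @_;
    return bless { config => $config || {} }, $class;
}
sub config { $_[0]{config} }
sub filter { die ref(shift) . " must implement filter()" }

# A hypothetical concrete filter: string in, string out.
package Text::Filter::Common::Trim;
our @ISA = ('Text::Filter::Common::Base');
sub filter {
    my ($self, $input) = @_;
    (my $output = $input) =~ s/^\s+|\s+$//g;
    # Assumed config option: also squeeze runs of inner whitespace.
    $output =~ s/\s+/ /g if $self->config->{collapse};
    return $output;
}

# Factory: maps a short name to a filter class and instantiates it.
package Text::Filter::Common;
sub new {
    my ($class, $name, $config) = @_;
    my $filter_class = "Text::Filter::Common::$name";
    $filter_class->can('filter')
        or die "No such filter: $name";
    return $filter_class->new($config);
}

package main;

my $filter = Text::Filter::Common->new('Trim');
my $output = $filter->filter("  2007-10-04T01:09:44-08:00 \n");

# Adapter: wrap the filter object as a plain callback, the shape an
# inline Web::Scraper-style filter or similar hook could accept.
my $callback = sub { $filter->filter($_[0]) };
```

With this shape, each text processing system would need only one adapter around filter(), rather than one port per filter.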
Looks like the Text::Filter namespace is taken. It seems close to what I want, but it supports both reading and writing, and that's more than I want.
Thoughts?

Text::Pipe?
(Score:1)( http://hanekomu.vox.com/ | Last Journal: 2007.09.16 9:19 )
And I wouldn't put the factory in a
Several pipe segments, piped together, could themselves be pipe segments.
Text::Pipe::* objects could have '|' overloaded so you can combine them in a TT-like syntax.
Then again, maybe you don't want to limit yourself to piping text; how about arbitrary data structures? E.g., one pipe segment could take an array and reduce(). But maybe that's going too far. (I've had such an idea many years ago but didn't follow up on it.) Piping text is fine.
Your example regex that deals with erroneous spaces would itself be a pipe segment, something like Text::Pipe::Trim.
You mean like Formatter
(Score:1)( http://chris.prather.org/ | Last Journal: 2007.08.13 23:28 )
I think it's a good idea
(Score:1)( http://thefeed.no/marcus/ | Last Journal: 2006.04.06 18:03 )
I also think there might be some benefit to providing some more pluggable basic formatters, like HTML or other text markup, where other formatters can hook into the appropriate place. I guess Web::Scraper is already like that in a way.