Stuff with the Perl Foundation. A couple of patches in the Perl core. A few CPAN modules. That about sums it up.
The problem: take the HTML, strip everything that is not text from between the body tags, and truncate the body at 1000 characters, regardless of whether I'm chopping a word in half.
use HTML::TokeParser::Simple;

sub compress_body {
    my $html   = shift;
    my $parser = HTML::TokeParser::Simple->new(\$html);
    my ($header, $body, $footer) = ('', '', '');
    my $curr_output = \$header;
    while ( my $token = $parser->get_token ) {
        # the closing body tag belongs to the footer, so switch before appending
        $curr_output = \$footer if $token->is_end_tag('body');
        # skip non-text in body
        if ( $curr_output == \$body ) {
            next unless $token->is_text;
        }
        $$curr_output .= $token->as_is;
        # the opening body tag belongs to the header, so switch after appending
        $curr_output = \$body if $token->is_start_tag('body');
    }
    return join "\n" => $header, substr($body, 0, 1000), $footer;
}
Can you think of anything simpler? I think the above code is just flat out ugly.
Update: cleaned it up just a tad even though the logic is the same. I also hate how things depend on the order of the statements inside the loop.
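One way to make the order dependence explicit is to track a named state instead of comparing references. What follows is only a sketch under some assumptions, not the code from this post: the name compress_body_stateful, the $limit parameter, and the %part hash are made up for illustration. Unlike the original, it never switches back to the body once the first closing body tag has been seen.

```perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Hypothetical variant of compress_body that uses an explicit state name.
sub compress_body_stateful {
    my ( $html, $limit ) = @_;
    $limit = 1000 unless defined $limit;    # assumed default, as in the post

    my $parser = HTML::TokeParser::Simple->new( \$html );
    my %part   = ( header => '', body => '', footer => '' );
    my $state  = 'header';

    while ( my $token = $parser->get_token ) {
        if ( $state eq 'header' ) {
            # keep everything up to and including the opening body tag
            $part{header} .= $token->as_is;
            $state = 'body' if $token->is_start_tag('body');
        }
        elsif ( $state eq 'body' && $token->is_end_tag('body') ) {
            # the closing body tag is the first piece of the footer
            $state = 'footer';
            $part{footer} .= $token->as_is;
        }
        elsif ( $state eq 'body' ) {
            # inside the body, keep text tokens only
            $part{body} .= $token->as_is if $token->is_text;
        }
        else {
            $part{footer} .= $token->as_is;
        }
    }

    return join "\n", $part{header}, substr( $part{body}, 0, $limit ), $part{footer};
}
```

Each branch now says which part a token belongs to, so the statements no longer have to appear in a particular order around the append.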
return when the body = 1000? (Score:2)
abstraction is always nice (Score:1)
my $compressed_body = compress_body($html);
My attempt at humor is meant to indicate the upside here:
At least your annoying code is in a routine, so you have made the functionality less annoying for everything that uses it.
use tt2 (Score:1)
Re: (Score:1)
I’m not sure I understand what your code is doing, so I’ll code to your specification instead…
Re: (Score:1)
That is, minus the $in_body I accidentally left in there.
Oh, and my own preferred solution would be to load the thing with libxml’s HTML parsing mode, then do $dom->findvalue( 'substring( /html/body, 1, 1000 )' ). XPath rocks.
Re: (Score:2)
The problem is, it gets passed an entire HTML document and has to preserve everything up to the first body tag (inclusive) and after the last body tag (also inclusive).
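To illustrate the three-way split described here (and only the split, not the tag stripping), a rough sketch with plain string operations might look like the following. split_at_body is a hypothetical helper, and a regex is a fragile stand-in for a real parser: a body tag inside a comment or a script would confuse it.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (header, body, footer), where the header ends
# with the first opening body tag and the footer starts with the last closing one.
sub split_at_body {
    my $html = shift;

    # first opening body tag, with any attributes, case-insensitive
    return unless $html =~ /<body\b[^>]*>/i;
    my $body_start = $+[0];    # offset just past the opening tag

    # walk over every closing body tag and remember where the last one starts
    my $body_end;
    while ( $html =~ /<\/body\s*>/gi ) {
        $body_end = $-[0];
    }
    return unless defined $body_end && $body_end >= $body_start;

    return (
        substr( $html, 0, $body_start ),
        substr( $html, $body_start, $body_end - $body_start ),
        substr( $html, $body_end ),
    );
}
```

It returns an empty list when no usable pair of body tags is found, so a caller can fall back to passing the document through untouched.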
Re: (Score:1)
Ah, now suddenly all your contortions make sense.
Re: (Score:1)
Hmm, that has a subtle bug. There has to be a last if not $token; before undef $in_body;.