Ovid (2709)

Stuff with the Perl Foundation. A couple of patches in the Perl core. A few CPAN modules. That about sums it up.

Journal of Ovid (2709)

Tuesday October 25, 2005
03:33 PM

Annoying problems get annoying solutions

[ #27311 ]

The problem: take an HTML document, strip everything that is not text from between the body tags, and truncate the body text at 1000 characters, regardless of whether that chops a word in half.

use HTML::TokeParser::Simple;

sub compress_body {
    my $html   = shift;
    my $parser = HTML::TokeParser::Simple->new(\$html);

    my ($header, $body, $footer) = ('', '', '');

    # points at whichever section is currently being filled
    my $curr_output = \$header;

    while ( my $token = $parser->get_token ) {
        # switch before appending, so </body> itself lands in the footer
        $curr_output = \$footer if $token->is_end_tag('body');

        # skip non-text in body
        if ( $curr_output eq \$body ) {
            next unless $token->is_text;
        }
        $$curr_output .= $token->as_is;

        # switch after appending, so <body> itself lands in the header
        $curr_output = \$body if $token->is_start_tag('body');
    }

    return join "\n" => $header, substr($body, 0, 1000), $footer;
}

Can you think of anything simpler? I think the above code is just flat out ugly.

Update: cleaned it up just a tad, even though the logic is the same. I also hate how everything depends on the order of the statements.

The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • No need to gather a larger body, only to truncate it...
    • Randal L. Schwartz
    • Stonehenge
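
One way to read the suggestion above is to stop appending body text once the limit is reached, rather than collecting everything and calling substr at the end. The following is only a minimal sketch of that idea, assuming HTML::TokeParser::Simple is installed; the name compress_body_early and the $limit parameter are hypothetical and do not come from the original post.

```perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Hypothetical variant of compress_body: keep at most $limit characters
# of body text while parsing, instead of truncating afterwards.
sub compress_body_early {
    my ( $html, $limit ) = @_;
    $limit = 1000 unless defined $limit;

    my $parser = HTML::TokeParser::Simple->new( \$html );
    my ( $header, $body, $footer ) = ( '', '', '' );
    my $state = 'header';    # header -> body -> footer

    while ( my $token = $parser->get_token ) {
        # Switch first, so </body> itself is written to the footer.
        $state = 'footer' if $token->is_end_tag('body');

        if ( $state eq 'body' ) {
            # Skip non-text tokens, and any text beyond the limit.
            next unless $token->is_text;
            my $room = $limit - length $body;
            $body .= substr( $token->as_is, 0, $room ) if $room > 0;
            next;
        }

        if ( $state eq 'header' ) { $header .= $token->as_is }
        else                      { $footer .= $token->as_is }

        # Switch last, so <body> itself is written to the header.
        $state = 'body' if $token->is_start_tag('body');
    }

    return join "\n" => $header, $body, $footer;
}
```

The loop still has to walk the remaining tokens to collect the footer, so the saving is in memory held for the body, not in parsing work.
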
  • How about: my $compressed_body = compress_body($html);

    My attempt at humor is meant to indicate the upside here:

    At least your annoying code is in a routine, so you have made the functionality less annoying for everything that uses it.

  • [% FILTER truncate(1000) %] really long text from example... [% END %]
  • I’m not sure I understand what your code is doing, so I’ll code to your specification instead…

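    sub compress_body {
        my ( $html ) = @_;
        my $parser = HTML::TokeParser::Simple->new( \$html );
        my $text = '';
        my $in_body;

        # discard everything up to and including the opening body tag
        1 while $_ = $parser->get_token and not $_->is_start_tag( 'body' );

        while ( my $token = $parser->get_token ) {
            if( $token->is_text ) {
                # remainder assumed from the stated spec: keep text only
                $text .= $token->as_is;
            }
            last if $token->is_end_tag( 'body' );
        }

        return substr $text, 0, 1000;
    }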

    • That is, minus the $in_body I accidentally left in there.

      Oh, and my own preferred solution would be to load the thing with libxml’s HTML parsing mode, then do $dom->findvalue( 'substring( /html/body, 0, 1000 )' ). XPath rocks.
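
The libxml approach mentioned above might look roughly like the following. This is a sketch only, assuming XML::LibXML is installed and using its load_html and findvalue methods; the function name body_text_prefix and the $limit parameter are hypothetical. XPath string positions are 1-based, so the sketch starts at position 1 to get a full $limit characters. Like the comment it follows, it returns only the body text and does not preserve the markup before and after the body.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper: return the first $limit characters of the
# text content of <body>, using libxml's HTML parser and XPath.
sub body_text_prefix {
    my ( $html, $limit ) = @_;
    $limit = 1000 unless defined $limit;
    my $n = int $limit;    # make sure only a number is interpolated

    # recover => 2 asks libxml2 to tolerate sloppy HTML silently.
    my $dom = XML::LibXML->load_html( string => $html, recover => 2 );

    # The string-value of /html/body is the concatenation of its text nodes.
    return $dom->findvalue("substring( /html/body, 1, $n )");
}
```
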

    • The problem is, it gets passed an entire HTML document and has to preserve everything up to the first body tag (inclusive) and after the last body tag (also inclusive).

      • Ah, now suddenly all your contortions make sense.

        sub compress_body {
            my ( $html ) = @_;
            my $parser = HTML::TokeParser::Simple->new(\$html);
            my $out = '';
            my $body = '';

            while ( my $token = $parser->get_token ) {
                if( $token->is_start_tag('body') .. $token->is_end_tag('body') ) {
                    $out .= $token->as_is if $token->is_tag( 'body' );