  • I’m not sure I understand what your code is doing, so I’ll code to your specification instead…

    sub compress_body {
        my ( $html ) = @_;
        my $parser = HTML::TokeParser::Simple->new( \$html );
        my $text = '';
        my $in_body;

        1 while $_ = $parser->get_token and not $_->is_start_tag( 'body' );

        while ( my $token = $parser->get_token ) {
            if( $token->is_text ) {
         

    • The problem is, it gets passed an entire HTML document and has to preserve everything up to the first body tag (inclusive) and after the last body tag (also inclusive).
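
      To pin that requirement down, here is a minimal sketch of the split it implies. It deliberately uses plain regexes on the raw string rather than a parser, purely to show the expected shape of the three pieces. `split_around_body` is a made-up helper name, and the sketch assumes the body tags never appear inside comments, scripts or attribute values.

      ```perl
      use strict;
      use warnings;

      # Hypothetical helper, for illustration only: split a document into
      # (head, body, tail). head ends with the first <body ...> tag and tail
      # starts at the last </body>. Assumes neither tag shows up inside
      # comments, scripts or attribute values.
      sub split_around_body {
          my ( $html ) = @_;

          # No opening tag: treat the whole document as head.
          return ( $html, '', '' ) unless $html =~ /<body\b[^>]*>/i;
          my $start = $+[0];    # offset just past the first <body ...>

          # Greedy .* backtracks from the end, so this lands on the last </body>.
          my $rest = substr $html, $start;
          my $len  = $rest =~ /.*(<\/body\s*>)/is ? $-[1] : length $rest;

          return (
              substr( $html, 0, $start ),
              substr( $rest, 0, $len ),
              substr( $rest, $len ),
          );
      }

      my ( $head, $body, $tail ) = split_around_body(
          '<html><head><title>t</title></head><body class="x"><p>Hi</p></body></html>'
      );
      print "head: $head\nbody: $body\ntail: $tail\n";
      ```

      Only the middle piece would be compressed; head and tail are passed through untouched.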

      • Ah, now suddenly all your contortions make sense.

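        use HTML::TokeParser::Simple;

        sub compress_body {
            my ( $html ) = @_;
            my $parser = HTML::TokeParser::Simple->new(\$html);
            my $out = '';
            my $body = '';

            while ( my $token = $parser->get_token ) {
                # Flip-flop: true from <body> through </body>, inclusive.
                # Its state persists between calls to this sub.
                if( $token->is_start_tag('body') .. $token->is_end_tag('body') ) {
                    if( $token->is_end_tag('body') ) {
                        # Emit the truncated text before the closing tag.
                        $out .= substr $body, 0, 1000;
                        $body = '';
                    }
                    $out .= $token->as_is if $token->is_tag( 'body' );
                    $body .= $token->as_is if $token->is_text;
                }
                else {
                    $out .= $token->as_is;
                }
            }

            # Covers a document that ends without </body>.
            return $out . substr $body, 0, 1000;
        }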

        A bit more repetition than I’d like. Maybe

        sub compress_body {
            my ( $html ) = @_;
            my $parser = HTML::TokeParser::Simple->new(\$html);
            my $out = '';
            my $in_body;

            while ( my $token = $parser->get_token ) {
                if( $in_body ) {
                    my $body = '';
                    {
                        last if $token->is_end_tag( 'body' );
                        $body .= $token->as_is if $token->is_text;
                        redo if $token = $parser->get_token;
                    }
                    $out .= substr $body, 0, 1000;
                    last if not $token;    # input ended without </body>
                }

                $in_body = $token->is_start_tag( 'body' );
                $out .= $token->as_is;
            }

            return $out;
        }

        Better… in a way… I think.
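
        Both versions lean on idioms that are easy to misread: the scalar-context `..` flip-flop in the first, and `redo` inside a bare block in the second. A minimal sketch of each, with plain strings standing in for parser tokens (the token list is made up for illustration and no HTML parsing happens here):

        ```perl
        use strict;
        use warnings;

        # Stand-in tokens; a real run would get these from the parser.
        my @tokens = ( '<html>', '<body>', 'a', '<b>', 'b', '</body>', '</html>' );

        # 1. Flip-flop: false until the left test matches, then true up to and
        #    including the token where the right test matches.
        my @in_range;
        for my $t ( @tokens ) {
            push @in_range, $t if ( $t eq '<body>' ) .. ( $t eq '</body>' );
        }
        print "flip-flop saw: @in_range\n";
        # prints: flip-flop saw: <body> a <b> b </body>

        # 2. Bare block with redo: a bare block counts as a loop that runs once,
        #    so redo restarts it and last leaves it, without touching the while.
        my @queue = @tokens;
        my @consumed;
        while ( defined( my $t = shift @queue ) ) {
            next unless $t eq '<body>';
            {
                $t = shift @queue;
                last if !defined $t or $t eq '</body>';
                push @consumed, $t;
                redo;
            }
        }
        print "redo loop consumed: @consumed\n";
        # prints: redo loop consumed: a <b> b
        ```

        Note that the flip-flop hands the closing token to the in-range branch as well, which is why anything that must come before `</body>` has to be emitted inside that branch.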