
All the Perl that's Practical to Extract and Report

  • No need to gather a larger body, only to truncate it...
    • Randal L. Schwartz
    • Stonehenge
  • How about: my $compressed_body = compress_body($html);

    My attempt at humor is meant to indicate the upside here:

    At least your annoying code is in a routine, so you have made the functionality less annoying for everything that uses it.

  • [% FILTER truncate(1000) %] really long text from example... [% END %]
  • I’m not sure I understand what your code is doing, so I’ll code to your specification instead…

    sub compress_body {
        my ( $html ) = @_;
        my $parser = HTML::TokeParser::Simple->new( \$html );
        my $text = '';
        my $in_body;

        1 while $_ = $parser->get_token and not $_->is_start_tag( 'body' );

        while ( my $token = $parser->get_token ) {
            if( $token->is_text ) {

    • That is, minus the $in_body I accidentally left in there.

      Oh, and my own preferred solution would be to load the thing with libxml’s HTML parsing mode, then do $dom->findvalue( 'substring( /html/body, 1, 1000 )' ). XPath rocks.
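For anyone who wants to try the libxml route, here is a minimal sketch of it, assuming XML::LibXML is installed. The helper name body_text_prefix and its $limit parameter are made up for illustration and are not from the thread. One detail to watch: XPath string positions start at 1, so the sketch passes 1 as the start; starting at 0 would yield only 999 characters for a length of 1000.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper (name and signature are assumptions, not from the
# thread): return the first $limit characters of the text inside <body>.
sub body_text_prefix {
    my ( $html, $limit ) = @_;
    $limit = 1000 unless defined $limit;

    # Parse with libxml's HTML mode; recover => 2 tolerates broken markup
    # without printing parser warnings.
    my $dom = XML::LibXML->load_html(
        string  => $html,
        recover => 2,
    );

    # substring() is 1-based in XPath, so start at position 1.
    my $n = int $limit;
    return $dom->findvalue("substring( /html/body, 1, $n )");
}
```

Note that this returns only the text content of the body, with all markup dropped, so it answers the "first 1000 characters of text" reading of the problem and nothing more.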

    • The problem is that it gets passed an entire HTML document and has to preserve everything up to and including the first opening body tag, and everything from the last closing body tag onward.
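To make that requirement concrete, here is a rough sketch in core Perl only. It is not anyone's actual code from this thread: the name compress_body_sketch and the $limit parameter are assumptions for illustration. It splits the document with a regex at the first opening body tag and the last closing body tag, then truncates what lies between with substr.

```perl
use strict;
use warnings;

# Hypothetical helper: keep everything through the first <body ...> tag,
# keep everything from the last </body> tag onward, and cut the part in
# between down to at most $limit characters.
sub compress_body_sketch {
    my ( $html, $limit ) = @_;
    $limit = 1000 unless defined $limit;

    # The non-greedy first group stops at the first opening body tag; the
    # greedy middle group backtracks to the last closing body tag.
    # Assumption: no ">" appears inside the body tag's attribute values.
    return $html
        unless $html =~ m{\A (.*? <body\b [^>]* >) (.*) (</body \s* > .*) \z}xsi;

    my ( $head, $body, $tail ) = ( $1, $2, $3 );
    return $head . substr( $body, 0, $limit ) . $tail;
}
```

Because this truncates raw markup, it can cut a tag or entity in half. A token-based parser such as HTML::TokeParser::Simple, which the other replies here use, is the safer choice when the truncated body still has to be valid HTML.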

      • Ah, now suddenly all your contortions make sense.

        sub compress_body {
            my ( $html ) = @_;
            my $parser = HTML::TokeParser::Simple->new(\$html);
            my $out = '';
            my $body = '';

            while ( my $token = $parser->get_token ) {
                if( $token->is_start_tag('body') .. $token->is_end_tag('body') ) {
                    $out .= $token->as_is if $token->is_tag( 'body' );