All the Perl that's Practical to Extract and Report

Ovid (2709)

http://publius-ovidius.livejournal.com/
AOL IM: ovidperl

Stuff with the Perl Foundation. A couple of patches in the Perl core. A few CPAN modules. That about sums it up.

Journal of Ovid (2709)

Tuesday July 01, 2003
02:19 PM

Just try this in Java, I dare you ...

[ #13182 ]

The problem: he had a ton of HTML that was generated by MS Excel. As a result, all of the HTML was upper-case and it had MS proprietary style information embedded in the tags. He needed to clean this up, fast. He didn't have Perl on his box, but five minutes later, he accessed a URL that pointed to this script that I wrote. Paste the HTML into the textarea, click submit, and it's instantly cleaned.

#!/usr/bin/perl -T
use strict;
use warnings;
use HTML::TokeParser::Simple 2.1;
use CGI qw(:standard);

# If HTML was submitted, clean it and store the result back in the
# 'html' parameter so the textarea below is redisplayed with it.
my $new_html = param('html') ? clean_html() : '';
param(-name => 'html', -value => $new_html);

print header,
    start_html('Clean html'),
    start_form,
    textarea('html', '', 10, 50 ),
    submit,
    end_form,
    end_html;

# Drop the style attribute from every start tag, then rewrite each tag
# (lower-cased names, double-quoted attribute values).
sub clean_html {
    my $new_html = '';
    my $html     = param('html');
    my $parser   = HTML::TokeParser::Simple->new(\$html);
    while (my $token = $parser->get_token) {
        $token->delete_attr('style') if $token->is_start_tag;
        $token->rewrite_tag;
        $new_html .= $token->as_is;
    }
    return $new_html;
}
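
The same token loop can also be tried from the command line, without a web server. What follows is a minimal sketch rather than part of the script above: the strip_style helper name and the sample Excel-style markup are made up for illustration, and it assumes HTML::TokeParser::Simple is installed from CPAN.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Hypothetical helper (not in the original script): the same loop as
# clean_html(), but taking the HTML as an argument instead of reading
# it from a CGI parameter.
sub strip_style {
    my ($html) = @_;
    my $clean  = '';
    my $parser = HTML::TokeParser::Simple->new(\$html);
    while ( my $token = $parser->get_token ) {
        # Only start tags carry attributes, so only they need the delete.
        $token->delete_attr('style') if $token->is_start_tag;
        $token->rewrite_tag;
        $clean .= $token->as_is;
    }
    return $clean;
}

# Made-up sample of the kind of markup Excel emits.
my $sample = '<TD STYLE="mso-number-format:General" CLASS=xl24>Total</TD>';
print strip_style($sample), "\n";
```

The tag and attribute names should come back lower-cased with the style attribute gone, while the text between the tags passes through untouched.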

  • Hmmm, it is a neat script, and I often find that it does not take many lines to do something in Perl.

    But is the comparison to Java completely fair? Your script does not tell the whole story, since HTML::TokeParser::Simple (which does most of the job) is not shown.

    How would a similar program look in Java if a class of similar functionality existed in Java?

    ovid.Simple.HTML.TokeParser

    That would, of course, mean that we would also have to examine the modules and classes used by either of these.
    • I also think this is a very elegant little script.

      Given that the problem is to Get the Job Done(tm), I'd say it's relevant whether HTML::TokeParser functionality is available in Java.

      Some or all of the functionality may be, but probably in a slightly more verbose version. And probably not as easily found, installed and used. As always, having done similar things before is an important factor (me, I wouldn't have the experience to look at HTML::TokeParser for this problem, but I might find it anyway).

      BTW,