davorg
  dave@dave.org.uk
http://dave.org.uk/
Hacker, author, trainer

Journal of davorg (18)

Monday September 16, 2002
09:05 AM

Perl vs Java

[ #7752 ]

You may have seen the article "Can Java technology beat Perl on its home turf with pattern matching in large files?", which has been debated on both #perl and comp.lang.perl.misc today.

One of the biggest criticisms of the article was that the author hasn't published the Perl code that he is comparing his Java with.

I emailed the author (found his email address through a Google search) and pointed out the unfairness of this comparison. Within half an hour I got a reply from him including the Perl code.

So here it is. Feel free to optimise it.

#!/home/hoffie/bin/perl
@sunIPs=("192\\.9\\.","192\\.18\\.","192\\.29\\.");
@fileext=("\\.gif","\\.jpg","\\.css","\\.GIF","\\.JPG","\\.CSS");
$filename="$ARGV[0]";
open(IN,$filename) || die "cannot open $ARGV[0] for reading: $!";
open(OUT,">$filename.out") || die "cannot open $filename.out for writing: $!";
LINE: while(<IN>) {
     foreach $fileext (@fileext) {
         next LINE if ($_ =~ /$fileext HTTP/);
     }
     foreach $sunIP (@sunIPs) {
         next LINE if ($_ =~ /^$sunIP/);
     }
     print OUT;
}

  •      foreach $fileext (@fileext) {
             next LINE if ($_ =~ /$fileext HTTP/);
         }
         foreach $sunIP (@sunIPs) {
             next LINE if ($_ =~ /^$sunIP/);
         }
    Yeah, it's almost always possible to beat bad Perl written by people who don't understand that regexes need to be compiled.
    --
    • Randal L. Schwartz
    • Stonehenge
  • Except that he's not really pattern matching. He's using Java's index-like method. And he's "unrolled" his loops within the read-loop.

    His Perl is idiomatic (except for the spurious =~'s) and looks just like any novice would have written it.

    If I had enough data I might take a crack at unrolling it and making it quicker. Like any "benchmark" though, the code can *always* be manipulated to favor one over the other.
  • Simple first pass at it: using qr//:

    #!/home/hoffie/bin/perl
    @sunIPs=("192\\.9\\.","192\\.18\\.","192\\.29\\.");
    @fileext=("\\.gif","\\.jpg","\\.css","\\.GIF","\\.JPG","\\.CSS");
    $filename="$ARGV[0]";
    open(IN,$filename) || die "cannot open $ARGV[0] for reading: $!";
    open(OUT,">$filename.out") || die "cannot open $filename.out for writing: $!";

    # compile once
    $fileext = join '|', @fileext;
    $fileext = qr/(?:$fileext) HTTP/;
    $sunIPs = join '|', @sunIPs;
    $sunIPs = qr/^(?:$sunIPs)/;

    LINE: while(<IN>) {
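
For illustration only, here is a minimal self-contained sketch of the compile-once idea. It is not the missing remainder of the post above; the names `keep_line`, `$ip_re` and `$ext_re` are hypothetical, and the sample log lines in the comments are assumed, since the real input data was never published.

```perl
use strict;
use warnings;

# Pattern fragments taken from the original script.
my @sun_ips  = ('192\.9\.', '192\.18\.', '192\.29\.');
my @file_ext = ('\.gif', '\.jpg', '\.css', '\.GIF', '\.JPG', '\.CSS');

# Build each alternation and compile it once, before any line is read.
my $ip_re  = do { my $alt = join '|', @sun_ips;  qr/^(?:$alt)/ };
my $ext_re = do { my $alt = join '|', @file_ext; qr/(?:$alt) HTTP/ };

# Hypothetical helper: true when a log line should be written out.
sub keep_line {
    my ($line) = @_;
    return 0 if $line =~ $ip_re;    # request came from a filtered address
    return 0 if $line =~ $ext_re;   # request was for an image or stylesheet
    return 1;
}

# Typical use inside the read loop (IN and OUT opened as in the original):
#   while (my $line = <IN>) { print OUT $line if keep_line($line); }
```

The read loop then does two matches per line against patterns that were compiled before the loop started, instead of nine interpolated ones.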

  • Well, he's not using Java regexes so it's not a fair comparison. But anyway, I'd like to point out that the second edition of Mastering Regular Expressions is fantastic. It goes into great detail on the new features and relative speeds of the regular expression engines in all the languages, and is generally very cool indeed.
  • my $filename = shift;
    open(IN,$filename) || die "cannot open $filename for reading: $!";
    open(OUT,">$filename.out") || die "cannot open $filename.out for writing: $!";

    while ( <IN> ) {
        next if /\.(gif|jpg|css|GIF|JPG|CSS) HTTP/;
        next if /192\.(9|18|29)\./;
        print OUT;
    }
    Not sure about the execution speed of the regexes, but it's a damn sight easier to read.
    --
    xoa

    • Tests? Should be /^192 and it should go faster if you use /o on the regexen.

      But, yeah, much easier to read, much faster to write, and much better.

      Hmm. I think I should ask pudge to make <tt> text a different colour.
      --
        ---ict / Spoon
      • it should go faster if you use /o on the regexen.

        No it won't. The /o only applies to regexes that are based on variables, as in:

        my $pattern = '192\.(whatever)';
        if ( $foo =~ /$pattern/o )
        That's the ONLY time that /o applies.
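
A small sketch of what that means in practice, assuming two hypothetical helpers (`matches_frozen` and `matches_qr` are names made up for this example). With /o the interpolated variable is only looked at the first time the match runs, so a later call with a different pattern silently keeps using the first one; a qr// object does not have that problem.

```perl
use strict;
use warnings;

# With /o, $pattern is interpolated and compiled the first time this
# match is executed; every later call reuses that first pattern,
# whatever $pattern holds by then.
sub matches_frozen {
    my ($pattern, $text) = @_;
    return $text =~ /$pattern/o ? 1 : 0;
}

# A qr// object carries its own compiled pattern, so each call
# matches against whatever regex it was actually given.
sub matches_qr {
    my ($re, $text) = @_;
    return $text =~ $re ? 1 : 0;
}
```

If matches_frozen('^192\.9\.', '192.9.1.1') is the first call, a later matches_frozen('^10\.', '10.0.0.1') returns 0, because the match is still using the 192.9 pattern. matches_qr(qr/^10\./, '10.0.0.1') returns 1. On a literal pattern such as /192\.(9|18|29)\./ there is nothing to interpolate, so /o changes nothing.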
        --
        xoa

    • Without wishing to wave the golf stick, may I commend
      #!perl -pi.out
      $_ = '' if /(?:\.(?:gif|jpg|css|GIF|JPG|CSS)[ ]HTTP |
                         192\.(?:9|18|29)\.)/x
      to the house?
  • What underutilized lackey has enough time to worry about making a program that runs in 283 seconds BUT TAKES 5 MINUTES TO WRITE into a program that runs in 137 seconds BUT TAKES 15-30 minutes to write? If the program could be rewritten so that it runs in under 10 seconds (my attention span), THEN the extra effort *might* be worth it. This program is likely to be run from a batch job, so a hyoo-mon isn't likely to be at the terminal waiting for it to finish.

    Java cuts into my beer-drinking time.

  • Uh... (Score:4, Insightful)

    by jhi (318) <jhi@iki.fi> on 2002.09.16 10:45 (#12881)
    (As pointed out by many, already...)

    (1) The Perl code is really bad. Just replacing the "loop-over-each-line-recompiling-the-regex-each-time" by moving the loop invariant regex to the front of the while speeds things up.
    (2) Using qr speeds things up further.
    (3) Moving the sunIPs testing before the fileext testing speeds things up further.
    (4) Inlining the 192. and HTTP speeds things up. Hey, the Java code inlines those strings.

    And after all that is done, we're still comparing apples and oranges: the Java code doesn't do regular expressions. If someone has the time, they might want to ape precisely what the Java code is doing, using index() and so forth, and then measure that.
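
As a rough sketch of that last idea, the filter can be written with substr() and index() and no regexes at all. The Java source is not reproduced in this thread, so this only approximates the "index-like" matching described above; `keep_line_index` is a hypothetical name, and the strings are the ones from the Perl script with the regex escapes removed.

```perl
use strict;
use warnings;

# Plain strings instead of patterns.
my @sun_prefixes = ('192.9.', '192.18.', '192.29.');
my @ext_markers  = map { "$_ HTTP" } qw(.gif .jpg .css .GIF .JPG .CSS);

# Hypothetical helper: true when a log line should be written out.
sub keep_line_index {
    my ($line) = @_;

    # Prefix test: compare only the start of the line.
    for my $prefix (@sun_prefixes) {
        return 0 if substr($line, 0, length $prefix) eq $prefix;
    }

    # Substring test: index() returns -1 when the marker is absent.
    for my $marker (@ext_markers) {
        return 0 if index($line, $marker) >= 0;
    }

    return 1;
}
```

Timing this against the qr// version on the same input would be the closer like-for-like comparison with the Java code.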

    I hope someone will write a polite exposé of all the things that are wrong (*) with this article, and send it both to the relevant forum/editors and to the author. Mind, be polite, professional, and helpful.

    (*) Let me see...
    (a) comparing apples and oranges
    (b) the Perl code not published in the article
    (c) the Perl code is very bad
    (d) the input data not available

    I won't comment on the Java code itself, I'll leave that to people who do more Java, except to note that it inlines the filtering data, as opposed to the Perl code, which at least has it cleanly separated into variables.