All the Perl that's Practical to Extract and Report

Java vs. Perl

posted by pudge on 2002.09.16 10:15
It seems the older Perl gets, the more willing people are to believe that it sucks, without any reasonable facts. davorg writes "You may have seen the article 'Can Java technology beat Perl on its home turf with pattern matching in large files?', which has been the subject of some debate on both #perl and comp.lang.perl.misc today. One of the biggest criticisms of the article was that the author hasn't published the Perl code that he is comparing his Java with."
"I emailed the author (found his email address thru a Google search) and pointed out the unfairness of this comparison. Within half an hour I got a reply from him including the Perl code. So here it is. Feel free to optimise it."
#!/home/hoffie/bin/perl
@sunIPs=("192\\.9\\.","192\\.18\\.","192\\.29\\.");
@fileext=("\\.gif","\\.jpg","\\.css","\\.GIF","\\.JPG","\\.CSS");
$filename="$ARGV[0]";
open(IN,$filename) || die "cannot open $ARGV[0] for reading: $!";
open(OUT,">$filename.out") || die "cannot open $filename.out for writing: $!";
LINE: while(<IN>) {
    foreach $fileext (@fileext) {
        next LINE if ($_ =~ /$fileext HTTP/);
    }
    foreach $sunIP (@sunIPs) {
        next LINE if ($_ =~ /^$sunIP/);
    }
    print OUT;
}
  • Ignoring the speed issue, the java code is 100 lines long vs. 15 for the perl. Though the perl code isn't that well written it is both easier to maintain and extend than the java equivalent.

    I think it would have been less deceptive if the Java code used regular expressions making it more of a fair test.

    • I'm guessing something like this (untested) would be pretty fast:
      if (my $idx = index($_, 'HTTP')) {
          next if $file_ext{substr($_, $idx - 8, 4)};
      }
      if (substr($_, 0, 4) eq '192.') {
          my $n = substr($_, 4, 2);
          next if $n == 9 || $n == 18 || $n == 29;
      }
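
The snippet above is marked untested, and one detail is worth spelling out: index() returns -1 on a miss, which is a true value in Perl, so the result has to be compared explicitly. Below is a minimal sketch of the same fixed-string idea under a few assumptions that are not in the original: the names keep_line, %skip_ext and %skip_octet are made up for illustration, the sample lines are invented, and only the first " HTTP" in a line is inspected.

```perl
use strict;
use warnings;

# Extensions to drop; the upper/lower-case pairs mirror the original script.
my %skip_ext = map { $_ => 1 } qw(.gif .jpg .css .GIF .JPG .CSS);

# Second octets of the 192.x. prefixes to drop (taken from @sunIPs).
my %skip_octet = map { $_ => 1 } qw(9 18 29);

# Hypothetical helper: returns 1 if the log line should be kept, 0 otherwise.
sub keep_line {
    my ($line) = @_;

    # index() returns -1 when " HTTP" is absent, so test the value explicitly.
    my $idx = index($line, ' HTTP');
    if ($idx >= 4) {
        # The four characters just before " HTTP", e.g. ".gif".
        return 0 if $skip_ext{ substr($line, $idx - 4, 4) };
    }

    if (substr($line, 0, 4) eq '192.') {
        # Find the dot that ends the second octet, then look the octet up.
        my $dot = index($line, '.', 4);
        return 0 if $dot > 4 && $skip_octet{ substr($line, 4, $dot - 4) };
    }
    return 1;
}

# Invented sample lines, only to exercise the helper.
my @sample = (
    qq{192.9.1.1 - - "GET /index.html HTTP/1.0" 200\n},
    qq{10.0.0.1 - - "GET /logo.gif HTTP/1.0" 200\n},
    qq{192.180.1.1 - - "GET /index.html HTTP/1.0" 200\n},
    qq{10.0.0.1 - - "GET /index.html HTTP/1.0" 200\n},
);
print grep { keep_line($_) } @sample;
```

For a real log, the last two statements would be replaced by a `while (my $line = <>) { print $line if keep_line($line) }` loop. Looking up the whole second octet avoids treating 192.180. as if it were 192.18.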
      • Hrmm....

        Awful coding in both examples.

        For each potential pattern, he's doing a separate check, be it with indexOf(), or with the regex.

        At least in the perl example, the patterns to be skipped are all up at the front of the program, and adding a new exclusion is just a matter of pushing to the arrays.

        [code]
        my @sunIPs = qw(192\.9\. 192\.18\. 192\.29\.);
        my @fileext = qw(\.gif \.GIF \.jpg \.JPG \.css \.CSS);
        my $filename = shift;
        open...yada..yada...
        my $pattern = join('|',@fileext) . "|" . join('|',@sunIPs);
        • Gah... silly fookin' IE. Space button submits with a tab at the wrong time... :/

          My Kingdom for an edit button...

    • Make that one line for Perl :-)

      perl -ne'/(?i:gif|jpg|css) HTTP/|/^192\.(9|18|29)\./||print' filename
      --
      /-\
      • As always, it's important to consider which tool is best for the job at hand. Perl isn't always the best.

        Results of my benchmark:

          crappie Hoffie perl: 106.4 seconds
          reasonably optimal perl: 13.7 seconds
          egrep -vi -f hoffie.egrep: 1.1 seconds

        where hoffie.egrep contains:

            (^(192\.9|192\.18|192\.27))|((\.gif|\.jpg|\.css) HTTP)

        The test data was a file of 1,200,000 lines, of which about half hit the regex.

        Hypothesis: In any problem where a grep solution is signif
        • For cheap thrills, I started a golf thread: Golf thread [develooper.com] which includes both a gawk and egrep version. The egrep version was "only" three times faster than the Perl version. :-( To write a 100-line Java program to solve such a trivial problem seems to me like killing an ant with a sledgehammer.

          egrep -v '\.(gif|GIF|jpg|JPG|css|CSS) HTTP|^192\.(9|18|29)\.' inf >e
          gawk '!/\.(gif|GIF|jpg|JPG|css|CSS) HTTP|^192\.(9|18|29)\./{print}' inf >a
          perl -ne'/^192\.(?:9|18|29)\./||/\.(?:gif|GIF|jpg|JPG|css|CSS) HTT

          --
          /-\
  • Why should he publish the code; it's not like there's more than one way to do it in Perl or anything...

    --
    J. David works really hard, has a passion for writing good software, and knows many of the world's best Perl programmers
  • The java code posted in that link is not using Java Regular expressions; i.e. java.util.regex. Hence, the equivalent code in perl should not be using Perl Regular expressions, instead using index().

    This kind of thing irks me. The author didn't even try to compare apples to apples. He compared fixed string indexing to perl regexes. Further, the code structure was fundamentally different.

    What a joke.

    • "the equivalent code in perl should not be using Perl Regular expressions, instead using index()."

      What I think you were saying is that the Java code should have been using regexes, but if you were saying that without regexes (a relatively recent addition to Java), Perl's should have also been excluded, then I would have to disagree. If I am going to compare, say, the performance of C and Java, I can't argue that Java isn't allowed to use OO features because C lacks them. If I use both Perl and Prolog to

      • In a general language comparison, I would agree with your point. But the article's author threw down the gauntlet of "Pattern Matching" then compared a trivial feature of Java, the string indexer method indexOf(), to the non-trivial regex engine of perl. I believe when comparing specific language features, you should try to keep the comparison as close as possible, OR make the bolder argument that disparate features are required by good idiomatic practice.
  • First thing to say is that the author is comparing substring matches with regex matches. Someone already posted code to convert the Perl version to substring matches.

    Second, this code:

    @fileext=("\\.gif","\\.jpg","\\.css","\\.GIF","\\.JPG","\\.CSS");
    ...
        foreach $fileext (@fileext) {
            next LINE if ($_ =~ /$fileext HTTP/);
        }

    recompiles the regex every time it's evaluated. Something like this is better, methinks
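
The commenter's own replacement is cut off in this copy. As a hedged sketch of the general idea only (not the commenter's code), the patterns can be compiled once with qr// before the loop, so nothing is recompiled per line. The names @ext_re, @ip_re and keep_line, and the sample lines, are assumptions made for illustration.

```perl
use strict;
use warnings;

# Compile every pattern once, up front. \Q...\E quotes the literal dots.
my @ext_re = map { qr/\Q$_\E HTTP/ } qw(.gif .jpg .css .GIF .JPG .CSS);
my @ip_re  = map { qr/^\Q$_\E/ }     qw(192.9. 192.18. 192.29.);

# Hypothetical helper: returns 1 if the log line should be kept, 0 otherwise.
sub keep_line {
    my ($line) = @_;
    for my $re (@ext_re, @ip_re) {
        # Matching against a qr// object reuses the compiled pattern.
        return 0 if $line =~ $re;
    }
    return 1;
}

# Invented sample lines, only to exercise the helper.
my @sample = (
    qq{192.18.3.4 - - "GET /about.html HTTP/1.0" 200\n},
    qq{10.0.0.1 - - "GET /style.CSS HTTP/1.0" 200\n},
    qq{10.0.0.1 - - "GET /about.html HTTP/1.0" 200\n},
);
print grep { keep_line($_) } @sample;
```

Whether this beats a single combined alternation would need measuring on real data; the sketch only shows where the compilation cost moves.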

  • As I said over in the original thread in Big Dave's journal, I hope someone will have the time to write a polite, professional and helpful reply that'll be sent both to the author of the article and to whatever Java forum/editors are applicable. Remember: be polite. Being snide, condescending, snotty, aggressive, ironic, or whatever, will do only harm.

    From the technical viewpoints:
    (1) Perl code not published
    (2) input data not published
    (3) comparing apples to oranges
    (4) the Perl code is very slow for
  • No doubt that guy didn't see this: Regular Expression Matching benchmarks [bagley.org]... the whole site is worth it: The Great Computer Language Shootout [bagley.org] as well as The Great Win32 Computer Language Shootout [dada.perl.it].

    freddo [netfirms.com]
  • It's not just about speed, though. It also matters how long it takes you to write the program and how maintainable and extensible it is.

    For speed I'm offering (untested, just jotted down quickly):

        #!/usr/bin/perl -n
        for $e (qw/gif jpg css GIF JPG CSS/) {
            next if index($_, "$e HTTP") != -1
        }
        print, next if substr($_, 0, 4) eq '192.';
        next if substr($_, 4, 2) eq '9.';
        next if substr($_, 4, 3) eq '1
    • errr... this should be

      > next if index($_, "$e HTTP") != -1

          next LINE if index($_, "$e HTTP") != -1

      and

      > print, next if substr($_, 0, 4) eq '192.';

          print, next if substr($_, 0, 4) ne '192.';

      oh, for an edit interface... but you get the idea.

      marcel
  • As soon as I saw that ~100 line Java program, I immediately wrote a one-line Perl equivalent:

    while (<>) { print unless /^(192\.(9|18|29)\.|\.(?i:(gif|jpg|css)) HTTP)/; }

    Of course, you could match his argument conventions precisely, but why bother? This form is the normal Perl way to do it, and the author's Perl and Java arguments were already different.

    I haven't benchmarked this one-liner, but I bet it's faster than the author's Perl version, and likely faster than the Java code as well. It might be a
    --

    Deven

    "Simple things should be simple, and complex things should be possible." - Alan Kay

    • You may want to double check the position of your ^

      As someone said somewhere (petdance iirc), when making optimized solutions, test. It's something a lot of people seem to not be doing in this thread (either here or in davorg's journal). If you're going to make it more efficient, you might as well make it produce the same results.

      At work, I produced a shiny new version of a previous routine. I couldn't really benchmark them though: the previous version processed much less data due to a bug in its impleme
      --
        ---ict / Spoon
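
The advice above about testing an optimized version against the original can be sketched roughly as follows. This is an illustration under stated assumptions, not anyone's posted code: keep_reference mirrors the logic of the original script, keep_candidate is one possible single-regex rewrite with the ^ anchor on the IP branch only, and the sample lines are invented.

```perl
use strict;
use warnings;

# Reference filter: same logic as the original script, one pattern at a time.
my @sun_ips  = ('192\.9\.', '192\.18\.', '192\.29\.');
my @file_ext = ('\.gif', '\.jpg', '\.css', '\.GIF', '\.JPG', '\.CSS');

sub keep_reference {
    my ($line) = @_;
    for my $ext (@file_ext) { return 0 if $line =~ /$ext HTTP/ }
    for my $ip  (@sun_ips)  { return 0 if $line =~ /^$ip/ }
    return 1;
}

# Candidate filter: one alternation; only the IP branch is anchored with ^.
my $skip = qr/^192\.(?:9|18|29)\.|\.(?:gif|jpg|css|GIF|JPG|CSS) HTTP/;

sub keep_candidate {
    my ($line) = @_;
    return $line =~ $skip ? 0 : 1;
}

# Invented sample lines, chosen to include a few edge cases.
my @sample = (
    '192.9.1.1 - - "GET /index.html HTTP/1.0" 200',
    '192.180.1.1 - - "GET /index.html HTTP/1.0" 200',
    '10.0.0.1 - - "GET /logo.gif HTTP/1.0" 200',
    '10.0.0.1 - - "GET /logo.gifs HTTP/1.0" 200',
    '10.0.0.1 192.29.1.1 "GET /index.html HTTP/1.0" 200',
);

# Any line on which the two filters disagree is a bug in the rewrite.
my @mismatches = grep { keep_reference($_) != keep_candidate($_) } @sample;

if (@mismatches) {
    print "MISMATCH: $_\n" for @mismatches;
} else {
    print "filters agree on all ", scalar(@sample), " sample lines\n";
}
```

A handful of hand-picked lines proves little on its own; running both filters over a slice of a real log and diffing the outputs would be the stronger check.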
  • Seriously folks, Java is a nice language and all, but why not use the right tool for the job? As demonstrated most effectively by Professor Hoffman, it's quite cumbersome to parse text files using Java. Now with Perl, you can do something like this:

    perl -ne "print unless /^192\.(9|18|29)\./o||/\.(gif|jpg|css|GIF|JPG|CSS) HTTP/o" < access-log > clean-log

    Heck, you could even make it your .sig. Not to mention that it runs faster than the Java version. The regex solution is even a tad bit faster tha

    • I sent John the one-liner and he was nice enough to test it himself. Results follow (emphasis mine)...

      From: John Hoffmann
      Date: Tue Sep 17 15:32:30 2002 (PDT)
      To: Brad Choate
      Subject: Re: Java vs. Perl

      Brad,

      Thanks, you were the second person to write, but the first guy couldn't offer an
      optimization. Just ran your one liner on 578 Meg file and it took half the time
      of the java.

      %timex perl -ne "print unless /^192\.(9|18|29)\./o || /\.(gif|jpg|css|GIF|JPG|CSS) HTTP/o" < developer.20020916.raw > develo
      • The java programmer who wrote the LogParse class wants to try JDK 1.4 with regular expressions and the new IO classes to see the result. I'll see what we can do to publish a round two of the optimized Perl and the new Java.
        Oddly enough, I can't see how that would make it go any faster than Java's non-regex solution. It seems like it would only lose ground!
        --
        • Randal L. Schwartz
        • Stonehenge