
All the Perl that's Practical to Extract and Report

  • by bschoate (202) on 2002.09.17 17:25 (#12972) Homepage

    Seriously folks, Java is a nice language and all, but why not use the right tool for the job? As demonstrated most effectively by Professor Hoffman, it's quite cumbersome to parse text files using Java. Now with Perl, you can do something like this:

    perl -ne "print unless /^192\.(9|18|29)\./o||/\.(gif|jpg|css|GIF|JPG|CSS) HTTP/o" < access-log > clean-log

    Heck, you could even make it your .sig. Not to mention that it runs faster than the Java version. The regex solution is even a tad bit faster than testing individual values using the index function. Go figure.

    Those Sun engineers should find better things to do with their time.
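To see what the one-liner above keeps and drops, here is a minimal sketch run against a tiny made-up log. The four log lines, the temp directory, and the file names under it are assumptions for illustration, not data from the original benchmark. Single quotes are used so the shell leaves the regex alone.

```shell
# Hypothetical sample data: addresses, paths and sizes are invented.
tmp=$(mktemp -d)
printf '%s\n' \
  '192.18.1.5 - - [16/Sep/2002:10:00:00 -0700] "GET /index.html HTTP/1.0" 200 1043' \
  '10.0.0.7 - - [16/Sep/2002:10:00:01 -0700] "GET /logo.gif HTTP/1.0" 200 2326' \
  '10.0.0.7 - - [16/Sep/2002:10:00:02 -0700] "GET /docs/intro.html HTTP/1.0" 200 5120' \
  '10.0.0.8 - - [16/Sep/2002:10:00:03 -0700] "GET /style.CSS HTTP/1.1" 200 512' \
  > "$tmp/access-log"

# Drop requests from 192.9.*, 192.18.*, 192.29.* and requests for images or stylesheets.
perl -ne 'print unless /^192\.(9|18|29)\./o||/\.(gif|jpg|css|GIF|JPG|CSS) HTTP/o' \
  < "$tmp/access-log" > "$tmp/clean-log"

# Only the /docs/intro.html request should survive.
cat "$tmp/clean-log"
```

The first line is dropped by the address test, the second and fourth by the file-extension test.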

    • I sent John the one-liner and he was nice enough to test it himself. Results follow (emphasis mine)...

      From: John Hoffmann
      Date: Tue Sep 17 15:32:30 2002 (PDT)
      To: Brad Choate
      Subject: Re: Java vs. Perl


      Thanks, you were the second person to write, but the first guy couldn't offer an
      optimization. Just ran your one-liner on a 578 Meg file and it took half the time
      of the java

      %timex perl -ne "print unless /^192\.(9|18|29)\./o || /\.(gif|jpg|css|GIF|JPG|CSS) HTTP/o" < developer.20020916.raw > develo
      • The java programmer who wrote the LogParse class wants to try JDK 1.4 with regular expressions and the new IO classes to see the result. I'll see what we can do to publish a round two of the optimized Perl and the new Java.
        Oddly enough, I can't see how that would make it go any faster than Java's non-regex solution. It seems like it would only lose ground!
        • Randal L. Schwartz
        • Stonehenge
      • Those two o's above, as in /.../o, seem quite useless because the regexes are constant. Is there a reason for them?

        • I guess not :) Silly me, I thought /o always helped when using the same regex pattern in a loop such as this. And I hadn't thought about specifying the non-capturing syntax, also suggested in this thread. The final result:

          perl -ne "print unless /^192\.(?:9|18|29)\./||/\.(?:gif|jpg|css|GIF|JPG|CSS) HTTP/" < input > output

          The fastest of all so far... any other improvements?

          • Try the following... if you are much more likely to see gif, jpg, or css requests than requests from the local addresses, switch the regexps around so the likelier match is tested first:

            perl -ne 'print unless /\.(?:gif|jpg|css|GIF|JPG|CSS) HTTP/||/^192\.(?:9|18|29)\./;' input > output

      • Wouldn't an immediate retraction be in order, showing the Perl one-liner and the Java 100-liner side by side with corrected timings? (And a note that the benchmark was designed for the purposes of Java advocacy.)
    • It is faster still if you use non-capturing parens. i.e. change (9|18|29) to (?:9|18|29), ditto for the parens around gif|jpg etc. And the 'o' modifier should be removed.

      • It's much faster still if you don't use alternation in the regex. /foo/ || /bar/ is significantly faster than /(foo|bar)/, since the former will be optimized to a pair of substr matches.