All the Perl that's Practical to Extract and Report

The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • by djberg96 (2603) on 2003.01.15 0:55 (#16021) Journal
    I have no idea what you've been smoking, because there's no way you should be getting that kind of discrepancy. Either you're using a very old version of ruby, your interpreter is broken, or you've been sniffing glue again.

    For my results, I used ruby 1.6.7 and perl 5.8.0 on Mandrake 9. I took the sample text you gave in your journal entry and copied it over and over until I ended up with a 2.4 MB file. I used "bzip-0.21" as the target. Hopefully, I didn't screw up the logic.
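For anyone who wants to build a comparable test file, the "copy it over and over" step could look roughly like the sketch below. The helper name build_test_file and the file names are hypothetical and not from the original post; it simply appends the sample text until the output reaches a target size.

```ruby
# Sketch only: grow a test file by appending the sample text repeatedly.
# build_test_file, "sample.txt" and "test.txt" are illustrative names,
# not taken from the original post.
def build_test_file(sample_path, out_path, target_bytes)
  sample = File.read(sample_path)
  raise ArgumentError, "sample file is empty" if sample.empty?

  File.open(out_path, "w") do |out|
    written = 0
    while written < target_bytes
      out.write(sample)
      written += sample.bytesize
    end
  end
  File.size(out_path)   # final size in bytes, at least target_bytes
end

# Example call (commented out because it needs a real sample file):
# build_test_file("sample.txt", "test.txt", 2_400_000)
```

The file always ends on a whole copy of the sample, so the final size can overshoot the target by up to one sample length.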

    I've provided the exact benchmark code that I used and the results of the benchmark module. I've also included the Perl benchmark results and code. Benchmarks first:

    djberge>/usr/local/bin/ruby ruby_bench.rb
       user     system      total        real
    original:  5.390000   0.100000   5.490000 (5.509422)
    optimized:  2.650000   0.080000   2.730000 (2.741779)

    Ruby code:

    require "benchmark"
    include Benchmark

    target = "bzip-0.21"

    bm do |x|
       x.report("original:"){
          100.times do |iter|
            index = File.new("test.txt")
            count = 0
            index.each do |line|
              fields = line.chomp.split('|')
              if fields[0] =~ /#{target}/ || fields[6] =~ /#{target}/ || \
                  fields[3] =~ /#{target}/ || fields[7] =~ /#{target}/ || \
                  fields[8] =~ /#{target}/
                count += 1
              end
            end
            index.close
          end
       }
       x.report("optimized:"){
          regex = Regexp.new(target)

          for n in 1..100
             count = 0
             IO.foreach("test.txt"){|line|
                line.chomp.split("|").indices(0,6,3,7,8).each{ |e|
                   if regex.match(e)
                      count += 1
                   end
                }
             }
          end
       }
    end

    Perl benchmarks:

    djberge>perl perl_bench.pl
    Benchmark: timing 1 iterations of original...
      original:  3 wallclock secs ( 2.52 usr +  0.04 sys =  2.56 CPU) @  0.39/s (n=1)

    Perl code:

    use Benchmark;
    $target = "bzip-0.21";

    timethese(1,{
       "original" => q{
          for (my $iter = 0; $iter < 100; $iter++)
          {
            my $count = 0;
            open(INDEX, "test.txt") || die "Couldn't open file: $!\n";
            while (<INDEX>)
            {
              chomp;
              my @fields = split/\|/;
              if ($fields[0] =~ m{$target} || $fields[6] =~ m{$target} ||
                  $fields[3] =~ m{$target} || $fields[7] =~ m{$target} ||
                  $fields[8] =~ m{$target})
              {
                 $count++;
              }
            }
            close INDEX;
          }
       }
    });

    I didn't do any serious averaging, but a few runs of each benchmark varied by no more than a tenth of a second.

    Please run this code and let me know what you get.

    • Oops - those were the benchmarks against the 48k file. Here are the benchmarks against the 2.4 MB file:

      Ruby:

      djberge>/usr/local/bin/ruby ruby_bench.rb
            user     system      total        real
      original:270.560000   2.740000 273.300000 (273.258903)
      optimized:134.710000   1.890000 136.600000 (136.578120)

      Perl:

      djberge>perl perl_bench.pl
      Benchmark: timing 1 iterations of original...
        original: 129 wallclock sec

    • My reply is actually to the original post, not the first reply, but I couldn't find a link to comment on that.

      Anyway, I was able to get the Perl benchmark time significantly lower by doing two things:
         I precompiled the regexp.
         I joined the relevant search fields using a (presumably) unused char (^A) and searched on that.

      On my box that brought the average down from around 37 seconds to around 26 seconds (using djberg96's benchmark version of the script).

      use Benchmark;
      use strict;
      my
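As a rough illustration of the two changes described in this comment (compile the pattern once, then join the searched fields with a separator and run a single match), here is a Ruby sketch of the same idea. It is not the poster's Perl script. FIELD_INDEXES, SEPARATOR and line_matches? are names made up for this example, and it uses values_at, which needs Ruby 1.8 or later (the 1.6.x versions in this thread used indices instead).

```ruby
# Sketch only: the "precompile and join" idea, written in Ruby.
# All names here are illustrative, not from the thread.
FIELD_INDEXES = [0, 3, 6, 7, 8].freeze
SEPARATOR = "\x01"   # Ctrl-A (^A), assumed never to appear in the data

def line_matches?(line, regex)
  fields = line.chomp.split("|")
  # values_at yields nil for fields the line does not have;
  # join turns each nil into an empty string.
  haystack = fields.values_at(*FIELD_INDEXES).join(SEPARATOR)
  !(regex =~ haystack).nil?
end

# Compile the pattern once, outside any loop, as in the thread's
# "optimized" version.
regex = Regexp.new("bzip-0.21")

count = 0
["bzip-0.21|a|b|c|d|e|f|g|h\n", "foo|bzip-0.21|x|y\n"].each do |line|
  count += 1 if line_matches?(line, regex)
end
puts count
```

One match per line replaces up to five, at the cost of building the joined string. The separator only matters if the pattern could otherwise match across a field boundary.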
      • Yep. Precompiling the regex and joining the fields to be searched shaved a couple of seconds off the Perl script.

        Thanks.
        --
        Buck
    • Since the 'optimized' ruby code doesn't short-circuit testing the rest of the fields on a successful match, you could make the perl a little more perlish also:
      $count++ if grep m{$target}, (split /\|/) [0,3,6..8]
      Or use List::Util::first instead of grep (though it may only be an improvement on bigger arrays).

      I'm using perl 5.6.1 and ruby 1.6.8 and getting ruby about twice as slow as perl.
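For comparison, a Ruby counterpart to the grep one-liner might look like the sketch below. It counts each line at most once, and any? stops at the first matching field, which is closer to the List::Util::first suggestion than to grep. The method name count_matching_lines is invented for this example, and values_at assumes Ruby 1.8 or later.

```ruby
# Sketch only: count lines where any of fields 0, 3, 6, 7, 8 matches.
# count_matching_lines is an illustrative name, not from the thread.
def count_matching_lines(lines, regex)
  count = 0
  lines.each do |line|
    fields = line.chomp.split("|").values_at(0, 3, 6, 7, 8)
    # any? returns as soon as one field matches; the nil check skips
    # fields that are missing on short lines.
    count += 1 if fields.any? { |f| f && f =~ regex }
  end
  count
end

# Usage with a file, as in the benchmark (commented out: needs test.txt):
# puts count_matching_lines(IO.foreach("test.txt"), Regexp.new("bzip-0.21"))
```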

      • You're right - I forgot to short circuit, though I'm not sure how that helps Ruby's case. Add a "break" after the "count += 1" line. It didn't seem to improve performance significantly for me.
        • The match only occurs in 1 out of every 6 lines (using your target and the sample data in the top post here), so you'd only see at most about a 16% benefit (if that). If the match occurred early in the string on more lines, there might be more benefit.

          I just installed Ruby today and have been poking through the online docs, but couldn't find a 'break' or 'last' statement. Is there such a thing? The best I could come up with was throwing an exception and catching it outside that loop. I still need to g

          • I still need to get a Ruby book...

            Visit rubycentral [rubycentral.com] or ruby-doc [ruby-doc.org].

            The first link is an online version of Programming Ruby, aka "The Pickaxe". You can still buy that book at the store, if you prefer paper.

            • I had looked through the online book, but couldn't find a 'break' at first. I finally did find 'break', 'next', and 'redo' in the Expressions section; I had been looking in the Iterators section.

              I was poking through the bookstore and the only Ruby book there was Sams' "Learn Ruby in 21 days". I can't recommend it, as it had no mention of 'break', 'next', or 'redo', nor the IO.foreach method in your example (and it was a thick book).
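Since 'break' and 'next' came up, here is a small sketch of how they behave inside an iterator block, applied to the short-circuit suggested earlier in the thread. The method name count_lines_with_break is made up for this example, and values_at stands in for the older indices call (an assumption that Ruby 1.8 or later is in use).

```ruby
# Sketch only: short-circuiting the per-field loop with break.
# count_lines_with_break is an illustrative name, not from the thread.
def count_lines_with_break(lines, regex)
  count = 0
  lines.each do |line|
    line.chomp.split("|").values_at(0, 6, 3, 7, 8).each do |field|
      next if field.nil?      # next: skip to the following field
      if regex.match(field)
        count += 1
        break                 # break: leave the inner each, move on to the next line
      end
    end
  end
  count
end
```

No exception is needed: break only exits the innermost iterator call, so the outer loop over lines keeps going.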

    • OK. Tried your version of the Ruby script. My ruby is version 1.6.8 on an Athlon 500 MHz system running FreeBSD 4.7-STABLE. Used the /usr/ports/INDEX file that the sample data I originally posted came from; it's 3 MB in size. However, I didn't use any language-specific benchmark modules; I wanted to compare apples to apples. Anyway, here are the results:

      [ayeka:~/portfinder] buck> repeat 5 time ruby pftest2.rb ruby
      68.394u 2.119s 1:10.56 99.9% 4+1346k 0+0io 0pf+0w
      69.770u 2.258s 1:12.08 99.9% 4+1346k 0+0
      --
      Buck
      • Ruby:

        127.33user 2.35system 2:10.31elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (222major+301minor)pagefaults 0swaps

        Perl:

        126.28user 1.74system 2:08.32elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (374major+164minor)pagefaults 0swaps

        Perhaps it's a NetBSD issue? Seems unlikely, but based on the results you're getting versus what I'm getting, I'd consider it a possibility at least. At least I cut it down to x2 instead of x4!

        Please consider posting to the mail

        • Perhaps it's a NetBSD issue? (Buck: FreeBSD even :) ) Seems unlikely, but based on the results you're getting versus what I'm getting, I'd consider it a possibility at least. At least I cut it down to x2 instead of x4!
          Agreed.

          Please consider posting to the mailing list with this info (ruby-talk@ruby-lang.org).
          I'd like to try these on my TiBook with OSX 10.2.3 first, though I don't expect much of a change. Is the mailing list archived somewhere where I can research before posting anything?

          By the way,

          --
          Buck
          • You can find the archives at http://blade.nagaokaut.ac.jp/ruby/ruby-talk/index.shtml

            There's a gateway between the mailing list and comp.lang.ruby, so you can search via deja (or your local news server) and get everything from the mailing list that way.

            FreeBSD even :) - Oops. Probably not the first time I made that mistake. Probably won't be the last. :-P