Stories
Slash Boxes
Comments
NOTE: use Perl; is on undef hiatus. You can read content, but you can't post it. More info will be forthcoming forthcomingly.

All the Perl that's Practical to Extract and Report

use Perl Log In

Log In

[ Create a new account ]

Buck (2837)

Buck
  buck@buckpyland.com
http://www.buckpyland.com/
AOL IM: BuckPyl (Add Buddy, Send Message)

Is there anything I haven't worked on/with? 8) I like Perl because it let's me code the way I think. Scary, huh?

Journal of Buck (2837)

Tuesday January 14, 2003
10:48 PM

Perl, Ruby and Python (Part Deux)

[ #9979 ]

Continuing with my comments made here, here's the source for the Perl and Ruby scripts I used to test with. The Python version is included for ... completeness (I'm being polite, here :).

#!/usr/bin/perl

my $target = shift;
for (my $iter = 0; $iter < 100; $iter++)
{
  my $count = 0;
  open(INDEX, "/usr/ports/INDEX");
  while (<INDEX>)
  {
    chomp;
    my @fields = split/\|/;
    if ($fields[0] =~ m{$target} || $fields[6] =~ m{$target} ||
        $fields[3] =~ m{$target} || $fields[7] =~ m{$target} ||
        $fields[8] =~ m{$target})
    {
       $count++;
    }
  }
  close INDEX;
}


#!/usr/local/bin/ruby
# distribution-name|port-path|installation-prefix|comment| \
# description-file|maintainer|categories|build deps|run deps|www site
target = ARGV[0]
100.times do |iter|
  index = File.new("/usr/ports/INDEX")
  count = 0
  index.each do |line|
    fields = line.chomp.split('|')
    if fields[0] =~ /#{target}/ || fields[6] =~ /#{target}/ || \
        fields[3] =~ /#{target}/ || fields[7] =~ /#{target}/ || \
        fields[8] =~ /#{target}/
      count += 1
    end
  end
  index.close
end


#!/usr/local/bin/python

import sys
import string
import re
target = re.compile(sys.argv[1])

for iter in range(100):
    index = open('/usr/ports/INDEX', 'r')
    count = 0
    line = index.readline()
    while line <> '' :
        fields = line.strip().split('|')
        if target.search(fields[0]) or target.search(fields[6]) \
                 or target.search(fields[3]) or target.search(fields[7]) \
                 or target.search(fields[8]):
             count = count + 1
        line = index.readline()
    index.close()

An example of the data used (from /usr/ports/INDEX in FreeBSD)
# distribution-name|port-path|installation-prefix|comment| \
# description-file|maintainer|categories|build deps|run deps|www site
9e-1.0|/usr/ports/archivers/9e|/usr/local|Explode Plan9 archives|/usr/ports/archivers/9e/pkg-descr|ports@FreeBSD.Org|archivers|||http:// www.eecs.harvard.edu/~wkj/Software/9e/
arc-5.21e.8_1|/usr/ports/archivers/arc|/usr/local|Create & extract files from DOS .ARC files|/usr/ports/archivers/arc/pkg-descr|ache@FreeBSD.org|archivers|||
arj-3.10b|/usr/ports/archivers/arj|/usr/local|Open-source ARJ|/usr/ports/archivers/arj/pkg-descr|kot@premierbank.dp.ua|archivers|autoconf2 13-2.13.000227_5 expat-1.95.5 gettext-0.11.5_1 gmake-3.80 libiconv-1.8_2 m4-1.4_1||http://arj.sourceforge.net/
bzip-0.21|/usr/ports/archivers/bzip|/usr/local|A block-sorting file compressor|/usr/ports/archivers/bzip/pkg-descr|ports@FreeBSD.org|archivers|||htt p://www.muraroa.demon.co.uk/
bzip2-1.0.2|/usr/ports/archivers/bzip2|/usr/local|A block-sorting file compressor|/usr/ports/archivers/bzip2/pkg-descr|jharris@widomaker.com|archivers| ||http://sources.redhat.com/bzip2/
cabextract-0.6|/usr/ports/archivers/cabextract|/usr/local|A program to extract Microsoft cabinet (.CAB) files|/usr/ports/archivers/cabextract/pkg-descr|sobomax@FreeBSD.org|archivers|li bgnugetopt-1.2|libgnugetopt-1.2|http://www.kyz.uklinux.net/cabextract.php3

The average runtimes, minutes and seconds, have been (for running each of the above scripts once):
Perl: 0:35
Python: 0:38
Ruby: 2:29

The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
 Full
 Abbreviated
 Hidden
More | Login | Reply
Loading... please wait.
  • I have no idea what you've been smoking, because there's no way you should be getting that kind of discrepency. Either you're using a very old version of ruby, your interpreter is broken, or you've been sniffing glue again.

    For my results, I used ruby 1.6.7 and perl 5.8.0 on Mandrake 9. I took the sample text you gave in your journal entry and copied it over and over until I ended up with a 2.4 MB file. I used "bzip-0.21" as the target. Hopefully, I didn't screw up the logic.

    I've provided the exact b

    • Oops - those were the benchmarks against the 48k file. Here are the benchmarks against the 2.4mb file:

      Ruby:

      djberge>/usr/local/bin/ruby ruby_bench.rb
            user     system      total        real
      original:270.560000   2.740000 273.300000 (273.258903)
      optimized:134.710000   1.890000 136.600000 (136.578120)

      Perl:

      djberge>perl perl_bench.pl
      Benchmark: timing 1 iterations of original...
        original: 129 wallclock sec

    • My reply is actually to the original post, not the first reply, but I couldn't find a link to comment on that.

      Anyway, I actually was able to get the benchmark on the Perl test significantly lower by doing 2 things:
         I precompiled the regexp
         I joined the relevant search fields using an (assumedly) unused char (^A) and searched on that.

      On my box that put the average from around 37 secs. to around 26 secs. (Using djberg96's benchmark version of the script).

      use Benchmark;
      use strict;
      my
      • Yep. Precompiling the regex and joining the fields to be searched shaved a couple of seconds off the Perl script.

        Thanks.
        --
        Buck
    • Since the 'optimized' ruby code doesn't short-circuit testing the rest of the fields on a successful match, you could make the perl a little more perlish also:
      $count++ if grep m{$target}, (split /\|/) [0,3,6..8]
      Or use List::Util::first instead of grep (though it may only be an improvement on bigger arrays).

      I'm using perl 5.6.1 and ruby 1.6.8 and getting ruby about twice as slow as perl.

      • You're right - I forgot to short circuit. I'm not sure how that helps Ruby's case, though. Add a "break" after the "count += 1" line. It didn't seem to improve performance significantly for me, though.
        • The match only occurs in 1 out of every 6 lines (using your target and the sample data in the top post here), so you'd only see at most about a 16% benefit (if that). If the match occurred early in the string on more lines, there might be more benefit.

          I just installed Ruby today, and have been poking through online docs earlier, and couldn't find a 'break' or 'last' statement. Is there such a thing? The best I could come up with was throwing an exception and catching it outside that loop. I still need to g

          • I still need to get a Ruby book...

            Visit rubycentral [rubycentral.com] or ruby-doc [ruby-doc.org].

            The first link is an online version of Programming Ruby, aka "The Pickaxe". You can still buy that book at the store, if you prefer paper.

            • I had looked through the online book, but couldn't find a 'break' at first. I finally did find 'break', 'next', and 'redo' in the Expressions section; I had been looking in the Iterators section.

              I was poking through the bookstore and the only Ruby book there was Sam's "Learn Ruby in 21 days". I can't recommend it, as it had no mention of 'break', 'next', or 'redo', nor the IO.foreach method in your example (and it was a thick book).

    • OK. Tried your version of the Ruby script. My ruby is version 1.6.8 on an Athlon 500mHz system running FreeBSD 4.7-STABLE. Used the /usr/ports/INDEX file that the sample data I originally posted came from; it's 3MB in size. However, I didn't use any language specific benchmark modules; I wanted to compare apples to apples. Anyway, here's the results:

      [ayeka:~/portfinder] buck> repeat 5 time ruby pftest2.rb ruby
      68.394u 2.119s 1:10.56 99.9% 4+1346k 0+0io 0pf+0w
      69.770u 2.258s 1:12.08 99.9% 4+1346k 0+0
      --
      Buck
      • Ruby:

        127.33user 2.35system 2:10.31elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (222major+301minor)pagefaults 0swaps

        Perl:

        126.28user 1.74system 2:08.32elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (374major+164minor)pagefaults 0swaps

        Perhaps it's a NetBSD issue? Seems unlikely, but based on the results you're getting versus what I'm getting, I'd consider it a possibility at least. At least I cut it down to x2 instead of x4!

        Please consider posting to the mail

        • Perhaps it's a NetBSD issue? (Buck: FreeBSD even :) ) Seems unlikely, but based on the results you're getting versus what I'm getting, I'd consider it a possibility at least. At least I cut it down to x2 instead of x4!
          Agreed.

          Please consider posting to the mailing list with this info (ruby-talk@ruby-lang.org).
          I'd like to try these on my TiBook with OSX 10.2.3 first, though I don't expect much of a change. Is the mailing list archived somewhere where I can research before posting anything?

          By the way,

          --
          Buck
          • You can find the archives at http://blade.nagaokaut.ac.jp/ruby/ruby-talk/index.shtml

            There's a gateway between the mailing list and comp.lang.ruby, so you can search via deja (or your local news serve) and get everything from the mailing list that way.

            FreeBSD even :) - Oops. Probably not the first time I made that mistake. Probably won't be the last. :-P