

Journal of Shlomi Fish (918)

Tuesday January 01, 2008
01:20 PM

Line Count Benchmark


OK, first of all, Happy New (Civil) Year to everybody. Then, I'd like to note that I greatly enjoyed the Israeli 2007 Perl Workshop that I attended yesterday, and would like to thank all the organisers for making it happen. I posted some notes from topics we discussed at the conference to the mailing list, so you may find them interesting to read. I may post a more thorough report later on.

Now, to the main topic of this post. I was on Freenode's #perl the other day, when we were discussing how to count the number of lines in a file. Someone suggested opening the file and counting the lines read from <$fh>. Someone else suggested trapping the output of wc -l. Then someone argued that trapping the output of wc -l is non-portable and incurs a costly fork. But is it actually slower?
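(As an aside that did not come up in that discussion, and only as a sketch: the output of wc -l can also be trapped without going through the shell, by using the list form of a pipe open. It still forks and is still non-portable, but it avoids interpolating the file name into a shell command.)

# A sketch only - not from the #perl discussion. The list-form pipe open runs
# wc directly, without a shell, though it still requires a fork.
open my $wc, "-|", "wc", "-l", "mega.xml"
    or die "Cannot run wc -l: $!";
my $first_line = <$wc>;
close($wc);
my ($line_count) = $first_line =~ /(\d+)/;
print "$line_count\n";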

To check, I created a very large text file using the following command:

locate .xml | grep '^/home/shlomi/Backup/Backup/2007/2007-12-07/disk-fs' | \
xargs cat > mega.xml

Here, I located all of the files ending with .xml in my backup, and concatenated them together into a single file, mega.xml. The statistics for this file (lines, words and bytes, as reported by wc) are:

$ LC_ALL=C wc mega.xml
195594 1704386 17790746 mega.xml

Then I ran the following benchmark using it:

#!/usr/bin/perl

use strict;
use warnings;

use Benchmark ':hireswallclock';

# Count the lines by trapping the output of wc -l.
sub wc_count
{
    my $s = `wc -l mega.xml`;
    $s =~ /^(\d+)/
        or die "Cannot parse the output of wc";
    return $1;
}

# Count the lines in pure Perl, using the line-number variable $. .
sub lo_count
{
    open my $in, "<", "mega.xml"
        or die "Cannot open mega.xml - $!";
    local $.;
    while (<$in>)
    {
    }
    my $ret = $.;
    close($in);
    return $ret;
}

# Sanity check - the two methods should agree.
if (lo_count() != wc_count())
{
    die "Error";
}

timethese(100,
    {
        'wc' => \&wc_count,
        'lo' => \&lo_count,
    }
);

The results?

shlomi:~/Download$ perl ../time-various-line-counts.pl
Benchmark: timing 100 iterations of lo, wc...
lo: 18.0495 wallclock secs (16.72 usr + 1.17 sys = 17.89 CPU) @ 5.59/s (n=100)
wc: 3.70755 wallclock secs ( 0.00 usr 0.03 sys + 1.77 cusr 1.91 csys = 3.71 CPU) @ 3333.33/s (n=100)

The wc method wins, and is substantially faster. This is probably because wc is written in optimised C, and so counts the lines more quickly, despite the overhead of the fork.
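One caveat worth flagging (a commenter below ran into it as well): the "@ 3333.33/s" figure is Benchmark's rate based on the parent process's CPU time alone, which excludes the CPU burnt inside the forked wc; the wallclock columns (roughly 18 vs. 3.7 seconds) are the fairer comparison. As a rough sketch, one could rate both subs by wallclock time instead, reusing the wc_count() and lo_count() subs from the script above:

# A sketch, not part of the original benchmark: time 100 calls of each sub by
# wallclock seconds with Time::HiRes, so the child's CPU time is not hidden.
use Time::HiRes qw(gettimeofday tv_interval);

for my $pair ( [ wc => \&wc_count ], [ lo => \&lo_count ] )
{
    my ( $name, $code ) = @$pair;

    my $start = [gettimeofday];
    $code->() for 1 .. 100;
    my $elapsed = tv_interval($start);

    printf "%s: %.2f wallclock secs (%.2f calls/sec)\n",
        $name, $elapsed, 100 / $elapsed;
}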

For small files, the pure-Perl version wins; for large files, wc is better. Naturally, wc is not portable, which may be a deal-breaker in some cases.
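Purely as an illustration of that trade-off (nothing like this appears in the post), one could dispatch between the two strategies per file. The 1 MiB cutoff is an arbitrary assumption, and wc_count() and lo_count() are assumed here to have been changed to accept a file name:

# Illustrative sketch only. Assumes variants of wc_count($filename) and
# lo_count($filename), and an arbitrary 1 MiB cutoff below which the fork
# is not worth paying for.
use constant WC_CUTOFF => 1024 * 1024;

sub count_lines
{
    my ($filename) = @_;

    # Avoid the external wc where it is unlikely to be available.
    if ( $^O ne 'MSWin32' and -s $filename > WC_CUTOFF )
    {
        return wc_count($filename);
    }
    return lo_count($filename);
}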

The lesson of this is that forking processes or calling external programs is sometimes a reasonable thing to do (as MJD has noted).

The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • ... and it is difficult to get it right. On FreeBSD there is usually some whitespace before the line count, so the regexp has to be changed to /^\s*(\d+)/.

    But the results on my system (amd64-freebsd) look different: using a text file with nearly 200,000 lines, the wc version makes only about 22 iterations/second, much slower than on your system. And the Perl version seems to be faster than on your system: 9/s.

  • And now I see the trap: Benchmark.pm seems to not count the CPU time from child processes! So it's not 3333/s for the wc version, but only 26.9/s.
  • sub tr_count {
        # Setting $/ to a reference makes readline return fixed-size records
        # (here 512 KiB chunks); $_ is also localised so the loop below does
        # not clobber the caller's $_.
        local ( $/, $_ ) = \( 2**19 );
        my $c = 0;
        # $file is assumed to hold the name of the file to count.
        open my $in, "<", $file;
        # Count the newlines in each chunk with y/// (tr).
        $c += y/\n// while <$in>;
        return $c;
    }

    Only slightly slower than wc on my machine.