OK, first of all, Happy New (Civil) Year to everybody. Then, I'd like to note that I enjoyed the Israeli 2007 Perl Workshop that I attended yesterday a lot, and would like to thank all the organisers for making it happen. I posted some notes from topics we discussed in the conference to the mailing list, so you may find it interesting to read them. I may post a more thorough report later on.
Now, to the main topic of this post. I was on Freenode's #perl the other day, when we were discussing how to count the number of lines in a file. Someone suggested opening the file and reading it with <$fh>, counting the lines as you go. Someone else suggested trapping the output of wc -l. Then someone argued that trapping the output of wc -l is non-portable and incurs the cost of a fork. But is it actually slower?
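For concreteness, the two approaches under discussion look roughly like this. (This is only a sketch: the filename lines.txt and its contents are hypothetical examples, not part of the discussion.)

```perl
use strict;
use warnings;

# Create a small example file to count (hypothetical data).
my $filename = 'lines.txt';
open my $out, '>', $filename or die "open: $!";
print {$out} "line $_\n" for 1 .. 5;
close $out;

# Pure-Perl approach: read through the whole file, then consult
# $., Perl's current-input-line-number variable.
open my $in, '<', $filename or die "open: $!";
1 while <$in>;
my $pure_perl_count = $.;    # 5
close $in;

# External approach: trap the output of wc -l and extract the
# leading number (Unix-like systems only).
my ($wc_count) = (`wc -l $filename` =~ /(\d+)/);    # 5
```

Note that $. must be read before close($in), because explicitly closing the handle resets it.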
To check, I created a very large text file using the following command:
$ find . -name '*.xml' | xargs cat > mega.xml

Here, I located all the files ending with .xml and concatenated them into one big file.
$ LC_ALL=C wc mega.xml
195594 1704386 17790746 mega.xml
Then I ran the following benchmark using it:
use strict;
use warnings;
use Benchmark ':hireswallclock';

sub wc_count
{
    my $s = `wc -l mega.xml`;
    return ($s =~ /(\d+)/)[0];
}

sub lo_count
{
    open my $in, '<', 'mega.xml' or die "open: $!";
    1 while <$in>;
    my $ret = $.;
    close $in;
    return $ret;
}

die "The counts do not match!" if lo_count() != wc_count();

timethese(100, { 'wc' => \&wc_count, 'lo' => \&lo_count });
Benchmark: timing 100 iterations of lo, wc...
lo: 18.0495 wallclock secs (16.72 usr + 1.17 sys = 17.89 CPU) @ 5.59/s (n=100)
wc: 3.70755 wallclock secs ( 0.00 usr 0.03 sys + 1.77 cusr 1.91 csys = 3.71 CPU) @ 3333.33/s (n=100)
The wc method wins and is substantially faster. This is probably because wc is written in optimised C and so counts the lines more quickly, despite the overhead of forking and executing an external process.
For small files, the pure-Perl version wins, since the fork-and-exec overhead dominates. For large files, wc is better. Naturally, though, wc -l is not portable, which may be a deal-breaker in some cases.
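If portability matters but line-by-line reading is too slow, a common middle ground (not part of the benchmark above, just a sketch; block_count is a hypothetical helper name) is to read the file in large blocks and count the newlines with tr///, which avoids both the fork and the per-line overhead:

```perl
use strict;
use warnings;

# Count the newlines in a file by reading fixed-size blocks.
# Portable pure Perl, and usually much faster than reading line
# by line, though it may still trail wc on very large files.
sub block_count
{
    my ($filename) = @_;
    open my $in, '<:raw', $filename or die "open: $!";
    my $count = 0;
    my $buf;
    while (sysread($in, $buf, 65_536))
    {
        # tr/// in scalar context returns the number of matches
        # without modifying $buf.
        $count += ($buf =~ tr/\n//);
    }
    close $in;
    return $count;
}
```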
The lesson of this is that forking processes or calling external programs is sometimes a reasonable thing to do (as MJD noted earlier in the link).