(Note: a first draft of this post has been lying on my shelf for over a year. Now, I finally got around to finishing it.)
Now and then in Perl forums, the discussion comes up about how bad it is to use grep in boolean context. And now it's even been poured into a Perl::Critic rule, based on Damian Conway's PBP book, and the argument they bring up is always the same (see the POD for the module):
Using
grepin boolean context is a common idiom for checking if any elements in a list match a condition. This works because boolean context is a subset of scalar context, and grep returns the number of matches in scalar context. A non-zero number of matches means a match.But consider the case of a long array where the first element is a match. Boolean
grepstill checks all of the rest of the elements needlessly. Instead, a better solution is to use theanyfunction from List::MoreUtils, which short-circuits after the first successful match to save time.
Now, why would you expect the item you're looking for to be the first one in the array? I see no reason for such an assumption, at all. On average, you still have to look through about half of the array items, if the item is even in the array. If it isn't, you still have to look through all items, anyway.
So, depending on the likelihood that an item is in the array, you might save between 0% and 50% of execution time by leaving the loop early. Personally, I don't find that overly impressive... As O(n) is still O(n).
The implied assumption is that the overhead of grep and any is ignorable, or at least, that it is the same for both, somebody that nobody actually bothered to verify. Well, I bothered. I decided to benchmark grep vs. any.
The kind of code that I benchmarked is the simple common problem of testing for the presence of a string in an array, just like IN in (some dialects of) SQL, and in_array of PHP.
Here are the prerequisites and the functions I tested:
my @letters = 'A'
.. 'Z';
my %letter; $letter{$_} = 1 foreach @letters;
# List::MoreUtils' any
any => sub { my $x = any { $_ eq $search } @letters },
# grep with expression
grepE => sub { my $x = grep $_ eq $search, @letters },
# grep with block
grepB => sub { my $x = grep { $_ eq $search } @letters },
# explicitly written code with foreach and last
foreach => sub { my $x=0; $_ eq $search and $x=1, last foreach @letters; },
# the expected overall winner: prefilled hash
hash => sub { my $x = exists $letter{$search} },
# hash on the fly, rebuilt on every test
temphash => sub { my %h; @h{@letters}=(); my $x = exists $h{$search} },
I searched for 'Z' (last item), 'M' (center item), 'C' (pretty up front), 'A' (first item), and 'banana' (not in the list). And here are the results (ActivePerl 5.8.8, Windows XP, 2.4GHz):
Search for Z
Rate Time any temphash foreach grepB grepE hash
any 50172/s 19.93µs -- -60% -65% -69% -71% -98%
temphash 125555/s 7.965µs 150% -- -12% -22% -27% -96%
foreach 142037/s 7.04µs 183% 13% -- -11% -18% -95%
grepB 160313/s 6.238µs 220% 28% 13% -- -7% -94%
grepE 172645/s 5.792µs 244% 38% 22% 8% -- -94%
hash 2828893/s 0.3535µs 5538% 2153% 1892% 1665% 1539% --
Search for M
Rate Time any temphash grepB grepE foreach hash
any 65938/s 15.17µs -- -47% -59% -62% -74% -98%
temphash 124203/s 8.051µs 88% -- -22% -29% -51% -95%
grepB 160049/s 6.248µs 143% 29% -- -8% -36% -94%
grepE 174201/s 5.74µs 164% 40% 9% -- -31% -94%
foreach 251362/s 3.978µs 281% 102% 57% 44% -- -91%
hash 2733577/s 0.3658µs 4046% 2101% 1608% 1469% 988% --
Search for C
Rate Time any temphash grepB grepE foreach hash
any 87975/s 11.37µs -- -29% -45% -50% -85% -97%
temphash 124136/s 8.056µs 41% -- -22% -29% -78% -96%
grepB 158854/s 6.295µs 81% 28% -- -9% -72% -95%
grepE 175068/s 5.712µs 99% 41% 10% -- -69% -94%
foreach 571092/s 1.751µs 549% 360% 260% 226% -- -80%
hash 2926393/s 0.3417µs 3226% 2257% 1742% 1572% 412% --
Search for A
Rate Time any temphash grepB grepE foreach hash
any 94486/s 10.58µs -- -23% -40% -47% -90% -96%
temphash 123096/s 8.124µs 30% -- -22% -30% -87% -95%
grepB 158407/s 6.313µs 68% 29% -- -10% -84% -94%
grepE 176637/s 5.661µs 87% 43% 12% -- -82% -93%
foreach 978762/s 1.022µs 936% 695% 518% 454% -- -64%
hash 2687559/s 0.3721µs 2744% 2083% 1597% 1422% 175% --
Search for banana
Rate Time any temphash foreach grepB grepE hash
any 51867/s 19.28µs -- -58% -62% -69% -72% -98%
temphash 123717/s 8.083µs 139% -- -9% -25% -34% -95%
foreach 136015/s 7.352µs 162% 10% -- -18% -28% -95%
grepB 165077/s 6.058µs 218% 33% 21% -- -12% -94%
grepE 187691/s 5.328µs 262% 52% 38% 14% -- -93%
hash 2670255/s 0.3745µs 5048% 2058% 1863% 1518% 1323% --
Well, that is looking bad for any: it is the worst performer in all cases. In the average case ('M'), it is almost 3 times slower than grep (grep with expression is, as I expected, a bit better than grep with a block, as the latter has an overhead of entering/exiting a lexical block). Even in its best case, any is still almost twice as slow as grep. So much for saving. The manually written loop is, in the average case, about a third faster than grep. But, if the item isn't found, it is actually slower.
No surprise that, if you really want a high speed test, and you need to test against the same array often, it is best to prepare a hash and simply test if the item is in the hash.
What I found rather surprising, is that populating a hash and using it once ('temphash'), isn't such a bad performer.
For completeness' sake, here's the same benchmark on Strawberry Perl 5.10.
Search for Z
Rate Time any temphash foreach grepE grepB hash
any 46995/s 21.28µs -- -60% -62% -65% -67% -98%
temphash 117471/s 8.513µs 150% -- -6% -13% -19% -96%
foreach 125313/s 7.98µs 167% 7% -- -7% -13% -95%
grepE 135294/s 7.391µs 188% 15% 8% -- -6% -95%
grepB 144589/s 6.916µs 208% 23% 15% 7% -- -95%
hash 2740628/s 0.3649µs 5732% 2233% 2087% 1926% 1795% --
Search for M
Rate Time any temphash grepE grepB foreach hash
any 63959/s 15.64µs -- -46% -54% -55% -71% -97%
temphash 118040/s 8.472µs 85% -- -15% -18% -46% -95%
grepE 138646/s 7.213µs 117% 17% -- -3% -36% -94%
grepB 143388/s 6.974µs 124% 21% 3% -- -34% -94%
foreach 218040/s 4.586µs 241% 85% 57% 52% -- -91%
hash 2516377/s 0.3974µs 3834% 2032% 1715% 1655% 1054% --
Search for C
Rate Time any temphash grepE grepB foreach hash
any 86710/s 11.53µs -- -25% -36% -39% -85% -96%
temphash 115756/s 8.639µs 33% -- -15% -19% -79% -95%
grepE 135977/s 7.354µs 57% 17% -- -5% -76% -94%
grepB 142657/s 7.01µs 65% 23% 5% -- -75% -94%
foreach 559639/s 1.787µs 545% 383% 312% 292% -- -76%
hash 2315599/s 0.4319µs 2571% 1900% 1603% 1523% 314% --
Search for A
Rate Time any temphash grepE grepB foreach hash
any 85902/s 11.64µs -- -26% -36% -41% -90% -96%
temphash 115752/s 8.639µs 35% -- -14% -21% -87% -95%
grepE 134479/s 7.436µs 57% 16% -- -8% -85% -94%
grepB 146193/s 6.84µs 70% 26% 9% -- -84% -94%
foreach 902933/s 1.108µs 951% 680% 571% 518% -- -62%
hash 2405894/s 0.4156µs 2701% 1978% 1689% 1546% 166% --
Search for banana
Rate Time any temphash foreach grepE grepB hash
any 51088/s 19.57µs -- -56% -61% -69% -71% -98%
temphash 116528/s 8.582µs 128% -- -12% -29% -33% -96%
foreach 132626/s 7.54µs 160% 14% -- -19% -24% -95%
grepE 164236/s 6.089µs 221% 41% 24% -- -6% -94%
grepB 175040/s 5.713µs 243% 50% 32% 7% -- -93%
hash 2655952/s 0.3765µs 5099% 2179% 1903% 1517% 1417% --
As an aside: I've got the distinct impression that overall, Strawberry Perl 5.10 is slower than ActivePerl 5.10, by about 5-10%. The biggest surprise to me, however, is that grep BLOCK is no longer faster than grep EXPRESSION. So weird.
Now, the thing you would test for must not always be that simple. If it is very slow or otherwise expensive, you might still, and rightfully so, want to leave the testing loop early. But, although elegant as a solution, you should not blindly assume that any is the best solution for every such problem... Though I wish that grep would be made smarter, and that it either knows to leave the test loop early, or that you could manually do so, for example, with last.
p.s. The code to convert Rate to Time was kindly supplied by GrandFather on Perlmonks.
p.p.s. I'm sorry about the formatting problem, that "µ" should look like "µ", but even though it is a "µ" character in the text I posted, the journaling system shows it as "µ". It's a bug, I'm sure.
Who cares about the sppeed? (Score:2)
They are all damn fast, and using
anyreads much better.Your reasoning is flawed (Score:1)
Your reasoning is flawed. You are assuming that there is one positive match in the set, and that the set is small. On such a small set you're right any isn't any faster, but superior algorithms rarely show up on small datasets.
If you would test this with a larger dataset, any would come out much better. On my computer, using the list
1..1000and a needle of 500, it is in fact faster. The advantage only gets bigger when the haystack gets bigger.You're right it's not always better, but it is when it matters t
Re: (Score:1)
I agree. With such a small set no method is going to take an irritatingly long time. Since you have the program handy could you give it a quick tweak to use a bigger array and re-post the results?
I also agree with autarch that
anyreads better, as a way of signalling programmer intent. I'd probably use theanyfromPerl6::Junctionrather thanList::MoreUtils; any chance you could add that to your testing as well? Thanks!List::Util::first (Score:1)
Can you test how good would perform List::Util::first()?
Re: (Score:2)
firstis always close to 20% slower thanany...Re: (Score:1)
In this benchmark [sourceforge.net] grep and first were searching for the middle element in the array.