All the Perl that's Practical to Extract and Report

Matts (1087)

I work for MessageLabs [messagelabs.com] in Toronto, ON, Canada. I write spam filters, MTA software, high performance network software, string matching algorithms, and other cool stuff mostly in Perl and C.

Journal of Matts (1087)

Wednesday May 31, 2006
05:01 PM

Finally some perl!

[ #29767 ]

Yes, it's been yonks since I posted any Perl. Well, today I learned that read() can take an offset saying where in your $buf to put the data, so I can implement what should be an efficient grep for multiple strings in binary data (i.e. where I can't do: while (<$fh>) ). So given $fh and @strings, and max() from List::Util, I can do this:

    my $max_len = max(map{length} @strings);
    my $regexp = "(" . join("|", map {quotemeta} @strings) . ")";
    my $buf = '';
    while (1) {
        substr($buf, 0, length($buf) - $max_len) = "";
        my $len = read($fh, $buf, 8096, length($buf));
        last unless $len;
        if ($buf =~ /$regexp/o) {
            return $1;
        }
    }

I could probably add code to show where in the file it matched, but I don't need that.
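For anyone who does want the match position, here is a minimal sketch of how it might be tracked. It is an illustration, not the code I actually run: the name grep_binary and the $base counter are made up for this example, and it assumes @strings is non-empty and $fh is already in binary mode. It also uses qr// rather than /o so it can safely live in a subroutine.

```perl
use strict;
use warnings;
use List::Util qw(max);

# Hypothetical helper (not from the original post): returns the matched
# string and its offset in the stream, or an empty list if nothing matched.
sub grep_binary {
    my ($fh, @strings) = @_;
    my $max_len = max(map { length } @strings);
    my $alt     = join '|', map { quotemeta } @strings;
    my $regexp  = qr/($alt)/;    # compiled once per call
    my $buf     = '';
    my $base    = 0;             # stream offset of the first byte in $buf

    while (1) {
        # Keep only the last $max_len bytes, so a match that straddles
        # two reads is still seen, and remember how much was dropped.
        my $drop = length($buf) - $max_len;
        if ($drop > 0) {
            substr($buf, 0, $drop) = '';
            $base += $drop;
        }
        # Append the next chunk after whatever is left in $buf.
        my $len = read($fh, $buf, 8096, length($buf));
        die "read failed: $!" unless defined $len;
        last unless $len;
        if ($buf =~ $regexp) {
            # $-[1] is where capture group 1 starts within $buf.
            return ($1, $base + $-[1]);
        }
    }
    return;
}
```

Called in list context, e.g. my ($hit, $pos) = grep_binary($fh, @strings);, it gives back the first string found and where it starts, counted from wherever $fh was positioned on entry.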

The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • I've found something more like the following to be useful:

    my $max_len = max(8096,map{length} @strings);
    my $regexp = "(" . join("|", map {quotemeta} @strings) . ")";
    my @buf = ('');
    while (1) {
        my $len =
  • I hope you're using this code only as part of the mainline of a script and not as a subroutine or in a module. But even still, it's often the case that scripts turn into modules at some point, so don't use /o if you can help it. It doesn't really buy you anything and could actually cause problems as your program grows. If you think you need /o, you really just need qr//. When someone uses /o it is almost always a case of premature optimization (99.99999999999% of the time :-).
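To make the /o concern concrete, here is a small sketch of what can go wrong once such code ends up in a subroutine. The two subroutine names are made up for the illustration. With /o the pattern from the first call is reused for every later call, while qr// builds a fresh pattern each time.

```perl
use strict;
use warnings;

# Hypothetical subroutine: /o compiles the pattern on first use and
# ignores any later change to $pat.
sub match_with_o {
    my ($text, $pat) = @_;
    return $text =~ /$pat/o ? 1 : 0;
}

# Same check with qr//: the pattern is compiled once per call, so each
# call honours the pattern it was given.
sub match_with_qr {
    my ($text, $pat) = @_;
    my $rx = qr/$pat/;
    return $text =~ $rx ? 1 : 0;
}

my @with_o  = (match_with_o('foo', 'foo'),  match_with_o('bar', 'bar'));
my @with_qr = (match_with_qr('foo', 'foo'), match_with_qr('bar', 'bar'));

print "with /o: @with_o\n";    # with /o: 1 0  (second call still matches against 'foo')
print "with qr: @with_qr\n";   # with qr: 1 1
```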
  • It’s easy to arrange:

    my $max_len = max map length, @strings;
    my $rx = qr/(@{[ join '|', map quotemeta, @strings ]})/;
    my $buf = '';
    while ( read $fh, $buf, 8096, length $buf ) {
        return $1 if $buf =~ $rx;
        substr $buf, 0, length($buf) - $max_len, '';
    }

    As a bonus the code is shorter and clearer.

    The following is a tweak to avoid unnecessarily shrinking $buf when the space is needed for the read that immediately follows, and it may or may not be faster by a few percent. I didn