All the Perl that's Practical to Extract and Report

  • I'm wondering how this takes regex greediness into account.

    What if you have a regex

    qr/(?:01)*/

    and you have a file which is several MB of random zeros and ones. If you read in

    111101010

    into your buffer, how can you know that the zero at the end of your buffer isn't about to be followed by another '1'?

    It seems that if you have greedy regex elements, then you may have to slurp in the whole file to be able to tell whether you've matched the longest possible record separator. One could write even more pathal

    • Greediness: if we get a match that leaves nothing in the buffer, then read some more into the buffer and try again until we either exhaust the file (for a regexp like /.*/s) or have a match that leaves something in the buffer.
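A minimal sketch of the refill rule described above, to make the discussion concrete. This is not the actual code under discussion: `split_records` and its chunk-size parameter are hypothetical names, the input is read through an in-memory filehandle for convenience, and the sketch assumes the separator regex never matches the empty string.

```perl
use strict;
use warnings;

# Hypothetical sketch, not the real implementation. Assumes $rs never
# matches the empty string. Returns a list of [record, separator] pairs.
sub split_records {
    my ($data, $rs, $chunk) = @_;
    open my $fh, '<', \$data or die "cannot open in-memory handle: $!";
    my ($buf, @out) = ('');
    while (1) {
        # Accept a match only if it leaves something in the buffer.
        if ($buf =~ $rs and $+[0] < length $buf) {
            push @out, [ substr($buf, 0, $-[0]),
                         substr($buf, $-[0], $+[0] - $-[0]) ];
            substr($buf, 0, $+[0], '');   # drop record and separator
            next;
        }
        # Otherwise read another chunk onto the end of the buffer.
        last unless read($fh, $buf, $chunk, length $buf);
    }
    # End of input: a match may now run to the very end of the buffer.
    if ($buf =~ $rs) {
        push @out, [ substr($buf, 0, $-[0]),
                     substr($buf, $-[0], $+[0] - $-[0]) ];
        substr($buf, 0, $+[0], '');
    }
    # Anything still in the buffer is a final, unterminated record.
    push @out, [ $buf, '' ] if length $buf;
    return @out;
}

print "[$_->[0]] [$_->[1]]\n"
    for split_records("alpha--beta---gamma", qr/-+/, 4);
```

With a four-character chunk, the trailing "-" of "beta-" is not accepted on its own, because it leaves nothing in the buffer; the next read extends it to "---".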

      Pathological cases will cause the entire file to be read into memory, but I don't see a way around that. If your record separator is /.*/s then you're saying to Perl "the entire file is my record separator". I don't see a way to handle this except by reading the whole file. That's

      • > if we get a match that leaves nothing in the buffer

        But the point I was attempting to make was that whether or not something is left in the buffer is not the best indication of whether or not the RS matched enough stuff. Maybe I'd have to see the actual code, but from the description of it, it sounds like it could behave differently depending on how input matched up with buffer size. You could use the same data and RS and get different results depending on your buffer size.
        my $record_sep = qr/(?:01)*/;
        my $data = '11110101010101000001111001011110';
        # RS should be  ^^^^^^^^^^    ^^    ^^^^
        # but if buffer goes to ^ then we'll have a problem
        # code will use ^^^^^^^^ as the first separator
        The regex will match four '01' pairs and leave a '0' in the buffer, not realizing that, had it only read in a bit more text, it could have matched one more '01' pair as part of that first RS.
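The buffer-size dependence can be reproduced with a small sketch. It assumes the "accept the match once something is left in the buffer" rule as described in the parent comment, not the real code; `first_separator` is a hypothetical helper. It also uses `(?:01)+` instead of `(?:01)*`, because the starred form matches the empty string at offset 0 of this data, which would hide the effect.

```perl
use strict;
use warnings;

# Hypothetical helper: the first separator found under the
# "accept once something is left in the buffer" rule.
sub first_separator {
    my ($data, $rs, $chunk) = @_;
    open my $fh, '<', \$data or die "cannot open in-memory handle: $!";
    my $buf = '';
    while (read($fh, $buf, $chunk, length $buf)) {
        return substr($buf, $-[0], $+[0] - $-[0])
            if $buf =~ $rs and $+[0] < length $buf;
    }
    # End of input: accept whatever matches, if anything.
    return $buf =~ $rs ? substr($buf, $-[0], $+[0] - $-[0]) : undef;
}

my $data = '11110101010101000001111001011110';
my $rs   = qr/(?:01)+/;   # '+' so the empty string never matches

# chunk 13: 01010101      (buffer ended on the dangling '0')
# chunk 64: 0101010101    (whole input visible at once)
printf "chunk %2d: %s\n", $_, first_separator($data, $rs, $_) for 13, 64;
```

Same data, same RS, two different first separators: the only thing that changed is where the buffer boundary fell.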

        In this next case, the problem isn't that it would slurp in the whole file; the problem is that it should slurp in the whole file but, from the description, it sounds like it might not do enough slurping, because there would still be unmatched text in the buffer. It's another example of why 'text left in the buffer' may not be the best indicator, given that regexes can be greedy.
        my $record_sep = qr/.*(?=.)/s;
        my $data = 'abcdefghijklmnopqrstuvwxyz';
        Suppose your buffer size is five characters. First the buffer reads in
        abcde
        It applies the regex to that buffer and gets a match. The match indicates that the RS is
        abcd
        and that there is text still left unmatched in the buffer
        e
        So, it declares a success. There's an empty string for the first record, and 'abcd' as the first record separator. What happens after that depends on how you manage the buffer, but chances are that the end result would be something other than the expected result, which would be two records: '' and 'z'. My guess would be that you'd end up with extra '' records, and that the number of extras would depend on the buffer and data size. For small buffers and/or larger data sizes, you'd have more extra '' records. ...but all of this is just guessing without looking at code :)

        -matt
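The five-character-buffer example above can be checked directly against the regex engine, without any buffering code at all. A small sketch; `matched_text` is a hypothetical helper that just returns whatever the separator regex matched.

```perl
use strict;
use warnings;

# Hypothetical helper: the text matched by $rs in $text, or undef.
sub matched_text {
    my ($text, $rs) = @_;
    return $text =~ $rs ? substr($text, $-[0], $+[0] - $-[0]) : undef;
}

my $rs   = qr/.*(?=.)/s;
my $data = join '', 'a' .. 'z';

# Against only the first five characters, the match stops at 'abcd'
# and leaves 'e' behind, so "something left in the buffer" looks like success.
print "five-char buffer: ", matched_text(substr($data, 0, 5), $rs), "\n";

# Against the whole input, the same regex takes everything except 'z'.
print "whole input:      ", matched_text($data, $rs), "\n";
```

The first line prints abcd and the second prints abcdefghijklmnopqrstuvwxy, which is the gap between what a five-character buffer sees and what the separator ought to be.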