
Journal of gnat (29)

Friday April 11, 2003
11:18 AM

$RS as Regex

[ #11599 ]
I thought "how hard can this be, really?" and wrote some code to read records from a filehandle where the record separator is specified by a regular expression. How hard? Let me quote from my braindump into the Cookbook recipe:

The basic logic is simple: keep a buffer of text read from the file. Try to find a match for the record separator. If we can't find a match, read more text and try again. If we do get a match, then whatever came before the record separator was the record. Stop when you can't match and there's no more data to read into the buffer.
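A minimal sketch of that basic loop, before any of the special cases, might look like the following. This is not the code from the recipe: the name read_records_basic, the chunk-size parameter, and the choice to return all records as a list are assumptions made for illustration. It also assumes the separator never matches the empty string at the start of the buffer, and it dies rather than loop forever if that happens.

```perl
use strict;
use warnings;

# Hypothetical helper (not the author's code): read every record from $fh,
# where records are separated by text matching the regex $sep.
# $chunk is the number of bytes to read at a time (assumed default: 4096).
sub read_records_basic {
    my ($fh, $sep, $chunk) = @_;
    $chunk ||= 4096;
    my $buf = '';
    my @records;

    while (1) {
        if ($buf =~ $sep) {
            # This basic version cannot make progress on an empty match
            # at the very start of the buffer, so refuse to continue.
            die "separator matched the empty string at offset 0" if $+[0] == 0;

            # Whatever came before the separator is the record.
            push @records, substr($buf, 0, $-[0]);
            substr($buf, 0, $+[0]) = '';    # drop the record and the separator
            next;
        }

        # No match: read more text onto the end of the buffer and try again.
        my $n = read($fh, $buf, $chunk, length $buf);
        die "read failed: $!" unless defined $n;
        last if $n == 0;                    # no more data to read
    }

    # Anything left over is a final record with no trailing separator.
    push @records, $buf if length $buf;
    return @records;
}
```

Called on an in-memory filehandle (open my $fh, '<', \$data), with qr/,/ as the separator, this returns the comma-separated fields one by one. It accepts the first match it sees, so a separator that straddles two reads can be cut in half.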

The code is complicated, however, by special cases. When your regular expression matches the empty string, you should get back your data one character at a time. If you find a match for the record separator and have consumed all the data currently in the buffer, you can't be sure that there isn't more of the record separator waiting to be read from the filehandle. So if there's a successful match, you need to put the record and separator back into the buffer, read more data, and try again. And keep trying until you run out of data in the filehandle or you get a match that leaves data in the buffer.
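Here is one way those special cases could be sketched. Again, this is an illustration and not the recipe's code: next_record, find_separator, the $state hash, and the chunk-size parameter are all hypothetical names. It follows the rule described above literally (trust a match only if it leaves data in the buffer, or if the filehandle is exhausted), so it inherits whatever limits that rule has. Empty matches at the very start of the buffer are skipped, in the style of split, so an empty-matching separator hands back one character at a time.

```perl
use strict;
use warnings;

# Hypothetical helper (not the author's code): return the next record from
# $fh, or undef when the data runs out. $state is a hash ref that carries
# the buffer and the end-of-file flag between calls.
sub next_record {
    my ($fh, $sep, $state, $chunk) = @_;
    $chunk ||= 4096;
    $state->{buf} = '' unless defined $state->{buf};
    my $buf = \$state->{buf};

    while (1) {
        my ($start, $end) = find_separator($$buf, $sep);

        # Trust a match only if it leaves data in the buffer, or if
        # there is nothing more to read from the filehandle.
        if (defined $start && ($end < length $$buf || $state->{eof})) {
            my $record = substr($$buf, 0, $start);
            substr($$buf, 0, $end) = '';    # drop the record and the separator
            return $record;
        }

        if ($state->{eof}) {
            # No separator and no more data: hand back what is left, if any.
            return undef unless length $$buf;
            my $record = $$buf;
            $$buf = '';
            return $record;
        }

        # No match, or a match that used up the whole buffer: read more.
        my $n = read($fh, $$buf, $chunk, length $$buf);
        die "read failed: $!" unless defined $n;
        $state->{eof} = 1 if $n == 0;
    }
}

# Find the first separator match that ends after offset 0, so that an
# empty match at the start of the buffer cannot stall the reader.
# Returns (start, end) offsets, or an empty list if there is no such match.
sub find_separator {
    my ($text, $sep) = @_;
    return unless length $text;
    return ($-[0], $+[0]) if $text =~ $sep and $+[0] > 0;
    pos($text) = 1;                         # retry, starting one character in
    return ($-[0], $+[0]) if $text =~ /$sep/g;
    return;
}
```

A caller would loop with something like while (defined(my $rec = next_record($fh, qr/\n+/, \%state))) { ... }. For a separator such as /\n+/, the records come out the same whatever chunk size is used, because a run of newlines that ends exactly at the end of the buffer is never accepted until more data has been read.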

Tom's sanity-checking my code now, but when he's done I'd like to find someone willing (foolish?) to turn it into a CPAN module. I don't have time to do the distro framework, write more tests (I used Test::More when writing the code, and T::M truly rocks), or package it usefully. (I've taken a stab at a tied filehandle interface so that you can use <FH> even when $RS is a regex.)

Any volunteers?

--Nat
(you can tell it's a first draft because I bounce around between we and you, one of my weaknesses)

  • I'm wondering how this takes regex greediness into account.

    What if you have a regex

    qr/(?:01)*/

    and you have a file which is several MB of random zeros and ones. If you read in

    111101010

    into your buffer, how can you know that the zero at the end of your buffer isn't about to be followed by another one?

    It seems that if you have greedy regex elements, then you may have to slurp in the whole file to be able to tell whether you've matched the longest possible record separator. One could write even more pathal

    • Greediness: if we get a match that leaves nothing in the buffer, then read some more into the buffer and try again until we either exhaust the file (for a regexp like /.*/s) or have a match that leaves something in the buffer.

      Pathological cases will cause the entire file to be read into memory, but I don't see a way around that. If your record separator is /.*/s then you're saying to Perl "the entire file is my record separator". I don't see a way to handle this except by reading the whole file. That's

      • > if we get a match that leaves nothing in the buffer

        But the point I was attempting to make was that whether or not something is left in the buffer is not the best indication of whether or not the RS matched enough stuff. Maybe I'd have to see the actual code, but from the description of it, it sounds like it could behave differently depending on how input matched up with buffer size. You could use the same data and RS and get different results depending on your buffer size.

        my $record_sep = qr/(?:01)*