I got roped into helping a friend of a friend extract some reports from their billing application's data. I was given a 26 MB data file to play with, and some digging with 'strings' turned up bits of data mixed in with the binary gibberish. I came up with the following code:
my $person = qr/[\x08\x09]([A-Z]{8})/;
my $lo = qr/[\x00-\x20]{1,3}/;
my $id = qr/OH${lo}7${lo}(\d{5})${lo}AA/;
my $date = qr/\x06(\d\d)(\d\d)(\d\d)/;
my $time = qr/\x08(\d\d):(\d\d):(\d\d)/;
my $money = qr/([1-9]\d+\.\d\d)/;
my $chars = qr/[\x20-\x7E]+?/;
my $desc = qr/\x17($chars)\x00/;
while ($data =~ /$person.*?$id.*?$date.*?$desc.*?$time.*?$time.*?$money/gs) {
    my ($p_id, $r_id, $d, $text, $t1, $t2, $cost)
        = ($1, $2, "$3/$4/$5", $6, "$7:$8:$9", "$10:$11:$12", $13);
    print "$p_id: $r_id -- $d $t1 -> $t2 -- [$text] \$$cost\n";
}
I was happy: I got back a bunch of records that looked sensible. The problem was that I wasn't getting some of the records that I could see in the file. Changing $person to qr/[\x08\x09](G[A-Z]{7})/ or qr/[\x08\x09](S[A-Z]{7})/ gives me a bunch of different records. But why would (S[A-Z]{7}) give different results than ([A-Z]{8})? I'm stumped.
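For anyone who wants to poke at this without the real file, here is a cut-down sketch on made-up data where a broad person pattern and a narrower one return different records. The data layout, the names, the "#101"-style ids and the find_records helper are all invented for illustration, and whether the same thing is going on in the real billing file is only a guess.

```perl
use strict;
use warnings;

# Made-up data (an assumption, not the real file): one stray 8-letter
# string that is not a record, followed by two record-like chunks.
my $data = "\x08GARBAGEX\x00\x00"
         . "\x08SMITHABC #101\x00"
         . "\x08SJONESAB #102\x00";

# Hypothetical helper: return "NAME:ID" for every /g match of a person
# pattern followed (lazily) by a simplified id of the form #digits.
sub find_records {
    my ($person) = @_;
    my @found;
    while ($data =~ /$person.*?#(\d+)/gs) {
        push @found, "$1:$2";
    }
    return @found;
}

my @broad  = find_records(qr/[\x08\x09]([A-Z]{8})/);
my @narrow = find_records(qr/[\x08\x09](S[A-Z]{7})/);

print "broad:  @broad\n";   # GARBAGEX:101 SJONESAB:102
print "narrow: @narrow\n";  # SMITHABC:101 SJONESAB:102
```

In this toy case the broad pattern latches onto the stray string first, the lazy .*? stretches across it to the next id, and because /g matches never overlap, the SMITHABC record is consumed and never reported under its own name.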
three regexen (Score:2)
I'm not sure if I'm misreading this, but it looks like you have three different regular expressions there:
Re:three regexen (Score:2)
Sorry for not being clear. I'd expect
Re:three regexen (Score:2)
If so, I'd be stumped too
-matt
Re:three regexen (Score:1)
If there is any possibility of accented 'national' characters (which there always is in unconstrained data), '\w' is much preferred to [A-Za-z] or [A-Z]/i.
I'd worry that some 'persons' might actually be shorter than 8 chars, or have spaces or lower case in some systems (van Helsing, etc.).
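A rough sketch of that trade-off, assuming the names have already been decoded into Perl character strings (raw bytes from a binary file would need decoding first, and the encoding here is unknown). The sample names and the "loose" pattern are invented for illustration. Note that \w also accepts digits and underscore, so it is looser than "letters only".

```perl
use strict;
use warnings;

binmode STDOUT, ':encoding(UTF-8)';

# Invented sample names; "\x{C9}" is an accented capital E (É).
my @names = ('SMITHABC', "\x{C9}CLAIRXX", 'van Hels', 'AB_12345');

my %result;
for my $name (@names) {
    # Original style: exactly eight ASCII capitals.
    my $strict = $name =~ /^[A-Z]{8}$/       ? 'yes' : 'no';
    # \w with Unicode rules: letters, digits and underscore.
    my $word   = $name =~ /^\w{8}$/u         ? 'yes' : 'no';
    # Hypothetical looser rule: 1 to 8 letters of any case, or spaces.
    my $loose  = $name =~ /^[\p{L} ]{1,8}$/u ? 'yes' : 'no';

    $result{$name} = "$strict/$word/$loose";
    print "$name: strict=$strict word=$word loose=$loose\n";
}
```

Which of these is right depends on what the billing application actually allows in that field, which only the data can tell you.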
What strings(1)
Bill
# I had a sig when sigs were cool
use Sig;
Dangers of .*? (Score:2)
It may not be what's happening in this case, but a common oversight in constructing regexes is to assume that