  How compatible are Perl-compatible regular express on 2007.06.22 23:49 avar

Submitted by avar on 2007.06.22 23:49
User Journal
avar writes "Since I finished my changes to what will become the pluggable regex
API in perl 5.10, I've been working on writing re::engines again: first
Plan 9, and then PCRE, which audrey and yves started (but didn't quite
finish).

I based the new PCRE wrapper on the Plan 9 one and upgraded the
underlying PCRE library to 7.2. Aside from a bug in how split is
handled, it works for most of the cases where the Perl engine does.

Having wrapped PCRE, running Perl's own regex tests under it becomes
really easy. There are almost 1300 tests for the regex syntax in
t/op/regexp.t in perl core. Running these under re::engine::PCRE
reveals the following incompatibilities (and some bugs) between it and
Perl:

(?{}) and (??{}) tests fail (obviously). Getting at least (?{}) to
work might be possible with PCRE's callout mechanism, but I haven't
looked closely at that.

A few tests, such as
"bbbbXcXaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa" =~ /.X(.+)+X/, fail because
PCRE recurses until it runs into its internal MATCH_LIMIT. Using
pcre_dfa_exec() instead of pcre_exec() yields a match, but the DFA
routine has different matching semantics.

One test fails because PCRE treats [a-[:digit:]] as an invalid range
while Perl takes it to mean a character class matching 'a', '-' or a
[:digit:]. Perl is probably being too permissive in this case.
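
The Perl side of this can be checked with a small sketch like the one below. It assumes a stock perl with the default regex engine; the variable names are only illustrative. Perl warns about a "False [] range" for this class, so the sketch switches that warning category off.

```perl
use strict;
use warnings;
no warnings 'regexp';   # silence the "False [] range" warning Perl emits here

# Perl reads the '-' in [a-[:digit:]] as a literal hyphen.
my $class = qr/^[a-[:digit:]]$/;

my %matched;
for my $char ('a', '-', '7', 'b') {
    $matched{$char} = $char =~ $class ? 1 : 0;
    print "'$char' ", ($matched{$char} ? "matches" : "does not match"), "\n";
}
```

Only 'b' should be reported as not matching, since it is neither 'a', '-' nor a digit.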

"aba" =~ /^(a(b)?)+$/; say "$1-$2" will yield "a-b" under PCRE but
"a-" under Perl. That is, PCRE eats the inner (b) while perl goes with
the outer +. Both match the entire string.
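
A minimal way to see what your own perl does with this pattern, assuming the default engine is in effect (the variable names are illustrative). Since $2 may be undefined after the match, the sketch substitutes an empty string before printing rather than relying on a particular result.

```perl
use strict;
use warnings;

# Same subject and pattern as in the text, run under Perl's default engine.
my $ok = "aba" =~ /^(a(b)?)+$/;
my ($outer, $inner) = ($1, $2);

# $2 may be undefined here, so substitute an empty string before printing.
printf "%s-%s\n", $outer, defined $inner ? $inner : '';
```

The outer group always ends up holding the last iteration, "a"; the interesting part is whether the inner group still carries the "b" from the first iteration.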

Perl accepts curly quantifiers on (?!), e.g. /foo(?!bar){2}/, but PCRE
doesn't. I couldn't get Perl to do anything useful with that,
though. /foo(?!bar{2})/ works in both engines, but there the {2}
applies only to the "r", so it refuses "foo" followed by "barr" rather
than "barbar".
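
To make the quantifier scope inside the lookahead concrete, here is a small sketch for Perl's default engine. The (?:bar){2} variant is not from the tests above; it is added here as an assumption about what a lookahead for "barbar" would need, and the variable names are illustrative.

```perl
use strict;
use warnings;

my $char_quantified  = qr/foo(?!bar{2})/;      # {2} binds to "r": rejects "foobarr"
my $group_quantified = qr/foo(?!(?:bar){2})/;  # {2} binds to "bar": rejects "foobarbar"

my %result;
for my $subject ("foobarr", "foobarbar") {
    $result{$subject} = [
        ($subject =~ $char_quantified  ? 1 : 0),
        ($subject =~ $group_quantified ? 1 : 0),
    ];
    print "$subject: char=$result{$subject}[0] group=$result{$subject}[1]\n";
}
```

Each pattern rejects exactly one of the two subjects, which shows that bar{2} and (?:bar){2} are not interchangeable.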

PCRE does not match <<!>!>!>><>>!>!>!> against
^(<(?:[^<>]+|(?3)|(?1))*>)()(!>!>!>)$ but Perl does. I haven't looked
into why.

PCRE does not support (*FAIL) and (*F) which cause the pattern to
fail, nor does it support (*ACCEPT).
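
For readers who haven't met these verbs, a short sketch of how they behave under Perl's own engine (perl 5.10 or later assumed; the subjects and variable names are made up for illustration):

```perl
use strict;
use warnings;

# (*FAIL) forces a failure at that point, so this pattern can never match.
my $fail_matched = "abc" =~ /a(*FAIL)/ ? 1 : 0;

# (*ACCEPT) ends the match successfully on the spot: the rest of the
# pattern is skipped and the enclosing group is closed right there.
my ($accepted_group, $accepted_text);
if ("abc" =~ /(a(*ACCEPT)b)c/) {
    ($accepted_group, $accepted_text) = ($1, $&);
}

print "FAIL matched: $fail_matched\n";
print "ACCEPT captured '$accepted_group', whole match '$accepted_text'\n";
```

Patterns relying on either verb have no direct counterpart under re::engine::PCRE as described here.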

PCRE accepts ^(?'0'ook)$, ^(?ook)$, ^(?ook)$ and so on which
all match the literal string "ook". Perl does not accept the garbage
and throws an error.

\R matches \x{85} under Perl but not under PCRE.

PCRE does not support multiple named match buffers under the same name
while Perl does. This is the most significant difference between the
two. Patterns like /(?<key>ook)|(?<key>eek)/ will work under Perl but
not under PCRE. That particular example might be rewritten as
/(?<key>(?:ook|eek))/, which would work under both.

While some alternations might be rewritten like that,
/(?<key>ook)(?<key>eek)/ cannot. Under the Perl engine ook and eek
could be accessed as $-{key}->[0] and $-{key}->[1].
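
A short sketch of the Perl behaviour described here, assuming perl 5.10 or later and the group name "key" used in $-{key} above (the subjects and variable names are illustrative):

```perl
use strict;
use warnings;

# Two groups share the name "key"; %- holds every capture for that name.
my (@all, $first);
if ("ookeek" =~ /(?<key>ook)(?<key>eek)/) {
    @all   = @{ $-{key} };   # one entry per group named "key", in pattern order
    $first = $+{key};        # %+ only exposes the leftmost defined one
    print "all: @all, first: $first\n";
}

# In an alternation only one branch takes part in the match.
my $alt;
if ("eek" =~ /(?<key>ook)|(?<key>eek)/) {
    $alt = $+{key};
    print "alternation: $alt\n";
}
```

The alternation case only ever needs %+, which is why it can be folded into a single named group; the concatenation case needs the per-name array in %-.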

So aside from the handling of named captures and inline eval, the two
are pretty compatible, and re::engine::PCRE is pretty much a drop-in
replacement for Perl's engine. Perhaps more significantly, PCRE's
compatibility can now be tested (and errors fixed) by running it
against Perl's own test suite.

To run Perl's tests on PCRE get blead and re::engine::PCRE 0.10
(coming to a CPAN near you), build it and run:

        perl5.9.5 -Mblib t/perl/regexp.t

By default it skips the failing tests. These can currently be enabled
by commenting out line 86 in regexp.t:

        @pcre_fail{@pcre_fail} = ();"