Journal of avar (6604)

Friday June 22, 2007
11:49 PM

Partly-compatible regular expressions

[ #33585 ]
Since finishing my changes to what will become the pluggable regex
API in perl 5.10, I've been writing re::engines again: first one for
Plan 9, and then the PCRE one which audrey and yves started (but
didn't quite finish).

I based the new PCRE wrapper on the Plan 9 one and upgraded the
underlying PCRE library to 7.2. Aside from a bug in how split is
handled, it works for most of the cases where the Perl engine does.

Having wrapped PCRE, running Perl's own regex tests under it becomes
really easy. There are almost 1300 tests for the regex syntax in
t/op/regexp.t in perl core. Running these under re::engine::PCRE
reveals the following incompatibilities (and some bugs) between it and
Perl:

(?{}) and (??{}) tests fail (obviously). Getting at least (?{}) to
work might be possible with PCRE's callout mechanism, but I haven't
looked closely at that.

A few tests such as "bbbbXcXaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa" =~
/.X(.+)+X/ fail because PCRE keeps recursing while it backtracks and
runs into its internal MATCH_LIMIT. Using pcre_dfa_exec() instead of
pcre_exec() yields a match, but the DFA routine has different matching
semantics.

One test fails because PCRE treats [a-[:digit:]] as an invalid range
while Perl takes it to mean a character class matching 'a', '-' or a
[:digit:]. Perl is probably being too permissive in this case.
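
A minimal sketch of the Perl side of this, using only the core engine
(no re::engine::PCRE needed). The sample characters are mine, not from
the test suite, and the 'regexp' warning category is switched off
because Perl warns about the false range even though it accepts it:

```perl
use strict;
use warnings;

# Perl warns about the "false range" but still compiles the class,
# so silence that warning category for this one pattern.
my $class = do { no warnings 'regexp'; qr/^[a-[:digit:]]$/ };

# Illustrative inputs (not taken from Perl's test suite).
for my $char ('a', '-', '7', 'b') {
    printf "%-3s %s\n", "'$char'",
        ($char =~ $class ? 'matches' : 'does not match');
}
```

Under PCRE the same pattern should be rejected at compile time rather
than matched, which is why that test fails.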

"aba" =~ /^(a(b)?)+$/; say "$1-$2" will yield "a-b" under PCRE but
"a-" under Perl. That is, PCRE eats the inner (b) while perl goes with
the outer +. Both match the entire string.
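
To see what your own perl does with this case, a small sketch along
these lines should do. It checks definedness instead of interpolating
$2 directly, so an unset inner group doesn't trigger an
"uninitialized" warning; the variable names are just for illustration:

```perl
use strict;
use warnings;

# In list context a successful match returns ($1, $2).
my @captures = "aba" =~ /^(a(b)?)+$/;
my ($outer, $inner) = @captures;

printf "outer=%s inner=%s\n",
    $outer, (defined $inner ? $inner : '(unset)');
```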

Perl accepts curly quantifiers on (?!), e.g. /foo(?!bar){2}/, but PCRE
doesn't. I couldn't get Perl to do anything useful with that
though. Quantifying a group inside the assertion instead, as in
/foo(?!(?:bar){2})/, works in both engines and doesn't match "foo"
followed by "barbar".
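
One thing to watch when moving the quantifier inside the assertion: it
binds to the single atom before it, so bar{2} means "ba" plus two "r"s,
and a group is needed to repeat the whole word. A rough sketch of the
two spellings under stock Perl (the input strings are made up for the
example):

```perl
use strict;
use warnings;

my $grouped   = qr/foo(?!(?:bar){2})/;  # rejects "foo" followed by "barbar"
my $ungrouped = qr/foo(?!bar{2})/;      # {2} applies to "r": rejects "foo" + "barr"

for my $input (qw(foobar foobarbar foobarr)) {
    printf "%-10s grouped: %-3s ungrouped: %s\n", $input,
        ($input =~ $grouped   ? 'yes' : 'no'),
        ($input =~ $ungrouped ? 'yes' : 'no');
}
```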

PCRE does not match <<!>!>!>><>>!>!>!> against
^(<(?:[^<>]+|(?3)|(?1))*>)()(!>!>!>)$ but Perl does. I haven't looked
into why.

PCRE does not support (*FAIL) and (*F) which cause the pattern to
fail, nor does it support (*ACCEPT).
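
For anyone who hasn't played with these verbs yet, here is a short
sketch of what they do under Perl 5.10's own engine. The first pattern
is my own toy example; the second is the (*ACCEPT) example from perlre:

```perl
use strict;
use warnings;

# (*FAIL) forces the first alternative to fail, so only "b" can match.
my ($matched) = 'ab' =~ /(a(*FAIL)|b)/;

# (*ACCEPT) ends the match successfully on the spot, closing any
# capture groups that are still open; later groups stay unset.
my @groups = 'AB' =~ /(A(A|B(*ACCEPT)|C)D)(E)/;

print "FAIL demo matched: $matched\n";
print "ACCEPT demo groups: ",
    join(', ', map { defined $_ ? "'$_'" : 'undef' } @groups), "\n";
```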

Three tests try to match \x{85} against \R in a UTF-8 upgraded string
("\302\205") in a pattern that wasn't compiled with PCRE_UTF8. This
isn't a PCRE issue but an API usage problem in re::engine::PCRE. The
best solution is probably to upgrade all patterns and strings to
UTF-8 before calling pcre_compile/pcre_exec.
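
A sketch of what "upgraded" means here, seen from the Perl side only.
It uses the built-in utf8:: functions and does not show how
re::engine::PCRE passes buffers to pcre_compile/pcre_exec; the
variable names are mine:

```perl
use strict;
use warnings;

my $native   = "\x{85}";     # one character, stored as a single byte
my $upgraded = $native;
utf8::upgrade($upgraded);    # same character, now stored as UTF-8 internally

# The octets a C library would see if handed the upgraded buffer.
my $octets = $upgraded;
utf8::encode($octets);

printf "native matches \\R:   %s\n", ($native   =~ /\R/ ? 'yes' : 'no');
printf "upgraded matches \\R: %s\n", ($upgraded =~ /\R/ ? 'yes' : 'no');
printf "characters: %d, octets: %d\n", length($upgraded), length($octets);
```

The point is that the character count stays at one while the buffer
grows to two octets, so a library that isn't told the buffer is UTF-8
sees two unrelated bytes.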

PCRE accepts numeric keynames such as ^(?'0'ook)$, ^(?<0>ook)$,
^(?<1a>ook)$. These all match the literal string "ook" and set up the
named capture "0" or "1a". Perl does not currently accept named
buffers that start with a number.
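
The Perl half of that is easy to check without PCRE. In this sketch
the patterns are kept as strings and compiled inside eval, so a
compile error is caught instead of aborting the script; the third
pattern, with the name "a1", is a control I added:

```perl
use strict;
use warnings;

my %compiles;
for my $pattern (q{^(?<0>ook)$}, q{^(?<1a>ook)$}, q{^(?<a1>ook)$}) {
    # eval returns undef if qr// dies on an invalid group name.
    $compiles{$pattern} = defined(eval { qr/$pattern/ }) ? 1 : 0;
    printf "%-16s %s\n", $pattern,
        ($compiles{$pattern} ? 'compiles' : 'rejected');
}
```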

re::engine::PCRE doesn't support multiple named match buffers under
the same name while Perl does. At first I thought this was a PCRE
limitation but it turns out that I just didn't know about the
PCRE_DUPNAMES option:)
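
For reference, this is roughly what duplicate names look like under
Perl's own engine in 5.10. The group name "letter" and the input are
invented for the example:

```perl
use strict;
use warnings;

my ($first, $all);
if ('b' =~ /(?<letter>a)|(?<letter>b)/) {
    $first = $+{letter};            # leftmost *defined* group with that name
    $all   = [ @{ $-{letter} } ];   # every group with that name, in order
}

print "first defined: $first\n";
print "all groups: ",
    join(', ', map { defined $_ ? "'$_'" : 'undef' } @$all), "\n";
```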

So aside from inline eval, re::engine::PCRE is pretty much a drop-in
replacement for Perl's engine. And perhaps more significantly, PCRE's
compatibility can now be tested (and errors fixed) by running it
against Perl's own test suite.

To run Perl's tests on PCRE, get blead and re::engine::PCRE 0.10
(coming to a CPAN near you), build it and run:

    perl5.9.5 -Mblib t/perl/regexp.t

By default it skips the failing tests; these can currently be enabled
by commenting out line 86 in regexp.t:

    @pcre_fail{@pcre_fail} = ();
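
That line is a hash slice assignment that turns the list of failing
tests into a lookup table. A sketch of the idiom, with made-up test
numbers and with the assumption that the test loop skips a test when
its number exists as a key (I'm not reproducing the actual loop from
regexp.t here):

```perl
use strict;
use warnings;

# Hypothetical test numbers; the real list lives in regexp.t.
my @pcre_fail = (12, 87, 340);
my %pcre_fail;

# Hash slice assignment: every element becomes a key (with an undef value).
@pcre_fail{@pcre_fail} = ();

for my $test (11, 12) {
    print "test $test: ",
        (exists $pcre_fail{$test} ? 'skipped' : 'run'), "\n";
}
```

With the slice assignment commented out the hash stays empty, so
nothing is looked up as failing and every test runs.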