All the Perl that's Practical to Extract and Report

use Perl

Journal of ziggy (25)

Thursday January 12, 2006
12:42 PM

Standards and Regexes

[ #28327 ]

I'm working on a project of a certain vintage, and of a certain age, that uses upwards of five programming languages to get stuff done. Annoying, but nowhere near uncommon. (There's a story about how JScheme is included in the JDK sources, because the code to generate CORBA classes for Java is written in Scheme...)

Luckily for me, I had a feature to add that traces through most of those languages all at once: Haskell -> Tcl -> XSLT -> Tcl. (The Perly bits form the backend system, not the frontend runtime components.) Thankfully, it was a simple fix: add wordbreak barriers around a regex being output from a Haskell program and sent upstream to heaven knows where.

Should be simple, right? Just replace "..." with "\\b(...)\\b", or some variant thereof. Easy peasy.

Except that the \b metacharacter is Perl syntax, and the regex isn't going to be processed by Perl. At one point, I thought that this regex was going to be processed by a component written in C, using the GNU Regex library. Turns out that Perl, GNU Regex and PCRE all agree that \b is a word boundary. (POSIX regexes don't appear to know what a word boundary is...)
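
To make the intended change concrete, here is a minimal sketch in Python. Python is not one of the languages in this project; it is used only because its re module also treats \b as a word boundary. The helper name add_word_breaks is made up for illustration.

```python
import re

def add_word_breaks(pattern: str) -> str:
    """Wrap a regex in word-boundary assertions (hypothetical helper)."""
    # The parentheses keep alternations such as "cat|dog" inside the barriers.
    return r"\b(" + pattern + r")\b"

wrapped = add_word_breaks("cat|dog")

# In an ordinary (non-raw) string literal, "\b" is a single backspace
# character, which is why the replacement is spelled "\\b" there.
assert "\b" == "\x08"
assert "\\b" == r"\b"

print(wrapped)
print(bool(re.search(wrapped, "hot dog stand")))  # "dog" stands alone as a word
print(bool(re.search(wrapped, "dogma")))          # "dog" is only a prefix here
```

Note that the added parentheses form a capturing group, so any group numbers in the original pattern shift by one.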

Yet none of the standard regex magic was working. Tracing through the code, I discovered that Tcl's regex engine was the one being used (by way of XSLT; about as convenient as a direct flight from Sydney to New York by way of Mars).

Looking over Tcl's regex docs, it turns out that \b is a backspace character!

Because matching backspace characters is such a common operation within a regex, Tcl preserves the C-style escape for \b, and uses \y for word boundaries.

WHAT ON EARTH WERE THEY THINKING?
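
One way to guard against this, sketched in Python under the assumption that the code emitting the regex knows which engine will eventually consume it. The table covers only the flavors mentioned above, and the names WORD_BOUNDARY and wrap_for_flavor are hypothetical.

```python
# Word-boundary escape for each regex flavor discussed in this post.
WORD_BOUNDARY = {
    "perl": r"\b",
    "pcre": r"\b",
    "gnu": r"\b",
    "tcl": r"\y",  # in Tcl, \b is a backspace; \y is the word boundary
}

def wrap_for_flavor(pattern: str, flavor: str) -> str:
    """Wrap a regex in the word-boundary escape of the target engine."""
    try:
        boundary = WORD_BOUNDARY[flavor]
    except KeyError:
        # Refuse to guess for engines not listed in the table.
        raise ValueError(f"no known word-boundary escape for {flavor!r}")
    return f"{boundary}({pattern}){boundary}"

print(wrap_for_flavor("cat|dog", "perl"))
print(wrap_for_flavor("cat|dog", "tcl"))
```

Raising on an unknown flavor is a deliberate choice here: silently falling back to \b is exactly the mistake described above.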

  • And what about this "\b" business? This is a regex thing: in Perl regular expressions, "\b" normally matches a word boundary, but within a character class, it matches a backspace. A word boundary would make no sense as part of a class, so Perl is free to let it mean something else. The warnings in the first chapter about how a character class's "sublanguage" is different from the main regex language certainly apply to Perl (and every other regex flavor as well).

    So the Tcl people weren't wrong, just di

  • I believe \b as backspace significantly predates \b as word boundary. I know that \b means backspace in PDF. Most ASCII charts that show escapes use \b for backspace (0x08). \b is backspace in ANSI C.

    So, Perl is the usurper, not TCL. That's not to say that Perl is wrong, of course. :-)
    • Not wrong, just different. The problem is when we Perl folks try to hold up Perl as "the" regex standard and it is not. The MRE book explains that different language regex implementations are a wee bit different from each other.

      • Different and annoying.

        The problem is that Perl's regex syntax is adopted as the gold standard whenever another language/library needs to beef up its regex handling. There's a very large common subset shared between Perl, PCRE, GNU Regex, and probably some Java library. In general, this is a good thing, because it means that regexes generally become normalized, at least for the common cases. There should be one (common) way to find word boundaries, but all bets are off on variable capture and executing c
  • I don't know tcl except for snippets I've absorbed over the years, but doesn't tcl use strings rather extensively? And does that not mean that the regex is entered as a string that later gets used as a regex (rather than being parsed as a regex when tcl first analyses the characters of the script)? That would mean that usurping \b for word boundary within a regex would also usurp \b for backspace in all other strings too. So, at the very least, you'd have to write it as \\b. (Having the string be proces