ChrisDolan (2855)

http://www.chrisdolan.net/

Journal of ChrisDolan (2855)

Thursday May 15, 2008
11:47 PM

A REALLY deep corner case

[ #36438 ]

Try out this test program in a Perl prior to 5.8.8:

use Test::More tests => 3;
my $line = "\x{4E00}();" . ' ';
is(length substr($line, 1, 1), 1);
is(length substr($line, 1, 4), 4);
is(length substr($line, 1, 1), 1);

You'd expect that substrings of length 1 are always length 1, right? On my Mac (perl5.8.6) it produces:

1..3
ok 1
ok 2
not ok 3
#   Failed test at utf8_substr.t line 5.
#          got: '4'
#     expected: '1'
# Looks like you failed 1 test of 3.

This should surprise you, unless perhaps you were aware of the UTF-8 length caching bug(s) that haunted much of the 5.8.x series.
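To see why an offset cache is involved at all, it helps to look at how the test string is laid out. The sketch below is only an illustration, not the Perl internals: it uses the core Encode module to show that the first character occupies more than one byte in UTF-8, so the interpreter has to translate character offsets into byte offsets for every substr call. The variable names are mine.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Same string as in the test program: one CJK character plus four ASCII ones.
my $line = "\x{4E00}();" . ' ';

my $chars = length $line;                   # counted in characters
my $bytes = length encode('UTF-8', $line);  # counted in UTF-8 bytes
print "characters: $chars, UTF-8 bytes: $bytes\n";

# Byte offset at which each character starts in the UTF-8 encoding.
for my $i (0 .. $chars - 1) {
    my $prefix      = substr($line, 0, $i);
    my $byte_offset = length encode('UTF-8', $prefix);
    printf "char %d starts at byte %d\n", $i, $byte_offset;
}
```

Character offset 1 is not byte offset 1 here, which is exactly the kind of bookkeeping a stale cache can get wrong.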

The test program at the top of this entry is a minimal reduction of a failure in the PPI test suite (see RT#35917 - charsets.t eats all available VM). The bug is triggered only when all of the following hold:

  • Perl 5.8.6 (and maybe 5.8.7?)
  • PPI above 1.201
  • Source code which uses Unicode in a bareword on the last line of the file, but not within the last 3 bytes of the end.
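If you want to know whether a given interpreter is affected, one option is to probe for the symptom rather than compare version numbers. This is a minimal sketch that reruns the same three substr calls as the test program; has_utf8_substr_bug is a hypothetical helper name of my own, and I am assuming that wrapping the calls in a sub does not change the behavior.

```perl
use strict;
use warnings;

# Hypothetical helper: true if this perl shows the symptom from the test
# program, i.e. a 1-character substr reporting the wrong length after a
# longer substr has been taken from the same string.
sub has_utf8_substr_bug {
    my $line   = "\x{4E00}();" . ' ';
    my $first  = length substr($line, 1, 1);
    my $second = length substr($line, 1, 4);
    my $third  = length substr($line, 1, 1);
    return ($first != 1 || $second != 4 || $third != 1) ? 1 : 0;
}

printf "perl %vd: %s\n", $^V,
    has_utf8_substr_bug() ? 'affected' : 'not affected';
```

A test file could call this up front and skip itself on an affected perl instead of looping forever.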

We probably would never have noticed, except that 5.8.6 is the default Perl for Mac OS X 10.4 (i.e., a popular point release) and a PPI side effect of the bug was an infinite loop with a memory leak.

I'm VERY grateful that the core Perl developers include people smart enough to find and fix subtle bugs in the Unicode implementation like this one.

  • "Source code which uses Unicode in a bareword on the last line of the file, but not within the last 3 bytes of the end." ...on a Tuesday, with a full moon, while sacrificing a chicken, and singing the words to Sweet Home Alabama, while standing on the right foot only...
    • “SCSI is *not* magic. There are fundamental technical reasons why it is necessary to sacrifice a young goat to your SCSI chain now and then.” —John Woods