ChrisDolan (2855)

http://www.chrisdolan.net/

Journal of ChrisDolan (2855)

Friday May 16, 2008
12:47 AM

A REALLY deep corner case

[ #36438 ]

Try out this test program in a Perl prior to 5.8.8:

use Test::More tests => 3;
my $line = "\x{4E00}();" . ' ';
is(length substr($line, 1, 1), 1);
is(length substr($line, 1, 4), 4);
is(length substr($line, 1, 1), 1);

You'd expect that substrings of length 1 are always length 1, right? On my Mac (perl5.8.6) it produces:

1..3
ok 1
ok 2
not ok 3
#   Failed test at utf8_substr.t line 5.
#          got: '4'
#     expected: '1'
# Looks like you failed 1 test of 3.

This should surprise you, unless perhaps you were aware of the UTF-8 length caching bug(s) that haunted much of the 5.8.x series.
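To check a particular perl without Test::More, the same three calls can be run and their lengths compared. This is only a sketch: the helper names `substr_lengths` and `looks_affected` are made up for illustration, and it is an assumption (not verified on 5.8.6) that wrapping the calls in a sub still triggers the bug on an affected perl.

```perl
use strict;
use warnings;

# Same string as the test above: one wide character followed by four ASCII ones.
my $line = "\x{4E00}();" . ' ';

# Hypothetical helper: lengths of the three substrings, in the original order.
sub substr_lengths {
    my ($str) = @_;
    my $first  = length substr($str, 1, 1);
    my $second = length substr($str, 1, 4);
    my $third  = length substr($str, 1, 1);
    return ($first, $second, $third);
}

# Hypothetical helper: true if the lengths differ from the expected 1, 4, 1.
sub looks_affected {
    my @got = @_;
    return "@got" ne '1 4 1';
}

my @lengths = substr_lengths($line);
printf "perl %s: lengths %s, %s\n",
    $], "@lengths", (looks_affected(@lengths) ? 'looks affected' : 'looks fine');
```

A perl with the fix should report lengths 1 4 1; the failure shown above corresponds to 1 4 4.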

The program above is a minimal reduction of a failure in the PPI test suite (see RT#35917 - charsets.t eats all available VM). The bug is triggered only when all of the following hold:

  • Perl 5.8.6 (and maybe 5.8.7?)
  • PPI above 1.201
  • Source code which uses Unicode in a bareword on the last line of the file, but not within the last 3 bytes of the end.
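A test suite that has to run on such a perl could gate the risky test on the interpreter version. A minimal sketch, assuming the cutoff is 5.8.8 as the test program above suggests; the helper name `perl_predates_fix` is hypothetical and nothing here is taken from PPI.

```perl
use strict;
use warnings;

# Hypothetical helper: true if a $]-style version number (e.g. 5.008006)
# is older than 5.8.8, the cutoff assumed from the post.
sub perl_predates_fix {
    my ($version) = @_;
    return $version < 5.008008 ? 1 : 0;
}

if (perl_predates_fix($])) {
    printf "perl %s predates 5.8.8: skip or special-case the Unicode test\n", $];
}
else {
    printf "perl %s is 5.8.8 or later\n", $];
}
```

A version check is cruder than probing for the symptom directly, but it does not depend on reproducing the exact trigger.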

We would probably never have noticed, except that 5.8.6 is the default Perl for Mac OS X 10.4 (i.e., a popular point release) and a PPI side effect of the bug was an infinite loop with a memory leak.

I'm VERY grateful that the core Perl developers include people smart enough to find and fix subtle bugs in the Unicode implementation like this one.

  • "Source code which uses Unicode in a bareword on the last line of the file, but not within the last 3 bytes of the end." ...on a Tuesday, with a full moon, while sacrificing a chicken, and singing the words to Sweet Home Alabama, while standing on the right foot only...
    • “SCSI is *not* magic. There are fundamental technical reasons why it is necessary to sacrifice a young goat to your SCSI chain now and then.” —John Woods