Journal of mdxi (4658)

Thursday December 11, 2003
11:56 PM

Weird unicode corruption

[ #16297 ]

I have since chosen to come at this from a different direction, eliminating the problem, but I'm still confused by the behavior I saw:

I am reading in a Unicode text file, tab-delimited. Each line is split on tabs into an array, and then some of the array elements are processed individually. One of them may contain katakana or hiragana text (the on or kun readings of kanji). For my first whack at a database import, I just passed this field through untouched.
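Roughly, the read-and-split step looks like this (the file name, the :encoding layer, and the exact field position are filled in here for illustration, not copied from the real script):

use strict;
use warnings;

open my $fh, '<:encoding(UTF-8)', 'readings.txt'
    or die "can't open input: $!";

while ( my $line = <$fh> ) {
    chomp $line;
    my @word = split /\t/, $line;   # one record per line, fields tab-separated
    # $word[4] holds the comma-separated readings (position assumed here)
}
close $fh;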

Later I added this code, which checks to see what script the field is in:

# split the readings field on commas and classify each reading by script
my @readings = split /,/, $word[4];
foreach my $chunk (@readings) {
    $chunk =~ s/\s*//g;                   # strip any whitespace
    if ( $chunk =~ /\p{InKatakana}/ ) {   # katakana => on reading
        $on .= $chunk . ",";
    } else {                              # otherwise treat it as a kun reading
        $kun .= $chunk . ",";
    }
}

Nothing really unusual, but this code causes other fields containing non-Roman text to be corrupted. The corruption looks random, but it always hits the same fields on the same lines of the file. It is happening at the Perl level, not the Postgres level. Commenting out the code, or backing out to a revision before it was added, makes the problem go away.

I have absolutely zero idea what the actual issue could be. "Unicode" is the only obvious suspect. As mentioned above, I resolved and/or removed the issue by doing detection elsewhere, but if anyone knows WHY this happened, I'd sure like to know.
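If I ever poke at it again, the first thing I'd check is whether every field is consistently a decoded character string or still raw bytes; a quick (hypothetical) check would be something like:

use Encode qw(decode);

# print whether each field carries Perl's internal UTF8 flag
foreach my $i ( 0 .. $#word ) {
    printf "field %d: %s\n", $i,
        utf8::is_utf8( $word[$i] ) ? "character string" : "byte string";
}

# and, if the file were read without an :encoding layer, decoding by hand
# is one way to keep everything in character-string form:
# my $reading_field = decode( 'UTF-8', $word[4] );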
