I have been writing and supporting a localized and globalized web application that used (and still uses) Shift JIS for Japanese character encoding. Our newest web apps use much better technologies (Mason, which I really, really, really like) and UTF-8 as the character encoding inside both the middleware and the web app when handling text and filenames. We do a lot of file processing (uploads and downloads). With that amount of filename processing (including stripping off paths or renaming when filenames exceed length limits), I ran into many Shift JIS characters that required special processing to protect them.
We have a couple of basename routines (subs, functions, etc., depending upon the language) that detect and guard against Shift JIS slamming, where a byte inside a double-byte character gets mistaken for a path separator. Generally they follow this form:
my $bn  = $path; ## basename string; stays the whole path if no separator is found
my $loc = 0;     ## byte offset into $path ($path and $sep come from the caller)
while ($loc < length($path)) {
    my $chr = substr($path, $loc, 1);
    ## grab the basename if we match the
    ## directory sep
    if ($chr eq $sep) {
        $bn = substr($path, $loc + 1);
        $loc++;
        next;
    }
    ## it's in the ascii range so it's a single byte
    ## character, only move forward one character
    ## (half-width katakana, 0xA1-0xDF, are single bytes too)
    if ($chr =~ /[\x00-\x7f\xa1-\xdf]/) {
        $loc++;
        next;
    }
    ## first is dbl byte, skip following character which
    ## is part of the dbl character
    $loc += 2;
}
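This basically walks the string one character at a time so that a directory separator is only recognized when it stands on its own. '/' is 0x2F and '\' is 0x5C, and Shift JIS reuses 0x5C as the second byte of some double-byte characters (0x815C, 0x825C, 0x835C, and so on), so a naive search for '\' can land in the middle of a character. (0x2F never shows up as a second byte, so '/' is safe.) What a pain in the ass...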
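To make the problem concrete, here is a minimal sketch of what goes wrong and one alternative way around it. It assumes the bytes are Windows-style Shift JIS (CP932) and uses the katakana "ソ", which is 0x83 0x5C in that encoding. The sub names naive_basename and decoded_basename are made up for this example, and decoding to characters before splitting is a different technique from the byte walker above, not what our production code does.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

## Hypothetical helper: split the raw bytes on 0x5C without
## knowing anything about double-byte characters.
sub naive_basename {
    my ($bytes) = @_;
    return (split /\\/, $bytes)[-1];
}

## Hypothetical helper: decode to characters first, split on the
## separator, then encode the result back to CP932 bytes.
sub decoded_basename {
    my ($bytes) = @_;
    my $chars = decode('cp932', $bytes);
    my $base  = (split /\\/, $chars)[-1];
    return encode('cp932', $base);
}

## "C:\data\ソ.txt" as CP932 bytes; ソ is 0x83 0x5C
my $path = "C:\\data\\\x83\x5C.txt";

## Print as hex so the terminal encoding does not matter
print "naive:   ", unpack('H*', naive_basename($path)),   "\n"; ## 2e747874 (".txt")
print "decoded: ", unpack('H*', decoded_basename($path)), "\n"; ## 835c2e747874
```

The naive split cuts the filename in half at the second byte of "ソ", which is exactly the slamming the byte walker guards against.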
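But the blessed UTF-8 does not require any of that crap, since every byte of a multibyte UTF-8 character is 0x80 or above and can never be mistaken for '/' or '\'. Yeah! And now I can use standard modules (at least for Perl), such as File::Basename. See: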
use File::Basename;
my $path = shift;
## only change for mac or win. (unix ok)
## isWin() and isMac() are platform checks defined elsewhere
if (isWin()) {
    fileparse_set_fstype("MSWin32");
} elsif (isMac()) {
    fileparse_set_fstype("MacOS");
}
## default to unix since it rules...
my $base = basename($path);   ## named $base because sort() uses $b
fileparse_set_fstype("Unix"); ## reset it to avoid later issues
return($base);
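For anyone who wants to try this outside our code base, here is a small self-contained sketch of the same idea. The sub name basename_for is made up for this example, and instead of resetting to "Unix" it restores whatever type was active before, using the previous value that fileparse_set_fstype returns.

```perl
use strict;
use warnings;
use File::Basename qw(basename fileparse_set_fstype);

## Hypothetical helper: parse $path using the rules for $fstype,
## then put the previous filesystem type back.
sub basename_for {
    my ($path, $fstype) = @_;
    my $previous = fileparse_set_fstype($fstype);
    my $base     = basename($path);
    fileparse_set_fstype($previous);
    return $base;
}

print basename_for('C:\\dir\\file.txt', 'MSWin32'), "\n"; ## file.txt
print basename_for('/usr/local/file.txt', 'Unix'),  "\n"; ## file.txt
```

Parsing the Windows path with the "Unix" rules would return the whole string, since it contains no '/' at all.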
Happiness... until the wrestling with Content-Disposition headers and UTF-8, but that's another story for another day.
Well, for the last couple of days I've struggled to figure out how to have Perl tell me whether the string inside a scalar is actually UTF-8 or something else.
The first thing I tried was the built-in 'utf8::valid' function. Well, according to it, everything (including values I knew were Shift JIS) was valid UTF-8. Later, I found (in some very useful documentation) that it only reflects how Perl is storing the string internally, not whether the value is actually UTF-8. But thanks to a very nice entry in the perluniintro page, I learned that you can figure out if something is UTF-8 by simply decoding it. If the decode fails, then the value is not UTF-8. The Encode module is useful for that.
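Here is a minimal sketch of that decode test. The sub name looks_like_utf8 is made up for this example; decode, FB_CROAK and LEAVE_SRC come from the Encode module. Keep in mind it is only a heuristic: plain ASCII always passes, and a byte string in another encoding can occasionally happen to be well-formed UTF-8 too.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

## Hypothetical helper: returns 1 if the bytes decode cleanly as
## strict UTF-8, 0 otherwise. FB_CROAK makes decode() die on
## malformed input; LEAVE_SRC keeps it from modifying the source.
sub looks_like_utf8 {
    my ($bytes) = @_;
    my $ok = eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC); 1 };
    return $ok ? 1 : 0;
}

my $utf8_bytes = "\xE6\x97\xA5\xE6\x9C\xAC"; ## "日本" encoded as UTF-8
my $sjis_bytes = "\x93\xFA\x96\x7B";         ## "日本" encoded as Shift JIS

print "utf8 sample: ", looks_like_utf8($utf8_bytes), "\n"; ## 1
print "sjis sample: ", looks_like_utf8($sjis_bytes), "\n"; ## 0
```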
One other bit I've learned working with UTF-8, Shift JIS and other character encodings: it pays to keep test values as URI-escaped (or HTML-escaped) strings, so you can unescape them before your test script (or main application code) touches the string. You can then escape them again on output to prevent problems with older (or basic) terminals (xterm, my Red Hat 7.2 machine, etc.). Much better than having to pipe the output to less or xod. It also makes it easy to grab an escaped value from a logfile and pass it as a command-line arg to your test script (which unescapes it).
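A rough sketch of that workflow, using only core Perl. The helpers unescape_bytes and escape_bytes are hand-rolled for this example and assume byte strings; the URI::Escape module from CPAN provides uri_unescape and uri_escape if you would rather not roll your own. The default test value is the Shift JIS bytes for "日本" followed by ".txt".

```perl
use strict;
use warnings;

## Hypothetical helper: turn %XX sequences back into raw bytes.
sub unescape_bytes {
    my ($s) = @_;
    $s =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return $s;
}

## Hypothetical helper: escape every byte that is not an
## unreserved URI character, so the output is terminal-safe.
sub escape_bytes {
    my ($s) = @_;
    $s =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord($1))/ge;
    return $s;
}

## Take the escaped value from the command line (say, pasted from
## a logfile), or fall back to a built-in Shift JIS sample.
my $arg = @ARGV ? $ARGV[0] : '%93%FA%96%7B.txt';
my $raw = unescape_bytes($arg);

## ... exercise the code under test with $raw here ...

print escape_bytes($raw), "\n";
```

Run without arguments, this prints the sample back in escaped form, so nothing unprintable ever reaches the terminal.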
Just a couple thoughts for the end of the week.
oops... I meant Perl