Lecar_red (5694)

Journal of Lecar_red (5694)

Tuesday March 08, 2005
05:59 PM

why UTF8 is wonderful

I have been writing/supporting a localized and globalized web application that used (still uses) shift jis for japanese character encoding. Our newest webapps use many better technologies (Mason, which I really really really like) and UTF8 for character encoding inside both the middleware and web app when handling text and filenames. We do a lot of file processing (up and down). With that amount of filename processing (including stripping off paths or renaming when filenames exceed length limits), I ran into many shift jis characters that required special processing to protect them.

We have a couple of basename (subs, functions, etc. depending upon language) that detect and guard against shift jis slamming. Generally they follow this form:

    my $bn  = $path; ## basename string
    my $loc = 0;     ## current byte offset into $path
    while ($loc < length($path)) {
        my $chr = substr($path, $loc, 1);

        ## grab the basename if we match the
        ## directory sep
        if ($chr eq $sep) {
            $bn = substr($path, $loc+1);
        }

        ## it's in the ascii range so it's a single byte
        ## character, only move forward one character
        if ($chr =~ /[\x00-\x7f]/) {
            $loc++;
        }
        ## first is dbl byte, skip following character which
        ## is part of the dbl character
        else {
            $loc++; $loc++;
        }
    }

This basically walks each byte looking for the directory separator, '/' (2F) or '\' (5C), while skipping over the magic hex patterns (815C, 825C, 835C... range), since shift jis uses 5C as the second byte of some double byte characters. What A Pain in the ass...
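
To make the problem concrete, here is a minimal sketch of a shift jis aware basename. The sub name sjis_basename is made up for illustration, and it assumes the path is a raw byte string where lead bytes fall in 81-9F or E0-FC (bytes in A1-DF are single byte half-width katakana). It is not the code from our application.

```perl
use strict;
use warnings;

## Hypothetical helper: return the part of a shift jis byte string
## after the last real directory separator.
sub sjis_basename {
    my ($path, $sep) = @_;
    my $start = 0;              ## offset where the basename begins
    my $loc   = 0;              ## current byte offset
    my $len   = length $path;

    while ($loc < $len) {
        my $byte = ord substr($path, $loc, 1);

        ## lead byte of a double byte character: skip it and its
        ## trail byte, which may be 5C but is not a separator
        if (($byte >= 0x81 && $byte <= 0x9F) || ($byte >= 0xE0 && $byte <= 0xFC)) {
            $loc += 2;
            next;
        }

        ## single byte character: remember the last separator seen
        $start = $loc + 1 if chr($byte) eq $sep;
        $loc++;
    }
    return substr($path, $start);
}

## 83 5C is one katakana character whose second byte looks like '\'
my $sjis_path = "C:\\dir\\\x83\x5C.txt";
my $sjis_base = sjis_basename($sjis_path, "\\");
```

A naive split on '\' would cut the 83 5C character in half and hand back ".txt"; the sketch keeps the pair together.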

But the blessed UTF8 does not require any of that crap. Yeah! And now I can use (at least for Perl) standard modules. See:

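    use File::Basename;

    my $path = shift;

    ## only change for mac or win. (unix ok)
    ## (isWin and isMac are our own helper subs)
    if (isWin) {
        fileparse_set_fstype("MSWin32");
    } elsif (isMac) {
        fileparse_set_fstype("MacOS");
    }

    ## default to unix since it rules...
    my $b = basename($path);
    fileparse_set_fstype("Unix"); ## reset it to avoid later issues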
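
Why this works: in UTF8 every byte of a multi-byte character is 80 or higher, so 2F and 5C only ever show up as real separators. A small sketch of that, assuming a Windows style path; the sub name utf8_basename and the sample filename are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);
use File::Basename qw(basename fileparse_set_fstype);

## Hypothetical wrapper: basename of a UTF8 encoded byte string,
## parsed with the given File::Basename fstype.
sub utf8_basename {
    my ($path_bytes, $fstype) = @_;
    fileparse_set_fstype($fstype);
    my $base = basename($path_bytes);
    fileparse_set_fstype("Unix");   ## reset it to avoid later issues
    return $base;
}

## katakana filename (starts with U+30BD, the character that is 83 5C in shift jis)
my $name       = "\x{30BD}\x{30D5}\x{30C8}.txt";
my $utf8_path  = encode('UTF-8', "C:\\docs\\$name");
my $utf8_base  = utf8_basename($utf8_path, "MSWin32");
```

No byte walking needed; the standard module only ever sees separators that really are separators.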

Happiness... until the wrestling with content dispositions and utf8, another story for another day..

Wednesday March 02, 2005
05:01 PM

Funky browser evilness (only for IE mac users though)

As I continue my wandering through the localization and globalization magical land (I want the mushrooms!), I've run into an issue with an older system that just underwent a very cool UI update.

It seems that Mac IE, while using shift-jis character encoding, does not like stylesheets that include rules that set western style fonts on form inputs. It basically causes the value inside the text input or select to revert to some other encoding (like MacRoman?). Funnily enough, when listing a select (a drop down of values), all the values get displayed as japanese characters. On the other hand, in a more predictable way (much like a horny American Bison, scroll to Bison section), setting the same rule for the body does nothing...

Here is an example of the offending rule:
.fInput { font-family: Arial, Helvetica, sans-serif; }

I've found that creating a special sheet (like the small bus) only for IE mac, with the font family of 'Osaka', fixes the problem.
.fInput { font-family: Osaka; }

Another one for the Perl (notice Randal) programmer to fix, not the UI/web designer (how does this happen?).

Friday February 25, 2005
03:46 PM

Figuring out if text is UTF8

Well, for the last couple of days I've struggled to figure out how to have Perl tell me whether the current string inside a scalar is actually UTF8 or something else.

The first thing I tried was using the internal 'utf8::valid' function. Well, according to this, everything (including values I knew were shift jis) was valid utf8. Later, I found (in some very useful documentation) that this will only tell you about how Perl is storing the value internally, not whether the value is actually UTF8. But thanks to a very nice entry in the perluniintro page, I learned that you can figure out if something is utf8 by simply decoding it. If it doesn't work, then the value is not utf8. The Encode module is useful for that.
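
The decode test can be sketched roughly like this. The sub name looks_like_utf8 is made up for illustration, and it assumes the scalar holds raw bytes (not an already decoded string). Treat it as a heuristic: plain ASCII always passes, and a few shift jis byte sequences also happen to be well-formed UTF8.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

## Hypothetical helper: true if the bytes decode cleanly as strict UTF-8.
sub looks_like_utf8 {
    my ($bytes) = @_;
    my $copy = $bytes;   ## decode with FB_CROAK may modify its argument
    my $ok = eval { decode('UTF-8', $copy, FB_CROAK); 1 };
    return $ok ? 1 : 0;
}

## the same two kanji, encoded two different ways
my $as_utf8 = "\xE6\x97\xA5\xE6\x9C\xAC";   ## UTF8 bytes
my $as_sjis = "\x93\xFA\x96\x7B";           ## shift jis bytes

my $utf8_ok = looks_like_utf8($as_utf8);
my $sjis_ok = looks_like_utf8($as_sjis);
```

The eval catches the croak from Encode, so a failed decode just comes back as false instead of killing the script.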

One other bit I've learned working with UTF8, shift JIS and other character encodings: it pays to keep test values as URI (or HTML) escaped strings, so you can unescape them before your test script (or main application code) messes with the string. Then you can escape them again on output to prevent problems with older (or basic) terms (xterm, my redhat 7.2 machine, etc.). Much better than having to pipe the output to less or xod. Also, it makes it easy to grab an escaped value from a logfile and then pass it as a command line arg to your test script (which unescapes it).
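
A minimal sketch of that round trip, using two small hand-rolled subs (uri_escape_bytes and uri_unescape_bytes are made-up names) so it runs without any extra modules; the URI::Escape module from CPAN does the same job. It assumes the values are raw byte strings.

```perl
use strict;
use warnings;

## Hypothetical helper: percent-escape every byte outside the
## unreserved set (letters, digits, '-', '_', '.', '~').
sub uri_escape_bytes {
    my ($s) = @_;
    $s =~ s/([^A-Za-z0-9\-_.~])/sprintf("%%%02X", ord($1))/ge;
    return $s;
}

## Hypothetical helper: turn %XX sequences back into raw bytes.
sub uri_unescape_bytes {
    my ($s) = @_;
    $s =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return $s;
}

## test value kept in escaped form, safe to paste into any terminal
my $escaped = "%83%5C.txt";
my $raw     = uri_unescape_bytes($escaped);   ## raw bytes for the code under test
my $again   = uri_escape_bytes($raw);         ## escaped again for printing/logging
```

The escaped form is pure ASCII, so it survives terminals, logfiles and command line args without any encoding surprises.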

Just a couple thoughts for the end of the week.

oops... I meant Perl ;)