
All the Perl that's Practical to Extract and Report

Journal of Lecar_red (5694)

Tuesday March 08, 2005
05:59 PM

why UTF8 is wonderful

[ #23554 ]

I have been writing and supporting a localized and globalized web application that used (and still uses) Shift JIS for Japanese character encoding. Our newest web apps use many better technologies (Mason, which I really really really like) and UTF8 for character encoding inside both the middleware and the web app when handling text and filenames. We do a lot of file processing (up and down). With that amount of filename processing (including stripping off paths or renaming when filenames exceed length limits), I ran into many Shift JIS characters that required special processing to protect them.

We have a couple of basename routines (subs, functions, etc., depending upon the language) that detect and guard against Shift JIS slamming. Generally they follow this form:

    ## $path holds the raw Shift JIS bytes, $sep the directory sep
    my $bn  = $path; ## basename string (whole path if no sep is found)
    my $loc = 0;     ## current byte offset into $path
    while ($loc < length($path)) {
        my $chr = substr($path, $loc, 1);

        ## grab the basename if we match the
        ## directory sep
        if ($chr eq $sep) {
            $bn = substr($path, $loc+1);
            $loc++;
            next;
        }

        ## it's in the ascii range (or a half-width katakana,
        ## A1-DF) so it's a single byte character, only move
        ## forward one character
        if ($chr =~ /[\x00-\x7f\xa1-\xdf]/) {
            $loc++;
            next;
        }

        ## first is dbl byte, skip following character which
        ## is part of the dbl character
        $loc += 2;
    }

This basically walks the string byte by byte so that a '\' (5C) or '/' (2F) only counts as a directory separator when it is not the second byte of a double-byte character (815C, 825C, 835C... range), since Shift JIS uses 5C as the second byte of some characters. What a pain in the ass...
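
To make the 5C problem concrete, here is a small sketch (not our production code; the sample path is made up) using 表, whose Shift JIS bytes are 95 5C. It splits the byte string into one- and two-byte units with a regex, which is a different technique from the loop above but the same idea, and compares the result with a naive cut at the last backslash:

```perl
use strict;
use warnings;

# Hypothetical upload path as raw Shift JIS bytes: C:\docs\表.txt
# (表 is the two bytes 0x95 0x5C, so its second byte looks like '\').
my $path = "C:\\docs\\\x95\x5C.txt";

# Naive approach: drop everything up to the last 0x5C byte.
(my $naive = $path) =~ s/.*\\//s;

# Shift JIS aware approach: a lead byte (81-9F, E0-FC) takes its
# trail byte with it; everything else is a single-byte unit.
my @units = $path =~ /[\x81-\x9F\xE0-\xFC].|./gs;

# Find the last unit that is a real separator.
my $last_sep = -1;
for my $i (0 .. $#units) {
    $last_sep = $i if $units[$i] eq "\\";
}
my $safe = join '', @units[$last_sep + 1 .. $#units];

printf "naive: %s\n", unpack("H*", $naive);   # naive: 2e747874
printf "safe:  %s\n", unpack("H*", $safe);    # safe:  955c2e747874
```

The naive version eats the character and returns just ".txt", which is exactly the slamming the loop guards against.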

But the blessed UTF8 does not require any of that crap. Yeah! And now I can use (at least for Perl) standard modules. See:

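    use File::Basename; ## exports basename and fileparse_set_fstype

    my $path = shift;

    ## only change for mac or win. (unix ok)
    ## isWin and isMac are defined elsewhere (not shown)
    if (isWin()) {
        fileparse_set_fstype("MSWin32");
    } elsif (isMac()) {
        fileparse_set_fstype("MacOS");
    }

    ## default to unix since it rules...
    my $base = basename($path); ## not $b, which sort treats specially
    fileparse_set_fstype("Unix"); ## reset it to avoid later issues
    return($base);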
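
Why this is safe: in UTF8 every byte of a multi-byte character is 0x80 or above, so a 2F or 5C byte can only ever be a real '/' or '\'. A rough sketch to check that, assuming the path is handled as encoded bytes (the sample path is made up, and 表 and ソ are picked because their Shift JIS forms end in 5C):

```perl
use strict;
use warnings;
use Encode qw(encode);
use File::Basename qw(basename fileparse_set_fstype);

# U+8868 (表) and U+30BD (ソ), followed by a plain ascii extension.
my $name = encode('UTF-8', "\x{8868}\x{30BD}.txt");

# Look only at the bytes of the two multi-byte characters
# (everything except the 4-byte ".txt") and count ascii-range bytes.
my $multi     = substr($name, 0, length($name) - 4);
my $low_bytes = () = $multi =~ /[\x00-\x7F]/g;

# Parse a Windows style path, then restore whatever type was set before.
my $previous = fileparse_set_fstype("MSWin32");
my $base     = basename("C:\\docs\\" . $name);
fileparse_set_fstype($previous);

printf "ascii-range bytes inside the characters: %d\n", $low_bytes;   # 0
printf "basename intact: %s\n", ($base eq $name ? "yes" : "no");      # yes
```

Saving the return value of fileparse_set_fstype and restoring it is an alternative to hard-coding the reset to "Unix".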

Happiness... until the wrestling with content dispositions and UTF8, but that's another story for another day...
