Journal of jk2addict (4946)

Sunday March 20, 2005
05:53 PM

Perl/UTF Madness

[ #23760 ]

OK, this one has me stumped. I have a solution, but I want to know why I have to use it.

I've got a simple little method that returns the output from Locale::Currency::Format::currency_symbol:

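sub symbol {
        my ($code, $options) = @_;

        # Fall back to US dollars and the UTF symbol constant.
        $code ||= 'USD';
        $options ||= 'SYM_UTF';

        # $options holds a constant *name*; eval turns it into its value.
        eval '$options = ' . $options;

        return currency_symbol($code, $options);
}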

This method's return value ends up in AxKit output. Suppose I ask for the JPY (yen) symbol. Under Perl 5.6.1, I get the expected symbol, with no 'use utf8' in my module or in L::C::F. Everyone is happy.

Under 5.8.4, however, all I get is a stinking ?. After some tinkering, this fix makes the yen symbol show up under 5.8.4 too:

use utf8; ...
sub symbol {
        my ($code, $options) = @_;

        $code ||= 'USD';
        $options ||= 'SYM_UTF';

        eval '$options = ' . $options;

        my $symbol = currency_symbol($code, $options);
        # Force the internal UTF-8 representation (sets the UTF8 flag).
        utf8::upgrade($symbol);

        return $symbol;
}

Now, my question is for someone who is intimate with the Perl internals: WHY? :-)

On the advice of a fellow PerlMonk, I did a Devel::Peek dump of the scalar returned by currency_symbol. The first two are with no magic; the third is with the fix under 5.8.4:

--------------
5.6.1
--------------
SV = PV(0x14045dc) at 0x1409e8c
    REFCNT = 1
    FLAGS = (PADBUSY,PADMY,POK,pPOK,UTF8)
    PV = 0x142d9fc "\302\245"\0
    CUR = 2
    LEN = 3

--------------
5.8.4
--------------
SV = PV(0x44c3d64) at 0x10590f4
    REFCNT = 1
    FLAGS = (PADBUSY,PADMY,POK,pPOK)
    PV = 0x450ab24 "\245"\0
    CUR = 1
    LEN = 2

-----------------
5.8.4 w/ upgrade
-----------------
SV = PV(0x44f91dc) at 0x104d644
    REFCNT = 1
    FLAGS = (PADBUSY,PADMY,POK,pPOK,UTF8)
    PV = 0x4518aa4 "\302\245"\0 [UTF8 "\x{a5}"]
    CUR = 2
    LEN = 3
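
The difference between the last two dumps can be reproduced without Locale::Currency::Format at all. The following is only a sketch: inspect_string is a hypothetical helper name, and the "\xA5" literal stands in for whatever currency_symbol actually returns. It reports the logical length, the UTF8 flag (the same flag Devel::Peek shows in FLAGS), and the number of bytes stored internally (the CUR value in the dumps).

```perl
use strict;
use warnings;

# Hypothetical helper: report what Perl sees for a string.
sub inspect_string {
    my ($str) = @_;
    return {
        length       => length($str),                   # logical characters
        flagged      => utf8::is_utf8($str) ? 1 : 0,    # UTF8 flag on or off
        stored_bytes => do { use bytes; length($str) }, # debugging only
    };
}

my $native = "\xA5";          # yen sign held as one native byte
my $upgraded = $native;       # copy, so $native stays as it is
utf8::upgrade($upgraded);     # same character, internal UTF-8 storage

for my $pair ([native => $native], [upgraded => $upgraded]) {
    my ($label, $str) = @$pair;
    my $info = inspect_string($str);
    printf "%-8s length=%d flagged=%d stored_bytes=%d\n",
        $label, $info->{length}, $info->{flagged}, $info->{stored_bytes};
}
```

Both strings compare equal with eq and have length 1; only the storage and the flag differ, which matches the one-byte and two-byte PV values above.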

So, now what? Is my fix appropriate? I imagine some tinkering with the L::C::Format source would yield a fix as well, but it's not realistic to expect everyone to go through that.

Here's my guess:
http://www.perldoc.com/perl5.8.4/pod/perlunicode.html#Byte-and-Character-Semantics

"However, as an interim compatibility measure, Perl aims to provide a safe migration path from byte semantics to character semantics for programs. For operations where Perl can unambiguously decide that the input data are characters, Perl switches to character semantics. For operations where this determination cannot be made without additional information from the user, Perl decides in favor of compatibility and chooses to use byte semantics."

So it can't guess well in 5.8 with \x{}, so I have to give it the hint.
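
Another way to give the hint, sketched here under assumptions, is to encode the string explicitly at the output boundary instead of relying on its internal representation. to_utf8_octets is a hypothetical helper name, and whether AxKit wants characters or already-encoded octets at this point is an assumption that nothing above confirms.

```perl
use strict;
use warnings;

# Hypothetical helper: return the UTF-8 octets for a character string,
# regardless of how Perl happens to store that string internally.
sub to_utf8_octets {
    my ($text) = @_;
    my $octets = $text;       # copy, so the caller's string is untouched
    utf8::encode($octets);    # characters -> UTF-8 bytes, UTF8 flag off
    return $octets;
}

my $yen = "\xA5";             # stand-in for the currency_symbol result
my $octets = to_utf8_octets($yen);

# The yen sign becomes the two bytes 0xC2 0xA5 either way.
printf "%d byte(s): %s\n", length($octets),
    join(' ', map { sprintf '%02X', ord($_) } split //, $octets);
```

The result is the same whether or not the input was upgraded first, so the caller no longer depends on which representation a given Perl version picks.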

The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • Sounds like the Locale::Foo::Bar module is returning raw bytes not marked as UTF8, something like you'd get from doing chr(254) . chr(76). In 5.8.x UTF8-land, you want to do something like chr(2687) or "\x{2A12}". The difference in the latter two is that when you do it that way (in 5.8.x+) Perl _knows_ that the string is UTF8, so things like regexes, length(), substr() and so on operate at a character level, not a byte level.
    • That's what confuses me. Locale::Currency::Format appears to be returning the symbol from a private array of its own, using \x{00a5}. It should just work, but it doesn't in 5.8.4 ... 5.6.1 is just happy.

      In either case, utf8::upgrade fixes the problem for me, since I can't rely on the installs of Locale::Currency::Format to do the right thing... whatever that is in 5.8.4.
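
The byte-level versus character-level distinction raised in the first comment can be seen in a small sketch. The byte values here are the UTF-8 encoding of the yen sign rather than the chr() values quoted in that comment, and decode_utf8_bytes is a hypothetical helper name.

```perl
use strict;
use warnings;

# Hypothetical helper: interpret a byte string as UTF-8 and return the
# resulting character string. If the bytes are not valid UTF-8,
# utf8::decode leaves the copy unchanged.
sub decode_utf8_bytes {
    my ($bytes) = @_;
    my $chars = $bytes;       # copy, so the caller's bytes are untouched
    utf8::decode($chars);
    return $chars;
}

my $bytes = chr(0xC2) . chr(0xA5);        # two separate bytes
my $chars = decode_utf8_bytes($bytes);    # one character, U+00A5
my $wide  = "\x{2A12}";                   # code point above 255

printf "bytes: length=%d\n", length($bytes);
printf "chars: length=%d ord=0x%X\n", length($chars), ord($chars);
printf "wide:  length=%d flagged=%d\n", length($wide),
    utf8::is_utf8($wide) ? 1 : 0;
```

length() counts two items in the raw byte string but one in the decoded string, which is the character-level behaviour the comment describes.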