Slash Boxes
NOTE: use Perl; is on undef hiatus. You can read content, but you can't post it. More info will be forthcoming forthcomingly.

All the Perl that's Practical to Extract and Report

use Perl Log In

Log In

[ Create a new account ]

miyagawa (1653)

  (email not shown publicly)
AOL IM: bulknews (Add Buddy, Send Message)

Journal of miyagawa (1653)

Wednesday February 20, 2008
05:28 AM

Three levels of Perl/Unicode understanding

[ #35700 ]

(Editorial: Don't frontpage this post, editors. I write it down here to summarize my thought, wanting to get feedbacks from my trusted readers and NOT flame wars or another giant thread of utf-8 flag woes)

I can finally say I fully grok Unicode, UTF-8 flag and all that stuff in Perl just lately. Here are some analysis of how perl programmers understand Unicode and UTF-8 flag stuff.

(This post might need more code to demonsrate and visualize what I'm talking about, but I'd leave it as a homework for readers, or at least thing for me to do until YAPC::Asia if there's a demand for this talk :))

Level 1. "Take that annoying flag off, dude!"

They, typically web application developers, assume all data is encoded in utf-8. If they encounter some wacky garbaged characters (a.k.a Mojibake in Japanese) which they think is a perl bug, they just make an ad-hoc call of:


to take the utf-8 flag off and make sure all data is still in utf-8 by avoiding any possible latin-1-utf8 auto upgrades.

This is level 1. Unfortunately, this works okay, assuming their data is actually encoded only in utf-8 (like database is utf-8, web page is displayed in utf-8, the data sent from browsers is utf-8 etc.). Their app is still broken when they call things like length(), substr() or regular expression because the strings are not UTF-8 flagged and those functions don't work in Unicode semantics.

They can optionally use "use encoding 'utf-8'" or CPAN module encoding::warnings to avoid auto-upgrades at all, or catch such mistakes, or use Unicode::RecursiveDowngrade to turn off UTF-8 flag on complex data structure.

Level 2. "Unicode strings have UTF-8 flags. That's easy"

They make an extensive use of Encode module encode() and decode() to make sure all data in their app is UTF-8 flagged. Their app works really nice in Unicode semantics.

They sometimes need to deal with UTF-8 bytes in addition to UTF8-flagged strings. In that case, they use some hacky modules named ForceUTF8, or do things like

utf8::encode($_) if utf8::is_utf8($_)

to assume that "Unicode strings should have UTF-8 flagged, and those without the flag are assumed UTF-8 bytes."

This is Level 2. This is a straight upgrade from Level 1 and fixes some issues of Level 1 (string functions not working in Unicode semantics, etc.), but it's still too UTF-8 centric. They ignore why perl5 treats strings this way, and still hate SV Auto-upgrade.

To be honest I was thinking this way until, like early 2007. There's a couple of my modules on CPAN that accepts both UTF-8 flagged string and UTF-8 bytes, because I thought it'd be handy, but actually that breaks latin-1 strings if they're not utf-8 flagged, which is rare in UTF-8 centric web application development anyway, but still could happen.

I gradually have changed my mind when I talked about how JSON::Syck Unicode support is broken with Marc Lehmann, and when I read the tutorial by and attended to the Perl Unicode tutorial talk by Juerd Waalboer in YAPC::EU.

Level 3. "Don't bother UTF-8 flag"

They stop guessing if a variable is UTF-8 flagged or not. All they need to know is that a string is whether bytes or characters, by checking how a scalar variable is generated.

If it's bytes, use decode() to get Unicode strings. If it's characters, don't bother if it's UTF-8 flagged or not: if it's not flagged they'll be auto-upgraded thanks to Perl, so you don't need to know the internal representations.

So it's like a step back from Level 2. "Get back to the basic, and think why Perl 5 does this latin-1 to utf-8 auto upgrades."

If your function or module needs to accept strings that might be either characters or bytes, just provide 2 different functions, or some flag to explicitly set. Don't auto-decode bytes as utf-8 because that breaks latin-1 characters if they're not utf-8 flagged. Of course the caller of the module can call utf8::upgrade() to make sure, but it's just a pain and anti-perl5 way.

There's still a remaining problem with CPAN modules, though. Some modules return strings in some occasion and not otherwise. For instance, $c->req->param($foo) would return UTF-8 flagged string if Catalyst::Plugin::Unicode is loaded and bytes otherwise. And using utf8::is_utf8($_) here might cause bugs like described before.

Well, in C::P::Unicode example, actually not. using C::P::Unicode guarantees that parameters are all utf-8 flagged even if the characters contain latin-1 range characters. Not using the plugin guarantees the parametes are not flagged at all. So it's a different story.

(To be continued...)

The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
More | Login | Reply
Loading... please wait.
  • That's a very clear explanation of what I've thought for some time, but was unable to phrase.

    A quick way to upgrade yourself from level 2 to 3 is reading [] ;-).
    • Thank *you* for the good tutorial.

      So, I've been thinking there need to be some standards for CPAN modules to declare if it accept/return strings or bytes. (If they need to handle both)

      For instance, HTML::Parser has an instance method called utf8_mode [].

      Another example (that triggered me to write this entry) is Catalyst's uri_for() method []. At some release the developers changed the implementation to accept only strings (UTF-8 flagged or not) in its %query_values hash.

      Based on the complaints and patches made by
      • Strings or bytes is not the right distinction, because both kinds are strings. I usually call them "text string" and "binary string", or "character string" and "byte string". Sometimes I call the former "Unicode string" to emphasize that all text strings are Unicode strings.

        A trap is the UTF-8 string, which is a byte string representing characters, and has "the flag" off (which to perluninewbies is confusing because this flag is called UTF8). Compare this with the result of pack "N*", LIST, which is a byte
        • Hm, just to clarify, I prefer to use characters vs. bytes like you say. If I sometimes use "strings" somewhere, it's just a slip of keystrokes, or I meant Unicode strings instead.

          And also, I'm a bit afraid that you misunderstood what I meant with mention to I didn't mean we should call "use bytes" in this situation to force string operations to be bytes-wise. Not at all.

          I meant declaring "use bytes" *might be* a good way for programmers to tell the module authors "Hey I want this module to do what
          • "use bytes;" is lexical: it cannot influence what a module does. I don't know who to thank for this, but I'm happy that at least my code won't be broken at a distance by the numerous uninformed and misinformed people who throw a "use bytes" at their code to replace one kind of (for them) vague behavior with another kind of vague behavior. :)

            Experience has show so far that the only workable way of supporting both byte strings and text strings in your function, is to provide two separate functions, or a mecha
            • Agreed in both: we should use two different functions to accept characters or bytes, and also would be useful to DWIM. :)