(Editorial: Don't frontpage this post, editors. I write it down here to summarize my thought, wanting to get feedbacks from my trusted readers and NOT flame wars or another giant thread of utf-8 flag woes)
I can finally say I fully grok Unicode, UTF-8 flag and all that stuff in Perl just lately. Here are some analysis of how perl programmers understand Unicode and UTF-8 flag stuff.
(This post might need more code to demonsrate and visualize what I'm talking about, but I'd leave it as a homework for readers, or at least thing for me to do until YAPC::Asia if there's a demand for this talk
Level 1. "Take that annoying flag off, dude!"
They, typically web application developers, assume all data is encoded in utf-8. If they encounter some wacky garbaged characters (a.k.a Mojibake in Japanese) which they think is a perl bug, they just make an ad-hoc call of:
Encode::_utf8_off($stuff)
to take the utf-8 flag off and make sure all data is still in utf-8 by avoiding any possible latin-1-utf8 auto upgrades.
This is level 1. Unfortunately, this works okay, assuming their data is actually encoded only in utf-8 (like database is utf-8, web page is displayed in utf-8, the data sent from browsers is utf-8 etc.). Their app is still broken when they call things like length(), substr() or regular expression because the strings are not UTF-8 flagged and those functions don't work in Unicode semantics.
They can optionally use "use encoding 'utf-8'" or CPAN module encoding::warnings to avoid auto-upgrades at all, or catch such mistakes, or use Unicode::RecursiveDowngrade to turn off UTF-8 flag on complex data structure.
Level 2. "Unicode strings have UTF-8 flags. That's easy"
They make an extensive use of Encode module encode() and decode() to make sure all data in their app is UTF-8 flagged. Their app works really nice in Unicode semantics.
They sometimes need to deal with UTF-8 bytes in addition to UTF8-flagged strings. In that case, they use some hacky modules named ForceUTF8, or do things like
utf8::encode($_) if utf8::is_utf8($_)
to assume that "Unicode strings should have UTF-8 flagged, and those without the flag are assumed UTF-8 bytes."
This is Level 2. This is a straight upgrade from Level 1 and fixes some issues of Level 1 (string functions not working in Unicode semantics, etc.), but it's still too UTF-8 centric. They ignore why perl5 treats strings this way, and still hate SV Auto-upgrade.
To be honest I was thinking this way until, like early 2007. There's a couple of my modules on CPAN that accepts both UTF-8 flagged string and UTF-8 bytes, because I thought it'd be handy, but actually that breaks latin-1 strings if they're not utf-8 flagged, which is rare in UTF-8 centric web application development anyway, but still could happen.
I gradually have changed my mind when I talked about how JSON::Syck Unicode support is broken with Marc Lehmann, and when I read the tutorial by and attended to the Perl Unicode tutorial talk by Juerd Waalboer in YAPC::EU.
Level 3. "Don't bother UTF-8 flag"
They stop guessing if a variable is UTF-8 flagged or not. All they need to know is that a string is whether bytes or characters, by checking how a scalar variable is generated.
If it's bytes, use decode() to get Unicode strings. If it's characters, don't bother if it's UTF-8 flagged or not: if it's not flagged they'll be auto-upgraded thanks to Perl, so you don't need to know the internal representations.
So it's like a step back from Level 2. "Get back to the basic, and think why Perl 5 does this latin-1 to utf-8 auto upgrades."
If your function or module needs to accept strings that might be either characters or bytes, just provide 2 different functions, or some flag to explicitly set. Don't auto-decode bytes as utf-8 because that breaks latin-1 characters if they're not utf-8 flagged. Of course the caller of the module can call utf8::upgrade() to make sure, but it's just a pain and anti-perl5 way.
There's still a remaining problem with CPAN modules, though. Some modules return strings in some occasion and not otherwise. For instance, $c->req->param($foo) would return UTF-8 flagged string if Catalyst::Plugin::Unicode is loaded and bytes otherwise. And using utf8::is_utf8($_) here might cause bugs like described before.
Well, in C::P::Unicode example, actually not. using C::P::Unicode guarantees that parameters are all utf-8 flagged even if the characters contain latin-1 range characters. Not using the plugin guarantees the parametes are not flagged at all. So it's a different story.
(To be continued...)
Thanks (Score:2)
A quick way to upgrade yourself from level 2 to 3 is reading http://juerd.nl/site.plp/perluniadvice [juerd.nl]
Re: (Score:2)
So, I've been thinking there need to be some standards for CPAN modules to declare if it accept/return strings or bytes. (If they need to handle both)
For instance, HTML::Parser has an instance method called utf8_mode [cpan.org].
Another example (that triggered me to write this entry) is Catalyst's uri_for() method [cpan.org]. At some release the developers changed the implementation to accept only strings (UTF-8 flagged or not) in its %query_values hash.
Based on the complaints and patches made by
Strings or bytes (Score:2)
A trap is the UTF-8 string, which is a byte string representing characters, and has "the flag" off (which to perluninewbies is confusing because this flag is called UTF8). Compare this with the result of pack "N*", LIST, which is a byte
Re: (Score:2)
And also, I'm a bit afraid that you misunderstood what I meant with mention to bytes.pm. I didn't mean we should call "use bytes" in this situation to force string operations to be bytes-wise. Not at all.
I meant declaring "use bytes" *might be* a good way for programmers to tell the module authors "Hey I want this module to do what
Re: (Score:2)
Experience has show so far that the only workable way of supporting both byte strings and text strings in your function, is to provide two separate functions, or a mecha
Re: (Score:2)