
Journal of miyagawa (1653)

Tuesday December 09, 2003
04:10 AM

Perl 5.8 and Unicode problems

[ #16232 ]
Below is what I asked autrijus, but if you have any suggestions on this issue, I'm open to them.

1. It's widely known that Jcode.pm has a Unicode mapping problem: Full-Width Tilde (U+FF5E) doesn't map well to euc-jp. This is due to a mistake in Unicode.org's own mapping table.

But recent versions of Encode.pm still have the problem:

% perl -MEncode -e 'print encode("euc-jp", "\x{ff5e}", Encode::FB_CROAK)'
"\x{ff5e}" does not map to euc-jp at /usr/lib/perl/5.8.2/Encode.pm line 149.

What does this mean? Grepping the ucm files shows:

% grep -i FF5E ucm/euc-jp.ucm
<UFF5E> \xA2\xB2 |3 # 1-2-18

Doesn't this mean that U+FF5E maps to \xA2\xB2 in euc-jp?
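
For reference, the enc2xs documentation describes the trailing |N in a ucm line as a fallback flag, with |3 marking an entry that is used for decoding only (bytes to Unicode) and not for encoding. That would be consistent with the error above. A minimal sketch for checking both directions on a given installation follows; results may differ between Encode versions, so no output is shown.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Direction 1: euc-jp bytes -> Unicode characters.
my $bytes   = "\xA2\xB2";
my $decoded = decode('euc-jp', $bytes);
printf "decode: %s\n",
    join ' ', map { sprintf('U+%04X', ord($_)) } split //, $decoded;

# Direction 2: Unicode character -> euc-jp bytes.
# LEAVE_SRC keeps encode() from modifying $src when a CHECK value is given.
my $src     = "\x{FF5E}";
my $encoded = eval {
    encode('euc-jp', $src, Encode::FB_CROAK() | Encode::LEAVE_SRC());
};
if (defined $encoded) {
    printf "encode: %s\n", unpack('H*', $encoded);
}
else {
    print "encode failed: $@";
}
```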

2. What's the best practice for developing applications in a multi-encoding environment, such as web+db+xml applications? Developing in such an environment gets messy when:

  1. TT templates are written in euc-jp or utf-8
  2. some data is fetched via XML (RSS) in utf-8 or euc-jp
  3. other data is stored in and fetched from MySQL in utf-8
  4. HTTP requests come from mobile phones in Shift_JIS

Concatenating non-Unicode strings with Unicode strings triggers UTF-8 auto-upgrading, and thus raw UTF-8 strings get corrupted.
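
The corruption can be reproduced with core modules only. In this sketch the variable names are made up for illustration: an undecoded UTF-8 byte string is concatenated with a character string, so each byte is treated as a Latin-1 character. Decoding the bytes first avoids that.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $raw_utf8 = "\xE3\x81\x82";   # undecoded UTF-8 bytes for U+3042
my $unicode  = "\x{3044}";       # a character string (U+3044)

# Each byte of $raw_utf8 is upgraded as if it were a Latin-1 character.
my $broken = $raw_utf8 . $unicode;

# Decoding first turns the three bytes into the single character U+3042.
my $fixed = decode('UTF-8', $raw_utf8) . $unicode;

printf "broken: %d chars, %s\n",
    length($broken), unpack('H*', encode('UTF-8', $broken));
printf "fixed:  %d chars, %s\n",
    length($fixed), unpack('H*', encode('UTF-8', $fixed));
# broken: 4 chars, c3a3c281c282e38184
# fixed:  2 chars, e38182e38184
```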

As one example, how do I tell Template-Toolkit that a template is written in euc-jp? It calls open() inside its own modules, so binmode or encoding.pm don't help unless you open the template files yourself and pass the filehandle explicitly, which is not what I do.
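
The open-it-yourself workaround mentioned above could look roughly like this, using a PerlIO encoding layer. The temporary file and the variable names are stand-ins for a real euc-jp template, and the step of handing the result to Template-Toolkit is left out.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write a tiny euc-jp "template" to a temporary file (stand-in for a real one).
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out;
print {$out} "\xA4\xA2 [% name %]\n";   # \xA4\xA2 is U+3042 in euc-jp
close $out or die "Cannot close $path: $!";

# Reopen it with an encoding layer so reads return decoded characters.
open my $in, '<:encoding(euc-jp)', $path or die "Cannot open $path: $!";
my $template_text = do { local $/; <$in> };
close $in;

printf "first char: U+%04X\n", ord($template_text);
```

From here on $template_text is a character string, so it can be combined with other decoded data without the upgrade problem.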

I tend to think there should be encoding layers in all data-stream-handling modules like DBI, Template-Toolkit, CGI.pm (or Apache::Request), etc. Am I thinking about this the right way?
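
Until the modules offer such layers, the same effect can be approximated by hand: decode every input at the boundary, work on character strings only, and encode once on output. A sketch under those assumptions, where the %input structure and its keys are hypothetical and 'shiftjis' is Encode's name for Shift_JIS:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical inputs: the same character (U+3042) arriving in three encodings.
my %input = (
    template => { encoding => 'euc-jp',   bytes => "\xA4\xA2" },
    rss      => { encoding => 'UTF-8',    bytes => "\xE3\x81\x82" },
    request  => { encoding => 'shiftjis', bytes => "\x82\xA0" },
);

# Decode at the boundary: everything becomes a character string.
my %text;
for my $source (keys %input) {
    $text{$source} = decode($input{$source}{encoding}, $input{$source}{bytes});
}

# Inside the application only character strings are combined.
my $page = join '', @text{qw(template rss request)};

# Encode once, at the output boundary, for the Shift_JIS client.
my $response = encode('shiftjis', $page);
printf "response bytes: %s\n", unpack('H*', $response);
```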

The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • You should speak with Andy Wardley about your needs for Template Toolkit
    as he is being sponsored to work on the next version (TT3 [tt2.org]) for a few months.
  • I've tried to solve a lot of these problems with AxKit. In some places we've succeeded, in others it's a little more complex, but you can still get results.

    Partly it's the beauty of XML - that it has been written to explicitly handle different encodings cleanly.
  • See this thread [template-toolkit.org], which describes some hacks that make it possible to use UTF-8 encoding in TT files. I guess something similar can be used to handle other charsets and convert them to UTF-8 on the fly. As for Apache::Request, to use it with UTF-8 I had to write my own wrapper [perl.org]. Again, the same idea can be used with other charsets.
    --

    Ilya Martynov (http://martynov.org/ [martynov.org])