
Journal of miyagawa (1653)

Tuesday December 09, 2003
05:10 AM

Perl 5.8 and Unicode problems

[ #16232 ]
Below is what I asked autrijus, but if you have any suggestions on this issue, I'm open to them.

1. It's widely known that Jcode.pm has a Unicode mapping problem: Full-Width-Tilde (U+FF5E) doesn't map well to euc-jp. This is due to a mistake in Unicode.org's own mapping table.

But the recent Encode.pm still has the problem:

% perl -MEncode -e 'print encode("euc-jp", "\x{ff5e}", Encode::FB_CROAK)'
"\x{ff5e}" does not map to euc-jp at /usr/lib/perl/5.8.2/Encode.pm line 149.

What does it mean? Grepping the ucm files shows:

% grep -i FF5E ucm/euc-jp.ucm
<UFF5E> \xA2\xB2 |3 # 1-2-18

Doesn't it mean that UFF5E maps to \xA2\xB2 in euc-jp?
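
One possible reading, offered as an assumption rather than a confirmed answer: in the ucm format the trailing |3 flag is documented (in enc2xs) as a fallback for the encoding-to-Unicode direction only, so \xA2\xB2 would decode to U+FF5E without U+FF5E encoding back. A common workaround is to swap U+FF5E for WAVE DASH (U+301C), which euc-jp does map, before encoding. A minimal sketch; the helper name encode_eucjp_wavedash is hypothetical:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: replace FULLWIDTH TILDE (U+FF5E) with WAVE DASH
# (U+301C) before encoding, so the encoder sees a code point that the
# euc-jp table maps in the Unicode-to-bytes direction.
sub encode_eucjp_wavedash {
    my ($text) = @_;
    (my $copy = $text) =~ tr/\x{FF5E}/\x{301C}/;
    # FB_CROAK makes any remaining unmappable character fatal.
    return encode("euc-jp", $copy, Encode::FB_CROAK);
}

my $bytes = encode_eucjp_wavedash("\x{FF5E}");
print join(" ", map { sprintf "%02X", ord } split //, $bytes), "\n";   # A1 C1
```

The trade-off is that the substitution is lossy: data decoded again from euc-jp comes back as U+301C, not U+FF5E.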

2. What's the best practice for developing applications in a multi-encoding environment, like web+db+xml applications? Developing in such an environment gets messy for me when:

  1. TT Template is written in euc-jp or utf-8
  2. one data is fetched via XML (RSS) in utf-8 or euc-jp
  3. another data is stored to and fetched from MySQL in utf-8
  4. HTTP requests come from mobile phones in Shift_JIS

Concatenating non-Unicode strings with Unicode strings triggers UTF-8 auto-upgrading, and thus raw UTF-8 strings get corrupted.
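
The corruption can be reproduced in a few lines. This is a minimal sketch with made-up variable names: the raw UTF-8 bytes for U+3042 are concatenated with an already decoded string, Perl upgrades the bytes as if they were Latin-1 characters, and they come out double-encoded.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $bytes = "\xE3\x81\x82";            # raw UTF-8 bytes for U+3042
my $chars = decode("utf-8", $bytes);   # one character, flagged as Unicode

# Mixing the two: the three raw bytes are upgraded as U+00E3 U+0081 U+0082.
my $mixed   = $chars . $bytes;
my $corrupt = encode("utf-8", $mixed);
print length($corrupt), "\n";          # 9 bytes: 3 good + 6 double-encoded

# Decoding every input before concatenation avoids the problem.
my $clean = encode("utf-8", $chars . decode("utf-8", $bytes));
print length($clean), "\n";            # 6 bytes
```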

For example, at the very least, how do I tell Template-Toolkit that a template is written in euc-jp? It calls open() inside its own modules, so binmode or encoding.pm doesn't help unless you open the template files yourself and pass the filehandle explicitly, which is not my case.
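
For reference, the "open it yourself" route looks roughly like this. It is only a sketch of that workaround, not a fix inside Template-Toolkit: the helper name slurp_decoded and the temporary demo file are made up for illustration. The decoded text could then be handed over as a scalar reference, which Template's process() accepts in place of a file name.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: read a whole file through an encoding layer and
# return a character (Unicode) string.
sub slurp_decoded {
    my ($path, $encoding) = @_;
    open my $fh, "<:encoding($encoding)", $path
        or die "Cannot open $path: $!";
    local $/;                 # slurp mode
    my $text = <$fh>;
    close $fh;
    return $text;
}

# Demo: write a tiny euc-jp "template" to a temp file, then read it back.
my ($tmp_fh, $tmp_path) = tempfile(UNLINK => 1);
binmode $tmp_fh;
print {$tmp_fh} "[% name %] \xA4\xA2\n";   # \xA4\xA2 is U+3042 in euc-jp
close $tmp_fh;

my $template_text = slurp_decoded($tmp_path, "euc-jp");
# With Template-Toolkit installed, something like
#   $tt->process(\$template_text, $vars)
# would then see character data instead of raw euc-jp bytes.
```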

I tend to think there should be encoding layers to all data-stream-handling modules like DBI, Template-Toolkit, CGI.pm (or Apache::Request) etc. Am I thinking right here?
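
Until such layers exist in each module, the same idea can be applied by hand at every boundary: decode on the way in, work on character strings only, encode on the way out. A minimal sketch, assuming the four sources listed above; the %source_encoding table and the helper names are hypothetical:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical table: which encoding each data source delivers.
my %source_encoding = (
    template => "euc-jp",
    rss      => "utf-8",
    db       => "utf-8",
    request  => "shiftjis",
);

# Decode bytes from a named source into a character string.
sub from_source {
    my ($kind, $bytes) = @_;
    my $encoding = $source_encoding{$kind}
        or die "Unknown source: $kind";
    return decode($encoding, $bytes, Encode::FB_CROAK);
}

# Encode a character string for an output channel.
sub to_output {
    my ($encoding, $text) = @_;
    return encode($encoding, $text, Encode::FB_CROAK);
}

# The same letter (U+3042) arriving in three different encodings.
my $page = from_source(template => "\xA4\xA2")
         . from_source(db       => "\xE3\x81\x82")
         . from_source(request  => "\x82\xA0");

my $response = to_output("shiftjis", $page);   # back to the mobile client
```

Because every string inside the application is already decoded, concatenation never mixes bytes with characters.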

  • You should speak with Andy Wardley about your needs for Template Toolkit
    as he is being sponsored to work on the next version (TT3 [tt2.org]) for a few months.
  • I've tried to solve a lot of these problems with AxKit. In some places we've succeeded, in others it's a little more complex, but you can still get results.

    Partly it's the beauty of XML - that it has been written to explicitly handle different encodings cleanly.
  • See this thread [template-toolkit.org], which describes some hacks that allow using UTF-8 encoding in TT files. I guess something similar can be used to handle other charsets and convert them to UTF-8 on the fly. As for Apache::Request, to use it with UTF-8 I had to write my own wrapper [perl.org]. Again, the same idea can be used to make it work with other charsets.
    --

    Ilya Martynov (http://martynov.org/ [martynov.org])