
Journal of IlyaM (2933)

Thursday June 12, 2003
04:41 PM

UTF-8 fun in Perl

[ #12775 ]
List of minor and not-so-minor annoyances I've met in the last ~3 days while converting the project I'm working on to UTF-8:
  1. Perl 5.6.x is just broken when it comes to Unicode support, which means that if you need Unicode support in Perl you must upgrade. And I thought I'd wait for 5.8.1 before upgrading. Naive me :( - I had to upgrade my own and other developers' computers and a production server.
  2. It seems XS modules in 5.8.0 don't play well with UTF-8 strings. Examples: if you give a UTF-8 string to Text::CSV_XS it returns a non-UTF-8 string; if Template Toolkit is configured to use Template::Stash::XS then UTF-8 strings may not work in templates.
  3. None of the Perl modules, templating systems, etc. that I know of do the right thing when URL-escaping UTF-8 strings.
  4. If you expect UTF-8 strings in query parameters and URLs, you have to wrap Apache::Request to convert them transparently into Perl UTF-8 strings. It would be nice if such a feature were built into this module (I wonder if CGI, CGI::Simple, etc. have the same problem).
  5. $hash{bareword} doesn't work if the bareword is non-ASCII (which, as I understand it, should work when you have use utf8 in your Perl code). What is also interesting is that it doesn't produce any warnings or errors - it just silently returns undef.
  6. GraphViz doesn't work correctly with UTF-8 strings - it seems to generate a correct string in .dot format, but when the module calls IPC::Run the string gets corrupted while being passed to the 'dot' program.
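To illustrate item 4, here is a minimal sketch of the kind of conversion such a wrapper has to perform on each query value. It assumes Perl 5.8 with the core Encode module and assumes the client really sent UTF-8 bytes; decode_param is a hypothetical helper name, not part of Apache::Request or CGI.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: turn a percent-escaped query value into a Perl
# character string, assuming the escaped bytes are UTF-8.
sub decode_param {
    my ($raw) = @_;
    $raw =~ tr/+/ /;                               # form encoding uses '+' for space
    $raw =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;   # undo %HH escaping, giving bytes
    return decode('UTF-8', $raw);                  # bytes -> characters
}

my $name = decode_param('caf%C3%A9');
print length($name), "\n";   # 4 (characters, not the 5 bytes that were sent)
```

A real wrapper would apply something like this to every parameter name and value, and would need a policy for input that is not valid UTF-8.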

.. I expect to meet even more problems, annoyances and just glitches.

  • Unfortunately, Unicode support in Perl still isn't as good as it should be. I've had lots of problems too.

    My current favourite is POSTing XML to a server using LWP. You send the XML, it looks fine from the client, but when the server reads it in, it's got the final few characters chopped off. Why? Because when LWP calculates the Content-Length header, it counts the length in characters, not bytes. So you have to make sure that you convert to bytes before you use LWP to send information across a network. Bah!

    What's even more annoying about talking with LWP is that the fix differs between 5.6 and 5.8. In 5.6.1, you use pack/unpack. In 5.8, I use encode_utf8.
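The character/byte mismatch behind the truncated POST can be seen without LWP at all. This is a minimal sketch for 5.8 using the core Encode module; the XML fragment is made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# A character string containing two non-ASCII characters.
my $xml   = "<name>caf\x{e9} \x{263a}</name>";
# The bytes that actually travel over the wire.
my $bytes = encode_utf8($xml);

my $char_len = length($xml);     # counts characters
my $byte_len = length($bytes);   # counts bytes

print "characters: $char_len, bytes: $byte_len\n";   # characters: 19, bytes: 22
```

A Content-Length of 19 for a 22-byte body is exactly the "final few characters chopped off" symptom, so the byte string, not the character string, is what should be handed over as the request content.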

    The other real nuisance we've had is DBD::Pg. When it returns strings from the database, they don't have the UTF-8 bit turned on. So you end up with doubly encoded errors when you try to output them. I've put a patch into DBD::Pg that lets you fix this for now, but it's not a particularly pretty solution (it should detect the database encoding and use that).
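A minimal sketch of how that double encoding arises, and of decoding at the boundary as a workaround. It assumes 5.8 with Encode and uses a hand-built byte string as a stand-in for a value fetched from the database; no DBD::Pg call is involved.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8 decode_utf8);

# Stand-in for a fetched value: UTF-8 bytes without the UTF-8 flag.
my $from_db = encode_utf8("caf\x{e9}");   # 5 bytes

# Encoding it again treats each byte as a Latin-1 character.
my $double = encode_utf8($from_db);       # 7 bytes: the double-encoding bug

# Decoding first gives a real character string, which then encodes correctly.
my $chars = decode_utf8($from_db);        # 4 characters
my $good  = encode_utf8($chars);          # 5 bytes again

print length($double), " vs ", length($good), "\n";   # 7 vs 5
```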

    I have no idea what state the other DBD:: modules are in. When I was trying to get DBD::Pg working, I took a look at a couple and didn't see any calls to SvUTF8_on() or similar, so I suspect that it's not handled.

    More generally, perl 5.6.1 tended to hide unicode problems from you because it didn't have the IO layers that 5.8 does. Because you can tell 5.8 you're going to be sending out UTF-8, you end up with all sorts of double-encoding bugs if you're not careful, most of which would not have happened under 5.6.1. This leads people to needlessly think that 5.8 is broken, when in fact it's the 3rd party software that is bust.
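The effect of an output layer can be sketched with an in-memory filehandle on 5.8. The helper name write_via_layer is made up for this example; the point is only that the layer expects characters and will happily encode already-encoded bytes a second time.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# Hypothetical helper: write data through a UTF-8 encoding layer into an
# in-memory buffer and return the resulting bytes.
sub write_via_layer {
    my ($data) = @_;
    my $buf = '';
    open(my $fh, '>:encoding(UTF-8)', \$buf) or die "open: $!";
    print {$fh} $data;
    close($fh) or die "close: $!";
    return $buf;
}

my $chars = "caf\x{e9}";

my $right = write_via_layer($chars);                # encoded once by the layer
my $wrong = write_via_layer(encode_utf8($chars));   # encoded by hand, then again by the layer

print length($right), " vs ", length($wrong), "\n";   # 5 vs 7
```

Under 5.6.1 there was no such layer, so hand-encoded bytes went out untouched and this kind of mistake stayed hidden.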

    You mention getting UTF8 into URIs not working. That's because there's no defined standard for doing so. Trying to get that working is a lost cause, I feel, until people agree on how it's going to work. At present, there's no way to indicate what character set is in use in a URI.

    The next area I want to look at is getting UTF-8 correctly from a POST request (GETs are unlikely to work, given the paragraph above). I have no idea how to force a client to give us stuff in the correct character encoding. And then I have no idea how to make Apache::Request or CGI do the right thing. Sigh. More hard work, which thankfully I've been able to avoid until now.

    -Dom

    • You mention getting UTF8 into URIs not working. That's because there's no defined standard for doing so.

      HTML 4.01 spec says [w3.org]:

      We recommend that user agents adopt the following convention for handling non-ASCII characters in such cases: 1. Represent each character in UTF-8 (see [RFC2279]) as one or more bytes. 2. Escape these bytes with the URI escaping mechanism (i.e., by converting each byte to %HH, where HH is the hexadecimal notation of the byte value).
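The two steps quoted above translate into a few lines of Perl 5.8. This is only a sketch: escape_utf8 is a hypothetical helper name, and the set of characters left unescaped (ASCII letters, digits, and - _ . ~) is my assumption rather than something the quoted text specifies. Newer versions of URI::Escape also ship a uri_escape_utf8 function for the same job.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# Hypothetical helper following the two-step convention quoted above.
sub escape_utf8 {
    my ($chars) = @_;
    # Step 1: represent each character in UTF-8 as one or more bytes.
    my $bytes = encode_utf8($chars);
    # Step 2: convert each byte outside the assumed safe set to %HH.
    $bytes =~ s/([^A-Za-z0-9\-_.~])/sprintf('%%%02X', ord($1))/ge;
    return $bytes;
}

print escape_utf8("caf\x{e9}"), "\n";   # caf%C3%A9
```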

      If you enter non-ASCII chars both latest vers

      --

      Ilya Martynov (http://martynov.org/)