Thursday June 12, 2003
04:41 PM
UTF-8 fun in Perl
A list of minor and not-so-minor annoyances I've met in the last ~3 days while converting the project I'm working on to UTF-8:
- Perl 5.6.x is just broken when it comes to Unicode support, which means that if you need Unicode support in Perl you must upgrade. And I thought I'd wait for 5.8.1 before upgrading. Naive me :( - I had to upgrade my own and the other developers' computers and a production server.
- It seems XS modules in 5.8.0 don't play well with UTF-8 strings. Examples: if you give a UTF-8 string to Text::CSV_XS, it returns a non-UTF-8 string; if Template Toolkit is configured to use Template::Stash::XS, UTF-8 strings may not work in templates.
- None of the Perl modules, templating systems, etc. that I know of do the right thing when URL-escaping UTF-8 strings.
- If you expect UTF-8 strings in query parameters and URLs, you have to wrap Apache::Request so that query parameters and URLs are transparently converted into Perl UTF-8 strings. It would be nice if such a feature were built into this module (I wonder if CGI, CGI::Simple, etc. have the same problem).
- $hash{bareword} doesn't work if the bareword is non-ASCII (which, as I understand it, should work when you have use utf8 in your Perl code). What is also interesting is that it doesn't produce any warnings or errors - it just silently returns undef.
- GraphViz doesn't work correctly with UTF-8 strings - it seems to generate a correct string in .dot format, but when the module calls IPC::Run the string gets corrupted while being passed to the 'dot' program.
... I expect to meet even more problems, annoyances and just glitches.
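To make the URL-escaping and query-parameter items above more concrete, here is a minimal sketch using only the core Encode module (Perl 5.8 or later). The helper names uri_escape_as_utf8 and uri_unescape_from_utf8 are made up for illustration; they are not part of Apache::Request, CGI or any other module, and the sketch assumes both sides have agreed that escaped bytes are UTF-8.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8 decode_utf8);

# Hypothetical helper: percent-escape a Perl character string as UTF-8 bytes.
sub uri_escape_as_utf8 {
    my ($text) = @_;
    my $bytes = encode_utf8($text);    # characters -> UTF-8 bytes
    # Escape every byte that is not an unreserved URI character.
    $bytes =~ s/([^A-Za-z0-9\-_.~])/sprintf("%%%02X", ord($1))/ge;
    return $bytes;
}

# Hypothetical helper: the reverse step for an incoming query parameter.
sub uri_unescape_from_utf8 {
    my ($escaped) = @_;
    (my $bytes = $escaped) =~ tr/+/ /; # form encoding uses '+' for a space
    $bytes =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return decode_utf8($bytes);        # UTF-8 bytes -> characters
}

# Four Cyrillic letters, written as code points to keep the source ASCII.
my $name    = "\x{43f}\x{435}\x{440}\x{43b}";
my $escaped = uri_escape_as_utf8($name);
my $back    = uri_unescape_from_utf8($escaped);

print "$escaped\n";    # prints %D0%BF%D0%B5%D1%80%D0%BB
```

A wrapper around a request object would presumably apply something like the second helper to every parameter value before handing it to the rest of the application.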
Stories, comments, journals, and other submissions on use Perl; are Copyright 1998-2006, their respective owners.
That's a good list but it's just the start... (Score:3, Interesting)
My current favourite is POSTing XML to a server using LWP. You send the XML, it looks fine from the client, but when the server reads it in, it's got the final few characters chopped off. Why? Because when LWP is calculating the Content-Length header, it's getting the length in characters, not bytes. So you have to make sure that you convert to bytes before you use LWP to send information across a network. Bah!
What's even more annoying about talking with LWP is the fact that the fix differs between 5.6 and 5.8. In 5.6.1, you use pack/unpack. In 5.8, I use encode_utf8.
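The character/byte mismatch is easy to see without any network traffic. The following is a small sketch for 5.8 using encode_utf8 from the core Encode module; the XML string is made up for illustration, and LWP itself is deliberately not loaded.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# A made-up XML body held as a Perl character string.
# \x{e9} is e-acute, \x{263a} is a smiley.
my $xml = qq{<?xml version="1.0" encoding="UTF-8"?><name>Andr\x{e9} \x{263a}</name>};

my $chars = length($xml);         # length in characters
my $body  = encode_utf8($xml);    # explicit UTF-8 bytes for the wire
my $bytes = length($body);        # length in bytes

# The two non-ASCII characters take 2 and 3 bytes in UTF-8, so a
# Content-Length computed from $chars would be 3 bytes short.
print "characters: $chars, bytes: $bytes\n";
```

With LWP, the idea would be to hand the already encoded $body to the request (for example through the content method of HTTP::Request), so that whatever length is computed from it is a byte count.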
The other real nuisance we've had is DBD::Pg. When it returns strings from the database, they don't have the UTF-8 bit turned on. So you end up with double-encoding errors when you try to output them. I've put a patch into DBD::Pg that lets you fix this for now, but it's not a particularly pretty solution (it should detect the database encoding and use that).
I have no idea what state the other DBD:: modules are in. When I was trying to get DBD::Pg working, I took a look at a couple and didn't see any calls to SvUTF8_on() or similar, so I suspect that it's not handled.
More generally, Perl 5.6.1 tended to hide Unicode problems from you because it didn't have the IO layers that 5.8 does. Because you can tell 5.8 you're going to be sending out UTF-8, you end up with all sorts of double-encoding bugs if you're not careful, most of which would not have happened under 5.6.1. This leads people to needlessly think that 5.8 is broken, when in fact it's the third-party software that is bust.
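The double encoding can be reproduced without a database. This is a minimal sketch for 5.8: $from_db is only a stand-in for a value that a driver returned as UTF-8 bytes without the UTF-8 flag, bytes_written is a hypothetical helper, and an in-memory filehandle plays the role of the UTF-8 output stream.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8);

# Stand-in for a driver result: the UTF-8 bytes of e-acute (0xC3 0xA9)
# without the UTF-8 flag, so Perl treats them as two separate characters.
my $from_db = "\xC3\xA9";

# Hypothetical helper: print a string through a UTF-8 output layer into
# an in-memory buffer and return the raw bytes that came out.
sub bytes_written {
    my ($string) = @_;
    my $buffer = '';
    open(my $out, '>:encoding(UTF-8)', \$buffer) or die "open failed: $!";
    print {$out} $string;
    close($out);
    return $buffer;
}

my $doubled = bytes_written($from_db);                # each byte is encoded again
my $correct = bytes_written(decode_utf8($from_db));   # decoded first, encoded once

printf "undecoded: %d bytes, decoded first: %d bytes\n",
    length($doubled), length($correct);               # 4 bytes versus 2 bytes
```

Decoding at the boundary where the data enters the program is roughly what a driver-level fix does for you; decode_utf8 here assumes the database really does hand back UTF-8.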
You mention getting UTF-8 into URIs not working. That's because there's no defined standard for doing so. Trying to get that working is a lost cause, I feel, until people agree on how it's going to work. At present, there's no way to indicate what character set is in use in a URI.
The next area I want to look at is getting UTF-8 correctly from a POST request (GETs are unlikely to work, given the above paragraph). I have no idea how to force a client to give us stuff in the correct character encoding. And then I have no idea how to make Apache::Request or CGI do the right thing. Sigh. More hard work, which thankfully I've been able to avoid until now.
-Dom
Re:That's a good list but it's just the start... (Score:2, Informative)
HTML 4.01 spec says [w3.org]:
If you enter non-ASCII chars both latest vers
Ilya Martynov (http://martynov.org/ [martynov.org])
Re:That's a good list but it's just the start... (Score:2)
-Dom
Re:That's a good list but it's just the start... (Score:2, Insightful)
Ilya Martynov (http://martynov.org/ [martynov.org])