
All the Perl that's Practical to Extract and Report



Journal of Lecar_red (5694)

Friday February 25, 2005
03:46 PM

Figuring out if text is UTF8

[ #23390 ]

For the last couple of days, I've struggled to figure out how to get Perl to tell me whether the string inside a scalar is actually UTF8 or something else.

The first thing I tried was the built-in 'utf8::valid' function. According to it, everything (including values I knew were Shift JIS) was valid UTF8. Later, I found (in some very useful documentation) that it only tells you about how Perl is storing the string internally, not whether the value is actually UTF8. But thanks to a very nice entry in the perluniintro page, I learned that you can figure out if something is UTF8 by simply trying to decode it. If the decode fails, the value is not UTF8. The Encode module is useful for that.
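
Here is a minimal sketch of the decode-and-see-if-it-fails check, using the Encode module. The helper name looks_like_utf8 and the two sample byte strings are my own illustration, not something from perluniintro. Keep in mind that this only shows the bytes are well-formed UTF8: plain ASCII passes too, and a value in another encoding could pass by coincidence.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper: true if the byte string decodes as strict UTF-8.
sub looks_like_utf8 {
    my ($bytes) = @_;
    return 0 unless defined $bytes;
    # FB_CROAK makes decode() die on malformed input;
    # LEAVE_SRC stops decode() from modifying $bytes in place.
    my $ok = eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC); 1 };
    return $ok ? 1 : 0;
}

# Sample values (assumed for illustration): the same two characters
# encoded first as UTF-8, then as Shift JIS.
my $utf8_bytes = "\xE6\x97\xA5\xE6\x9C\xAC";
my $sjis_bytes = "\x93\xFA\x96\x7B";

for my $value ($utf8_bytes, $sjis_bytes) {
    print looks_like_utf8($value) ? "utf8\n" : "not utf8\n";
}
```

The strict 'UTF-8' encoding name is used here on purpose; the looser 'utf8' name accepts some sequences that strict UTF-8 rejects.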

One other thing I've learned working with UTF8, Shift JIS and other character encodings: it pays to keep test values as URI-escaped (or HTML-escaped) strings. You can unescape them before your test script (or main application code) touches the string, and escape them again on output to prevent problems with older (or basic) terminals (xterm, my Red Hat 7.2 machine, etc.). Much better than having to pipe the output to less or xod. It also makes it easy to grab an escaped value from a logfile and pass it as a command-line argument to your test script (which unescapes it).
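
A rough sketch of that escape/unescape round trip, assuming percent-style URI escaping. The two helper subs and the default test value are made up for illustration; they are hand-rolled so the script needs nothing beyond core Perl, though the URI::Escape module on CPAN offers uri_escape and uri_unescape for the same job.

```perl
use strict;
use warnings;

# Hypothetical helper: turn %XX sequences back into raw bytes.
sub unescape_bytes {
    my ($text) = @_;
    $text =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return $text;
}

# Hypothetical helper: percent-escape everything outside a small
# safe ASCII set, so the result prints cleanly on any terminal.
sub escape_bytes {
    my ($bytes) = @_;
    $bytes =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord($1))/ge;
    return $bytes;
}

# Take the escaped value from the command line (say, copied out of a
# logfile), or fall back to an assumed Shift JIS sample.
my $escaped = @ARGV ? $ARGV[0] : '%93%FA%96%7B';

my $bytes = unescape_bytes($escaped);   # raw bytes for the code under test

# ... the code being tested would work on $bytes here ...

print escape_bytes($bytes), "\n";       # terminal-safe output
```

Because everything printed is plain ASCII, the output can be pasted straight back in as the argument for the next run.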

Just a couple thoughts for the end of the week.

oops... I meant Perl ;)
