Stories
Slash Boxes
Comments
NOTE: use Perl; is on undef hiatus. You can read content, but you can't post it. More info will be forthcoming forthcomingly.

All the Perl that's Practical to Extract and Report

use Perl Log In

Log In

[ Create a new account ]

jozef (8299)

jozef
  (email not shown publicly)
http://jozef.kutej.net/
Jabber: jozef@kutej.net

Journal of jozef (8299)

Wednesday January 27, 2010
03:56 PM

Illegal character 0x1FFFF

[ #40136 ]
$ perl -le 'use warnings; my $x=chr(0x1FFFF)'
Unicode character 0x1ffff is illegal at -e line 1.

XML supports UTF-8 so I check for valid UTF-8 string and use it in XML if valid. Right? No!!!

There are some "non-illegal" characters that are perfect valid in UTF-8 (or even in the plain old ASCII), but are invalid for XML. The most obvious 0x00. Here is what W3C XML 1.0 specification say:

[2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */

I spend some time playing with it and the result is XML::Char->valid(). The dev Data::asXML is using it now. If you you want, have a look at the test suit and try to break it. :-)

The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
 Full
 Abbreviated
 Hidden
More | Login | Reply
Loading... please wait.
  • I'm sorry to disappoint you, but Perl_is_utf8_string can't be used to check for well-formed UTF-8. Perls utf8 encoding form is superset of Unicode and ISO/IEC 10646. Perls encoding form supports codepoints up to 2**64-1 and has no problems with encoded UTF-16 surrogates or any other permanently reserved codepoints.

  • Don’t look at the UTF8 flag. The UTF8 flag does not mean what you think it means. You can have a perfectly valid Unicode string that does not have its UTF8 flag set, and you can have a JPEG image in a string that does have its UTF8 flag set. The UTF8 flag is a lie. It should not have been called the UTF8 flag. There is no flag in Perl that means what you think the UTF8 flag means. Don’t look at the UTF8 flag.

    What you want to do is very simple:

    sub _is_valid_xml_string {
      $_[0] !~ /[^\x9\xA\