All the Perl that's Practical to Extract and Report

jtrammell (6222)

Journal of jtrammell (6222)

Tuesday August 01, 2006
11:04 AM

Regex for UTF-8 octets (from perlunicode)

[ #30499 ]
From "perldoc perlunicode":

   Code Points            1st Byte  2nd Byte  3rd Byte  4th Byte

     U+0000..U+007F       00..7F
     U+0080..U+07FF       C2..DF    80..BF
     U+0800..U+0FFF       E0        A0..BF    80..BF
     U+1000..U+CFFF       E1..EC    80..BF    80..BF
     U+D000..U+D7FF       ED        80..9F    80..BF
     U+D800..U+DFFF       ******* ill-formed *******
     U+E000..U+FFFF       EE..EF    80..BF    80..BF
    U+10000..U+3FFFF      F0        90..BF    80..BF    80..BF
    U+40000..U+FFFFF      F1..F3    80..BF    80..BF    80..BF
   U+100000..U+10FFFF     F4        80..8F    80..BF    80..BF

And the equivalent regex:

my $utf8_char = qr{
        (?:
                                                [\x00-\x7f]  #   U+0000 .. U+007F
        |
                                    [\xc2-\xdf] [\x80-\xbf]  #   U+0080 .. U+07FF
        |
                               \xe0 [\xa0-\xbf] [\x80-\xbf]  #   U+0800 .. U+0FFF
        |
                        [\xe1-\xec] [\x80-\xbf] [\x80-\xbf]  #   U+1000 .. U+CFFF
        |
                               \xed [\x80-\x9f] [\x80-\xbf]  #   U+D000 .. U+D7FF
        |
                        [\xee-\xef] [\x80-\xbf] [\x80-\xbf]  #   U+E000 .. U+FFFF
        |
                   \xf0 [\x90-\xbf] [\x80-\xbf] [\x80-\xbf]  #  U+10000 .. U+3FFFF
        |
            [\xf1-\xf3] [\x80-\xbf] [\x80-\xbf] [\x80-\xbf]  #  U+40000 .. U+FFFFF
        |
                   \xf4 [\x80-\x8f] [\x80-\xbf] [\x80-\xbf]  # U+100000 .. U+10FFFF
        )
}x;

This has proven useful as I search for errant Latin-1 characters embedded in some files.
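
For the curious, here is a minimal sketch of how such a search might look. The names $utf8_char and bad_octet_offsets are made up for this example, and the approach (walk the string with \G, record every octet that cannot start a well-formed sequence) is just one way to do it. It assumes the data is read as raw bytes; on a string that has already been decoded to characters the pattern would not mean the same thing.

```perl
use strict;
use warnings;

# One well-formed UTF-8 sequence, following the perlunicode table.
# ($utf8_char is an illustrative name, not anything from perlunicode.)
my $utf8_char = qr{
    (?:
        [\x00-\x7f]
      | [\xc2-\xdf] [\x80-\xbf]
      | \xe0 [\xa0-\xbf] [\x80-\xbf]
      | [\xe1-\xec] [\x80-\xbf] [\x80-\xbf]
      | \xed [\x80-\x9f] [\x80-\xbf]
      | [\xee-\xef] [\x80-\xbf] [\x80-\xbf]
      | \xf0 [\x90-\xbf] [\x80-\xbf] [\x80-\xbf]
      | [\xf1-\xf3] [\x80-\xbf] [\x80-\xbf] [\x80-\xbf]
      | \xf4 [\x80-\x8f] [\x80-\xbf] [\x80-\xbf]
    )
}x;

# Hypothetical helper: return the byte offsets of every octet that is
# not part of a well-formed UTF-8 sequence. Expects a raw byte string.
sub bad_octet_offsets {
    my ($bytes) = @_;
    my @offsets;
    pos($bytes) = 0;
    while (pos($bytes) < length $bytes) {
        # Skip over a run of well-formed sequences; /c keeps pos() on failure.
        next if $bytes =~ /\G$utf8_char+/gc;
        # The octet at pos() cannot start a valid sequence: record it, step past.
        push @offsets, pos($bytes);
        pos($bytes) = pos($bytes) + 1;
    }
    return @offsets;
}

# Usage sketch: report suspect lines in the files named on the command line.
for my $file (@ARGV) {
    open my $fh, '<:raw', $file or die "$file: $!";
    while (my $line = <$fh>) {
        my $line_no = $.;
        my @bad = bad_octet_offsets($line);
        next unless @bad;
        printf "%s:%d: suspect octet(s) at offset(s) %s\n",
            $file, $line_no, join(',', @bad);
    }
    close $fh;
}
```

A lone Latin-1 byte such as \xe9 (e with an acute accent) followed by ASCII gets flagged because \xe9 would have to be followed by two continuation octets to be valid UTF-8. Note that this only catches Latin-1 bytes that happen to break UTF-8 well-formedness; a pair of Latin-1 characters that coincidentally forms a valid two-octet sequence would slip through.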
