All the Perl that's Practical to Extract and Report

  • Seeing as you didn't ask for it. ;-)

    Always use UTF-8 if you possibly can. It can represent (more or less) everything the other encodings can, and it's reliably detectable: text in some other 8-bit encoding rarely happens to be valid UTF-8.

    If you're looking for interesting encodings, I'd recommend checking out one of the Shift-JIS [wikipedia.org] variants. Just for weirdness. Personally, I've little experience with non-Western encodings.

    For more concrete use cases to cover with encoding, you should look at:

    • query parameters coming in from browsers
    • POSTed form parameters coming in from a browser.
    • What e
    • Can you tell me more about the command line and environment variable problem? I think I'll have the other ones covered, but I'd like to know how you solved that one. I don't recall reading anything about how Perl will treat those.

      • Personally, I'd just pipe those things through Encoding::FixLatin and enjoy the utf8ness it emits :-)
        • Personally, I'd just pipe those things through Encoding::FixLatin and enjoy the utf8ness it emits :-)

          Interesting. I could have used that module a couple of years ago. Since then I've been using this trick to convert UTF-8-or-CP1252 byte strings to UTF-8 text strings:

              use Encode qw(decode);
              use Encode::Guess;

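              my $line = <>;
              # guess_encoding returns an object on success, an error string on failure
              my $utf8 = guess_encoding($line, 'utf8');
              # valid UTF-8 is decoded as such; anything else is treated as CP1252
              $line = ref $utf8 ? decode('utf8', $line) : decode('cp1252', $line);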
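
          A variation on the same idea, for anyone who'd rather skip Encode::Guess: attempt a strict UTF-8 decode and fall back to CP1252 if it croaks. This is only a sketch. The helper name decode_utf8_or_cp1252 is made up for illustration, and it assumes the input can only ever be one of those two encodings.

          ```perl
          use strict;
          use warnings;
          use Encode qw(decode FB_CROAK LEAVE_SRC);

          # Hypothetical helper: strict UTF-8 first, CP1252 as the fallback.
          sub decode_utf8_or_cp1252 {
              my ($bytes) = @_;
              # FB_CROAK makes decode die on malformed UTF-8;
              # LEAVE_SRC keeps $bytes intact for the fallback.
              my $text = eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC) };
              return defined $text ? $text : decode('cp1252', $bytes);
          }

          my $from_utf8   = decode_utf8_or_cp1252("\xC4\x80" . "dam");  # valid UTF-8
          my $from_cp1252 = decode_utf8_or_cp1252("caf\xE9");           # not valid UTF-8
          ```

          The first call yields a four-character string starting with U+0100; the second yields "café" with U+00E9 taken from CP1252.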

          http://search.cpan.org/perldoc?POE::Component [cpan.org]

      • It's controlled through the -C flag (see perlrun). Here's an example of using U+0100 (Ā) on the command line. The file contains the word "Ādam".

        $ mate ~/Desktop/adam.txt
        $ adam=$(<~/Desktop/adam.txt)
        $ xxd ~/Desktop/adam.txt
        0000000: c480 6461 6d0a                           ..dam.
        $ perl -MDevel::Peek -le 'Dump $ARGV[0]' $adam
        SV = PV(0x801168) at 0x800954
          REFCNT = 1
          FLAGS = (POK,pPOK)
          PV = 0x2044f0 "\304\20
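
        To make the difference concrete without relying on a particular shell, here is a small sketch that starts from the same five bytes xxd shows above (minus the newline) and decodes them by hand. As I read perlrun, this is roughly what the A letter of -C asks perl to do for each element of @ARGV; treat that reading as an assumption rather than a spec quote.

        ```perl
        use strict;
        use warnings;
        use Encode qw(decode);

        # The bytes c4 80 64 61 6d, i.e. "Ādam" encoded as UTF-8.
        my $raw = "\xC4\x80" . "dam";

        # Decode the UTF-8 bytes into a character string.
        my $text = decode('UTF-8', $raw);

        printf "bytes: %d, characters: %d, first code point: U+%04X\n",
            length($raw), length($text), ord($text);
        ```

        This prints "bytes: 5, characters: 4, first code point: U+0100".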

      • Command line arguments come in as raw bytes. So you have to detect the codeset of the user's environment and decode if necessary. Roughly like this:

        use I18N::Langinfo qw(langinfo CODESET);
        use Encode qw(decode);
        my $codeset = langinfo(CODESET);
        for (@ARGV) { $_ = decode $codeset, $_ }

        Likewise for environment variable values.
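
        A rough sketch of the environment variable half, under a couple of assumptions: decode_values is a hypothetical helper, and the codeset is hard-coded to UTF-8 so the example behaves the same everywhere. In a real script you would pass langinfo(CODESET) and %ENV instead, as in the @ARGV loop above.

        ```perl
        use strict;
        use warnings;
        use Encode qw(decode);

        # Hypothetical helper: decode every value of a hash of byte strings
        # and return the result as a new list of key/value pairs.
        sub decode_values {
            my ($codeset, %bytes) = @_;
            return map { $_ => decode($codeset, $bytes{$_}) } keys %bytes;
        }

        # Stand-in for %ENV so the example does not depend on your shell.
        my %sample_env = (NAME => "\xC4\x80" . "dam");

        my $codeset = 'UTF-8';   # assumption; normally langinfo(CODESET)
        my %decoded = decode_values($codeset, %sample_env);
        ```

        Decoding into a copy rather than assigning back to %ENV keeps the raw bytes available for any child processes you spawn.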