All the Perl that's Practical to Extract and Report

  • Seeing as you didn't ask for it. ;-)

    Always use UTF-8 if you possibly can. It's (more-or-less) a superset of everything else, and it's properly detectable.
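
To make "properly detectable" a bit more concrete, here is a minimal sketch of a well-formedness check using the core Encode module. The helper name looks_like_utf8 is made up for illustration. A byte string that passes is not guaranteed to have been intended as UTF-8, but non-ASCII text in a legacy 8-bit encoding rarely passes by accident.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper: true if the byte string is well-formed UTF-8.
sub looks_like_utf8 {
    my ($bytes) = @_;
    # The strict 'UTF-8' encoding rejects malformed sequences, and
    # FB_CROAK makes decode die instead of substituting U+FFFD.
    return eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC); 1 } ? 1 : 0;
}

# "\xC4\x80dam" is the UTF-8 byte sequence for "Ādam".
print looks_like_utf8("\xC4\x80dam") ? "utf8\n" : "not utf8\n";
# "s\xE5 bra" is Latin-1: 0xE5 starts a multi-byte sequence that never completes.
print looks_like_utf8("s\xE5 bra")   ? "utf8\n" : "not utf8\n";
```

Pure ASCII input also passes, since ASCII is a subset of UTF-8.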

    If you're looking for interesting encodings, I'd recommend checking out one of the Shift-JIS [wikipedia.org] things. Just for weirdness. Personally, I've little experience of non-western encodings.

    For more concrete use cases to cover with encoding, you should look at:

    • query parameters coming in from browsers
    • POSTed form parameters coming in from a browser.
    • What e
    • Can you tell me more about the command line and environment variable problem? I think I'll have the other ones covered, but I'd like to know how you solved that one. I don't recall reading anything about how Perl will treat those.

      • Personally, I'd just pipe those things through Encoding::FixLatin and enjoy the utf8ness it emits :-)
        • Personally, I'd just pipe those things through Encoding::FixLatin and enjoy the utf8ness it emits :-)

          Interesting. I could have used that module a couple of years ago. Since then I've been using this trick to convert UTF-8-or-CP1252 byte strings to UTF-8 text strings:

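              use Encode qw(decode);
              use Encode::Guess;

              my $line = <>;
              # guess_encoding returns an encoding object on success and an error string on failure
              my $utf8 = guess_encoding($line, 'utf8');
              # A reference means the guess succeeded (UTF-8 or plain ASCII); otherwise assume CP1252
              $line = ref $utf8 ? decode('utf8', $line) : decode('cp1252', $line);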

          http://search.cpan.org/perldoc?POE::Component [cpan.org]

      • It's controlled through the -C flag (see perlrun). Here's an example of using U+0100 (Ā) on the command line. The file contains the word "Ādam".

        $ mate ~/Desktop/adam.txt
        $ adam=$(<~/Desktop/adam.txt)
        $ xxd ~/Desktop/adam.txt
        0000000: c480 6461 6d0a                           ..dam.
        $ perl -MDevel::Peek -le 'Dump $ARGV[0]' $adam
        SV = PV(0x801168) at 0x800954
          REFCNT = 1
          FLAGS = (POK,pPOK)
          PV = 0x2044f0 "\304\20
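
For comparison, the A option of -C tells perl to expect the elements of @ARGV to be UTF-8. A rough sketch of doing the same decoding by hand follows. The helper name decode_args is made up for illustration, and the byte string stands in for a real command-line argument.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode UTF-8 byte strings into character strings,
# roughly what -CA arranges for @ARGV.
sub decode_args {
    my @args = @_;
    return map { decode('UTF-8', $_) } @args;
}

# The five bytes shown by xxd above, minus the trailing newline.
my $bytes  = "\xC4\x80dam";
my ($text) = decode_args($bytes);

printf "bytes: %d, characters: %d\n", length $bytes, length $text;
# prints "bytes: 5, characters: 4"
```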

      • Command line arguments come in as raw bytes, so you have to detect the codeset of the user's environment and decode them if necessary. Roughly like this:

        use I18N::Langinfo qw(langinfo CODESET);
        use Encode qw(decode);
        my $codeset = langinfo(CODESET);
        for (@ARGV) { $_ = decode $codeset, $_ }

        Likewise for environment variable values.
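
A sketch of the environment-variable case, under the same assumptions. The helper name decode_env and the sample hash are made up for illustration, and the codeset is passed in explicitly so the example does not depend on the locale it runs under.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: return a decoded copy of an environment-style hash.
# Only the values are decoded; keys are assumed to be plain ASCII.
sub decode_env {
    my ($env, $codeset) = @_;
    my %decoded = map { $_ => decode($codeset, $env->{$_}) } keys %$env;
    return \%decoded;
}

# Stand-in for %ENV, with one value holding the UTF-8 bytes of "Ādam".
my %fake_env = (USER_NAME => "\xC4\x80dam", SHELL => "/bin/sh");
my $decoded  = decode_env(\%fake_env, 'UTF-8');

printf "USER_NAME has %d characters\n", length $decoded->{USER_NAME};
```

In real use you would pass \%ENV and the value of langinfo(CODESET) instead. Returning a copy rather than modifying %ENV in place keeps the raw bytes available for child processes.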

  • Jag önskar dig fortsatt trevlig läsning och har det så bra! Translation: I wish you more fun reading and take it easy!
  • My recommendation is to avoid \N and \x escapes except for whitespace and combining characters. Literal characters that can be read immediately and copy-pasted anywhere are much more useful.
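
A small illustration of that recommendation, assuming the source file is saved as UTF-8 and declares it with use utf8. The variable names are only for the example.

```perl
use strict;
use warnings;
use utf8;    # the source file is UTF-8, so string literals are character strings

my $literal = "Ādam";          # readable and copy-pastable
my $escaped = "\x{100}dam";    # the same string, but opaque to a reader

# Escapes still earn their place for characters that are invisible or
# ambiguous on screen, such as a combining acute accent (U+0301).
my $combining = "e\x{301}";    # two code points that render as one glyph

print $literal eq $escaped ? "same string\n" : "different strings\n";
printf "combining form has %d code points\n", length $combining;
```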

    »Perl« is a proper name and is not translated (I haven't even seen it transliterated where it would be possible), »monger« is also very difficult to translate because of its multiple denotations in English (of course that word was picked deliberately for this reason). Can you substitute something easi