
Journal of jjore (6662)

Thursday January 07, 2010
12:43 AM

Unicode URLs, wtf?

[ #40081 ]

Hey internet, ⠸⠙⠱ ⠝⠉⠁⠈ ⠅⠝⠁⠕⠕⠉⠃ ⠝⠆⠏⠍⠞?

A year or more ago I was fixing work's web site to handle Unicode as entered by users into form fields. We don't use CGI.pm because....? Well ok, we just don't. It doesn't handle Unicode properly either. Or at least almost no version does. Huh.

If a user types "Coatıcook" you'll probably get the dotless "i" character as either %C4%B1 or %u131, but most of the time the CGI.pm supplied with perl won't do anything reasonable with it.

  • not ok 5.11.3 CGI-3.48
  • not ok 5.10.1 CGI-3.43
  • ok 5.10.0 CGI-3.29
  • not ok 5.8.9 CGI-3.42
  • not ok 5.6.2 CGI-2.752

Wut?

for v in 5.11.3 5.10.1 5.10.0 5.8.9 5.6.2;do
  /opt/perl-$v-64-thr-dbg/bin/perl\
    -le '
      use CGI;
 
      my $input  = "a=%u2021";
      my $expect = "\x{2021}";
      my $got = CGI->new( $input )->param( "a" );
 
      print $expect eq $got
        ? "ok $] $CGI::VERSION"
        : "not ok $] $CGI::VERSION"
    ';
done
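For comparison, here is a minimal sketch of what "something reasonable" could look like when done by hand with only the core Encode module. The helper name decode_param_value is made up for illustration and is not part of CGI.pm. It assumes the escaped value is plain ASCII as it arrives on the wire and that %u escapes carry exactly four hex digits (the form JavaScript's escape() emits), and it makes no attempt to recombine surrogate pairs.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper (not part of CGI.pm): turn one escaped
# query-string value into a Perl character string.
sub decode_param_value {
    my ($escaped) = @_;

    (my $octets = $escaped) =~ tr/+/ /;    # "+" stands for a space in form data

    # %uXXXX becomes the UTF-8 octets of that code point,
    # %XX becomes a single raw octet.
    $octets =~ s{%(?:u([0-9A-Fa-f]{4})|([0-9A-Fa-f]{2}))}{
        defined $1 ? encode('UTF-8', chr hex $1) : chr hex $2
    }ge;

    # Everything is UTF-8 octets now, so decode once into characters.
    return decode('UTF-8', $octets);
}

for my $input ('Coat%C4%B1cook', 'Coat%u0131cook') {
    my $ok = decode_param_value($input) eq "Coat\x{131}cook";
    print $ok ? "ok" : "not ok", " $input\n";
}
```

Routing both escape styles through UTF-8 octets first means a value that mixes %XX and %uXXXX is decoded exactly once, instead of ending up half characters and half octets.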

The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • CGI.pm decodes the non-standard (and invalid according to RFC 3986) pct escape into a UTF-8 octet string, but it doesn't decode that into a Perl Unicode string. I think the current behavior is desirable since the data can contain any octets in any encoding.

    --
    chansen
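The octets-versus-characters distinction described in this comment can be seen without CGI.pm at all. A small sketch using only the core Encode module; the three octets below are simply the UTF-8 encoding of U+2021, standing in for what param() is said to return:

```perl
use strict;
use warnings;
use Encode qw(decode);

my $expect = "\x{2021}";        # character string: one character, U+2021
my $octets = "\xE2\x80\xA1";    # octet string: U+2021 encoded as UTF-8

# Compare the raw octets, then the decoded characters, against $expect.
my $same_before = $octets eq $expect ? 1 : 0;
my $same_after  = decode('UTF-8', $octets) eq $expect ? 1 : 0;

printf "length: %d octets vs %d character\n", length($octets), length($expect);
print "eq before decode: $same_before\n";
print "eq after decode:  $same_after\n";
```

If param() hands back octets, that would explain a "not ok" from the test in the post, since it compares the result against a character string with eq.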

  • > %u131

    What sort of encoding is that? I mean, I can see it's the Unicode codepoint preceded by %u, but which standard backs this? I've never encountered this before.

    Here's my take on it:

    use CGI qw();
    use Encode qw(decode_utf8);

    my $input  = 'a=%C4%B1';
    my $expect = "\x{131}";
    my $got    = decode_utf8(CGI->new($input)->param('a'));
    # as per best practice http://search.cpan.org/perldoc?CGI#-utf8

    use Devel::Peek qw(Dump); Dump $expect; Dump $got;

    print $expect eq $got
      ? "ok $] $

    • It usually comes from broken JavaScript applications that use escape() instead of encodeURI():


      escape("\u263A") -> %u263A
      encodeURI("\u263A") -> %E2%98%BA

      --
      chansen
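The same two spellings can be reproduced in Perl to see where each one comes from. A rough sketch, with both sub names invented for illustration. It handles only a single character above U+00FF in the Basic Multilingual Plane, whereas the real escape() and encodeURI() also pass many ASCII characters through untouched.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical: mimic escape() for one BMP character above U+00FF.
# The code point itself is written out as four hex digits.
sub escape_style {
    my ($char) = @_;
    return sprintf '%%u%04X', ord $char;
}

# Hypothetical: mimic encodeURI() for the same character.
# The character is encoded as UTF-8, then each octet is written as %XX.
sub encode_uri_style {
    my ($char) = @_;
    my $hex = uc unpack('H*', encode('UTF-8', $char));
    $hex =~ s/(..)/%$1/g;
    return $hex;
}

print escape_style("\x{263A}"), "\n";        # %u263A
print encode_uri_style("\x{263A}"), "\n";    # %E2%98%BA
```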

  • Did you try using 'use CGI qw/ :utf8 /;'? That seems to work the way you want with CGI 3.49 (at least it seems to on my box).