
All the Perl that's Practical to Extract and Report


Ovid (2709)

http://publius-ovidius.livejournal.com/

Stuff with the Perl Foundation. A couple of patches in the Perl core. A few CPAN modules. That about sums it up.

Journal of Ovid (2709)

Friday May 13, 2005
01:34 PM

Unicode help needed?

[ #24672 ]

Ordinarily, I would just ask theory how to handle this, but he's going to be gone for a bit (for reasons that I note he doesn't appear to have blogged, so I'll remain mum for a bit and hold off on the congratulations). I am importing some data from a MySQL database and am getting output that looks like this:

Scott Sterling^@
Mimi ValdÃ~CFFÃ~C,Ã~B©s
Kevin R. Scott
Delphine A Fawundu-
Mariel ConcepciÃ~CFFÃ~C,Ã~B³n

Does anyone know what that stuff is and how I might be able to convert that to properly escaped HTML entities?

Update: solved. Much grief led me to create the following subroutine (there were HTML tags embedded, too):

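use HTML::TokeParser::Simple;
use HTML::Entities qw(encode_entities);

sub scrub_text {
    my $html = shift;
    my $parser = HTML::TokeParser::Simple->new(string => $html);
    my $text = '';
    # Keep only the text between tags.
    while (my $token = $parser->get_token) {
        $text .= $token->as_is unless $token->is_tag;
    }
    # Turn high-bit characters into HTML entities.
    $text = encode_entities($text, "\200-\377");
    $text =~ s/[\r\n]/ /g;         # line breaks become spaces
    $text =~ s/[^[:print:]]//g;    # drop any remaining non-printable characters
    $text =~ s/^\s+|\s+$//g;       # trim leading and trailing whitespace
    return $text;
}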

Fortunately, this is a one-time import, so I don't have to worry too much about performance.

The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • The junk that starts with A-tilde is almost certainly UTF-8 being displayed as Latin-1. First, convert the UTF-8 octets to a Unicode string with Encode::decode. Then HTML::Entities should give you the proper Unicode numeric entities.
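The decode-then-escape approach suggested in that comment might look roughly like the sketch below. It assumes the input really is a single layer of UTF-8 octets (the sample output in the post may have been encoded more than once, in which case one decode would not be enough). The helper name utf8_octets_to_entities is made up for illustration and appears nowhere in the original post.

```perl
use strict;
use warnings;
use Encode qw(decode);
use HTML::Entities qw(encode_entities);

# Hypothetical helper (not from the original post): takes raw UTF-8
# octets and returns a string with unsafe characters as HTML entities.
sub utf8_octets_to_entities {
    my ($octets) = @_;

    # Step 1: decode the UTF-8 byte sequence into a Perl character string.
    my $chars = decode('UTF-8', $octets);

    # Step 2: with no second argument, encode_entities escapes control
    # characters, high-bit characters, and <, &, >, ' and ".
    return encode_entities($chars);
}

# "Valdés" as UTF-8 octets: the e-acute is the two bytes 0xC3 0xA9.
print utf8_octets_to_entities("Vald\xC3\xA9s"), "\n";   # prints Vald&eacute;s
```

This emits named entities where HTML::Entities knows one. If numeric entities are wanted instead, as the comment suggests, HTML::Entities also exports encode_entities_numeric on request, which can be swapped in for encode_entities.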