All the Perl that's Practical to Extract and Report

Ovid (2709)

Ovid
http://publius-ovidius.livejournal.com/
AOL IM: ovidperl

Stuff with the Perl Foundation. A couple of patches in the Perl core. A few CPAN modules. That about sums it up.

Journal of Ovid (2709)

Friday May 13, 2005
02:34 PM

Unicode help needed?

[ #24672 ]

Ordinarily, I would just ask theory how to handle this, but he's going to be gone for a bit (for reasons that I note he doesn't appear to have blogged, so I'll remain mum and hold off on the congratulations). I am importing some data from a MySQL database and am getting output that looks like this:

Scott Sterling^@
Mimi ValdÃ~CFFÃ~C,Ã~B©s
Kevin R. Scott
Delphine A Fawundu-
Mariel ConcepciÃ~CFFÃ~C,Ã~B³n

Does anyone know what that stuff is and how I might be able to convert that to properly escaped HTML entities?

Update: solved. Much grief led me to create the following subroutine (there were HTML tags embedded, too):

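use HTML::TokeParser::Simple;
use HTML::Entities;    # exports encode_entities

sub scrub_text {
    my $html = shift;
    my $parser = HTML::TokeParser::Simple->new(string => $html);
    my $text = '';
    # Keep only the non-tag tokens, dropping the embedded markup.
    while (my $token = $parser->get_token) {
        $text .= $token->as_is unless $token->is_tag;
    }
    # Escape the high-bit bytes (octal 200-377) as HTML entities.
    $text = encode_entities($text, "\200-\377");
    $text =~ s/[\r\n]/ /g;         # newlines become spaces
    $text =~ s/[^[:print:]]//g;    # drop remaining non-printable characters
    $text =~ s/^\s+|\s+$//g;       # trim leading and trailing whitespace
    return $text;
}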

Fortunately, this is a one-time import, so I don't have to worry too much about performance.

The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • The junk which starts with A-tilde is almost certainly UTF-8 being displayed as Latin-1. First, convert the UTF-8 octets to a Unicode string with Encode::decode. Then HTML::Entities should give you the proper Unicode entities.
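
A minimal sketch of the two steps suggested in this comment, assuming the database hands back raw UTF-8 octets. The sample string and variable names are made up for illustration. The data shown in the post may have been encoded more than once, in which case a single decode pass would not be enough.

```perl
use strict;
use warnings;
use Encode qw(decode);
use HTML::Entities qw(encode_entities);

# Hypothetical input: "Valdes" with an e-acute stored as the
# two UTF-8 bytes 0xC3 0xA9.
my $octets = "Vald\xC3\xA9s";

# Step 1: turn the octets into a Perl character string.
my $chars = decode('UTF-8', $octets);

# Step 2: escape it for HTML. With no second argument,
# encode_entities escapes control characters, non-ASCII
# characters, and the HTML metacharacters < > & " '.
my $html = encode_entities($chars);

print "$html\n";
```

Characters that have a named entity come out in that form (here, &eacute;); HTML::Entities falls back to numeric entities for the rest.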