Stories
Slash Boxes
Comments
NOTE: use Perl; is on undef hiatus. You can read content, but you can't post it. More info will be forthcoming forthcomingly.

All the Perl that's Practical to Extract and Report

use Perl Log In

Log In

[ Create a new account ]

Journal of LTjake (4001)

Wednesday May 21, 2003
01:13 PM

Fallback characters

[ #12357 ]

When entering data into our (*sigh* test) cataloguing system, generally an from automated MARC record extraction (yay MARC::Record!), some of the data gets normalized.

By normalized, i mean it's all converted to one case, punctuations are removed, accented characters are changed to a-z letter (ex: é => e), etc. We've tried to comply mostly to the NACO normalization rules.

A basic character fallback hash can be found in POD::Escapes [Thanks TorgoX!] (see %Latin1Char_to_fallback), which i've modified slightly. Are there any modules that are explicitly for designed to do this type of normalization?

I'm sure this code sucks, but... :)

# LUT to convert diacritical and special characters
# Modified from Pod::Escapes

my (%Latin1Code_to_fallback, %Latin1Char_to_fallback);
%Latin1Code_to_fallback = ();

@Latin1Code_to_fallback{0xA0 .. 0xFF} = ( ' ', qq{!}, qq{C/}, 'PS', qq{\$?}, qq{Y=}, qq{|}, 'SS', qq{"}, ' ', 'a', qq{<<}, qq{!}, "", ' ', qq{-}, ' ', ' ', '2', '3', qq{'}, 'u', 'P', qq{*}, qq{,}, '1', 'o', qq{>>}, qq{1/4}, qq{1/2}, qq{3/4}, qq{?}, 'A', 'A', 'A', 'A', 'A', 'A', 'AE', 'C', 'E', 'E', 'E', 'E', 'I', 'I', 'I', 'I', 'D', 'N', 'O', 'O', 'O', 'O', 'O', 'x', 'O', 'U', 'U', 'U', 'U', 'U', 'Th', 'ss', 'a', 'a', 'a', 'a', 'a', 'a', 'ae', 'c', 'e', 'e', 'e', 'e', 'i', 'i', 'i', 'i', 'd', 'n', 'o', 'o', 'o', 'o', 'o', qq{/}, 'o', 'u', 'u', 'u', 'u', 'y', 'th', 'y', );

{
    %Latin1Char_to_fallback = ();
    my($k,$v);
    while( ($k,$v) = each %Latin1Code_to_fallback) {
        $Latin1Char_to_fallback{chr $k} = $v;
    }
}

# Massage data into a searchable value

sub normalize {
    my $data = shift;

    # Rules taken from NACO Normalization
    # http://lcweb.loc.gov/catdir/pcc/naco/normrule.html

    # Convert special chars to spaces
    $data =~ s/[\Q!(){}<>-;:.?,\/\\@*%=\$^~\E]/ /g;

    # Delete special chars
    $data =~ s/[\Q'[]|\E]//g;

    # Remove diacritical marks and convert special chars
    my @chars = split(//, $data);
    for my $i (0..$#chars) {
        $chars[$i] = $Latin1Char_to_fallback{$chars[$i]} if (ord($chars[$i]) >= 160 && ord($chars[$i]) <= 255);
    }
    $data = join('', @chars);

    # Convert lowercase to uppercase.
    $data =~ tr/a-z/A-Z/;

    # Remove leading and trailing spaces
    $data =~ s/^\s+|\s+$//g;

    # Condense multiple spaces
    $data =~ s/\s+/ /g;

    return $data;
}

The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
 Full
 Abbreviated
 Hidden
More | Login | Reply
Loading... please wait.