
Journal of LTjake (4001)

Tuesday March 21, 2006
11:04 AM

Finding typos in our catalog

[ #29062 ]

We're getting close to launching our new catalog browser. The old DB schema was a bloody mess, so in the process of making this new app we've refined the schema and tried to do as much data quality checking as we can.

Just last night I stumbled upon a nice list of common typos found in the OhioLINK catalog. I thought it might be worthwhile to check the list against our data.

I wrote a quick n' dirty script to spit out the list of words from that page and put it in the DATA section of the following script.

It simply runs a search against the catalog for each term and grabs any item IDs it finds in the response.

use constant CATALOG_URL => 'http://localhost:3000/search/?q=%s';
use constant ITEM_REGEX  => '\/item\/(\d+)';

use strict;
use warnings;

use LWP::UserAgent;
use List::MoreUtils qw( uniq );
use URI::Escape qw( uri_escape );

# unbuffered output, so matches show up as they are found
$|++;

my $agent = LWP::UserAgent->new;
my $regex = ITEM_REGEX;
while( <DATA> ) {
    chomp;

    # remove strings in brackets and clean up whitespace
    s/[\[\(].+?[\]\)]//g;
    s/\s+$//;
    s/\s+/ /g;
    next unless length;

    # query the catalog, escaping spaces and other unsafe characters
    my $response = $agent->get( sprintf( CATALOG_URL, uri_escape( $_ ) ) );
    next unless $response->is_success;
    my @matches = ( $response->content =~ /$regex/gs );
    next unless @matches;

    # print the results
    print "$_: " . join( ', ', uniq @matches ) . "\n";
}

# Words taken from http://faculty.quinnipiac.edu/libraries/tballard/typoscomplete.html
# Regex: /<br><font color=#.{6}">(.+?)<!\(.+?\)><\/font>/gs
__DATA__
Accomodat*
Accordia*
Activi te*
Administat*
Administraton*
Adminstrat*
Amd
Archael*
Artic
Assocat*
A sss* [and not ass's]
Berkeley [and] Mass
Cby*
Cincinatti*
...

It hasn't gone through a full run yet, but so far out of about 3350 words, we've only matched about 105. Not bad.
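
For a feel of what the cleanup and matching steps do in isolation, here's a minimal sketch that wraps them in two subs and runs them on sample input. The sub names (clean_term, extract_item_ids) and the sample HTML are made up for illustration and are not part of the script above. A %seen hash also stands in for List::MoreUtils' uniq so the snippet needs nothing beyond core Perl.

```perl
use strict;
use warnings;

# Same three substitutions as the main loop, wrapped in a sub.
# (clean_term is a hypothetical name, not part of the original script.)
sub clean_term {
    my ($term) = @_;
    $term =~ s/[\[\(].+?[\]\)]//g;    # drop bracketed or parenthesised notes
    $term =~ s/\s+$//;                # trim trailing whitespace
    $term =~ s/\s+/ /g;               # collapse runs of whitespace
    return $term;
}

# Pull unique item ids out of a response body, keeping first-seen order.
# (extract_item_ids is also a hypothetical name.)
sub extract_item_ids {
    my ($html) = @_;
    my %seen;
    return grep { !$seen{$_}++ } ( $html =~ m{/item/(\d+)}g );
}

# Made-up response body, for illustration only.
my $sample = '<a href="/item/12">x</a> <a href="/item/7">y</a> <a href="/item/12">z</a>';

print clean_term("A sss* [and not ass's]"), "\n";       # prints "A sss*"
print join( ', ', extract_item_ids($sample) ), "\n";    # prints "12, 7"
```

Keeping the cleanup in its own sub also makes it easy to eyeball how odd entries like "Berkeley [and] Mass" come out before any requests are sent.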
