All the Perl that's Practical to Extract and Report

The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • This doesn't help with addresses, but I created the Lingua-EN-MatchNames module to eliminate duplicate user records between security and groupware databases a few years ago, and it worked quite well.
    • Thanks Brian,

      Yes, I looked at it, as well as the excellent modules Lingua::EN::NameParse and Lingua::EN::AddressParse by Kim Ryan. I plan to use them once I get to the blocking window level of matching.

      The problem, of course, is that when you have many millions of records, the turnaround time for a really close look at each record just gets too large. So what I think I need to do is determine how to split the records of a large dataset into groups that can be compared in an economical amount of time (the blocking window), and make use of grid or LAM style computing to do the computationally intensive work needed for the merging.

      Your module does cover almost everything needed once we have a list of candidates to work on, so when I'm at that stage, I'll see if I can submit any changes to you.

      Thanks, Mike Harding
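
For anyone following along, the blocking idea described above can be sketched roughly as follows. This is only an illustration in Python, not anything from the thread or from the Lingua modules: the record fields (`surname`, `postcode`), the choice of blocking key, and the sample data are all made up for the example. The point is that the expensive pairwise comparison only runs inside each block, and each block could be handed to a separate worker.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    """Cheap, hypothetical key: surname initial plus first 3 postcode characters."""
    surname = record["surname"].strip().upper()
    postcode = record["postcode"].strip().upper()
    return (surname[:1], postcode[:3])


def build_blocks(records):
    """Group records by blocking key; each group is one 'blocking window'."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[blocking_key(rec)].append(rec)
    return blocks


def candidate_pairs(records):
    """Yield only pairs that share a block, instead of all n*(n-1)/2 pairs."""
    for block in build_blocks(records).values():
        yield from combinations(block, 2)


# Invented sample data, purely for illustration.
records = [
    {"id": 1, "surname": "Harding", "postcode": "90210"},
    {"id": 2, "surname": "harding ", "postcode": "90211"},
    {"id": 3, "surname": "Ryan", "postcode": "30301"},
    {"id": 4, "surname": "Hardin", "postcode": "90299"},
]

pairs = [(a["id"], b["id"]) for a, b in candidate_pairs(records)]
print(pairs)
```

With these four records, only the three that share a block get compared with each other, rather than all six possible pairs. The candidate pairs are what a detailed name or address matcher would then look at closely. A key this crude will miss duplicates whose surname initial or postcode prefix differs, so a real setup would probably need several passes with different keys.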