At my last gig, I used several expensive commercial packages to solve what was called the merge/purge, or deduplication, problem.
I always hoped for an open source solution, because the results were sometimes mysterious and access to people who understood the theory behind name and address matching was very limited. (I suspect these were treated as trade secrets.)
At the time, I didn't know that the computer science term for the problem was record linkage, or that the bioinformatics community has been working on it from two directions: finding overlapping data in large strings (for the human genome), and bringing together people's names for medical records and histories.
This last one is particularly promising, since it's a close match to name and address issues. From there, I googled a few record linkage projects, but the projects they discussed, AJAX and Potter's Wheel, seemed to be more interactive tools, and I'm more interested in large datasets. The only AJAX source I could find seems to be a Java tool, and the Potter's Wheel project seems to have been abandoned in 2000.
There's a group at the Australian National University working on it now, as part of their data mining group, with a project called Febrl, but that's also in Python and seems to be written by non-programmers. This looks like an interesting gap...
I like how they've avoided the deterministic approach to creating matching rules and are instead using hidden Markov models to determine standardization and linkage rules.
My experience tells me this is the way to go, since matching rules are really derived from the data anyway.
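To make the idea concrete, here's a toy sketch (not Febrl's actual code) of what an HMM-based standardizer does: the hidden states are address fields, the observations are coarse token classes, and Viterbi decoding picks the most likely field for each token. All the state names, token classes, and probabilities below are invented for illustration; a real system would learn them from training data.

```python
# Toy HMM address standardizer. States, observation classes, and all
# probabilities are hand-set for illustration only; a real standardizer
# would train these from labeled data.

states = ["house_num", "street_name", "street_type"]

# Start, transition, and emission probabilities (illustrative values).
start = {"house_num": 0.8, "street_name": 0.15, "street_type": 0.05}
trans = {
    "house_num":   {"house_num": 0.05, "street_name": 0.9,  "street_type": 0.05},
    "street_name": {"house_num": 0.02, "street_name": 0.48, "street_type": 0.5},
    "street_type": {"house_num": 0.1,  "street_name": 0.8,  "street_type": 0.1},
}
emit = {
    "house_num":   {"number": 0.9,  "word": 0.05, "type_word": 0.05},
    "street_name": {"number": 0.05, "word": 0.85, "type_word": 0.1},
    "street_type": {"number": 0.02, "word": 0.08, "type_word": 0.9},
}

TYPE_WORDS = {"st", "street", "rd", "road", "ave", "avenue", "ln", "lane"}

def classify(token):
    """Map a raw token to a coarse observation class."""
    if token.isdigit():
        return "number"
    if token.lower().strip(".,") in TYPE_WORDS:
        return "type_word"
    return "word"

def viterbi(tokens):
    """Return (token, field) pairs for the most likely field sequence."""
    obs = [classify(t) for t in tokens]
    # prob[s] = probability of the best path ending in state s so far
    prob = {s: start[s] * emit[s][obs[0]] for s in states}
    back = []  # back[i][s] = best predecessor of state s at position i+1
    for o in obs[1:]:
        prev, nxt = {}, {}
        for s in states:
            best_p, best_r = max((prob[r] * trans[r][s], r) for r in states)
            nxt[s] = best_p * emit[s][o]
            prev[s] = best_r
        back.append(prev)
        prob = nxt
    # Trace back from the best final state.
    state = max(prob, key=prob.get)
    path = [state]
    for prev in reversed(back):
        state = prev[state]
        path.append(state)
    return list(zip(tokens, reversed(path)))
```

The appeal is exactly what Febrl's authors point out: instead of hand-writing deterministic rules ("a leading number is always the house number"), the probabilities encode soft tendencies learned from real addresses, so unusual orderings can still decode sensibly.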
I'm going to try to follow the same approach, but using Perl tools, and maybe look into the BioPerl community for performance ideas. I'll post the results here as I find them.
Some initial tasks:
- Find a good, efficient, Perl-centric hidden Markov model implementation that doesn't eat too much memory. I've looked at a couple; most of them drop down into C, and while I'm a competent C programmer, I think that would make the tool less friendly overall.
- Find enough good sample data in the public domain. If I'm going to train a model, I'll need names and addresses, so I've got to find a good US sample that isn't owned by anyone.
- Train the model, run some tests, and post the results.
I'll see how this goes in a few days.