Stuff with the Perl Foundation. A couple of patches in the Perl core. A few CPAN modules. That about sums it up.
While doing some research a few days ago I found myself reading a paragraph that seemed very familiar. In digging around, I found the other news story I was looking for. Several sentences were duplicates and several were subtly altered, but it was the same paragraph. The stories, I might add, were over a year apart and were by different authors.
While it could very well be that this particular news source has an internal practice of allowing reporters to borrow copy from one another without attribution, I'm not aware that this is common practice (of course, I am not a journalist, either). Further, with all of the recent high-profile plagiarism cases, it seems less likely than ever that news organizations would tolerate this practice. In trying to research whether or not the reporter in question had plagiarized any other work, I quickly found that, while it's easy to compare two paragraphs, it's not easy to compare one story to hundreds of others. Automation is the way to go.
Many of the tools I found on the CPAN seemed too low level for this type of work, so I started writing Text::Plagiarized. It's not on the CPAN, nor is it available for download. However, after a bit of research, I found it was surprisingly easy to do a basic analysis (well, the code is easy to use; I threw away three implementations before I stumbled on the "easy" one).
use Data::Dumper;    # provides Dumper(), used below

my $text = Text::Plagiarized->new;
$text->original($original_text);
foreach my $comparison (@comparison_texts) {
    $text->comparison($comparison);
    $text->analyze;
    print $text->percent, $/;    # percent of matching sentences
    if ($text->percent > $some_threshold) {
        # arrayref of array refs with [$sentence, $possible_match]
        print Dumper($text->matches);
    }
}
You can tweak how "sensitive" you want the matching to be, but so far, it handles fuzzy matching like the following two texts:
my ($text1, $text2) = (<<"END_FIRST", <<"END_SECOND");
This is some text that might be plagiarized. Whether or not it has
been can be difficult for a simple program to detect. The writer
may simply change a few words here and there. He or she might add
some extra punctuation or just throw in an extra sentence or two.
However they do it, there is usually some subtle difference between
the original and the copy.
END_FIRST
This text might be plagiarized. Whether or not it has been can be
difficult for a simple program to detect. The writer can simply
change a few words here and there or they might add some extra
punctuation. However they do it, there are usually subtle
differences between the original and the copy.
END_SECOND
At the default threshold (80% match), only the first sentence of those paragraphs fails to match. Merely lowering the threshold to 74% will pick up that first sentence.
For some reason I feel a bit uncomfortable about releasing this. I'm not sure why. In any event, it's not done, so I have time to think about this. I don't account for misspellings or stemming, the interface might change, and it seems fairly fragile in odd corner cases.
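Since Text::Plagiarized itself isn't published, the following is only a minimal sketch of how this kind of sentence-level fuzzy matching could work. It assumes a character-level edit distance scaled by the length of the longer sentence, a naive sentence splitter, and a threshold expressed as a fraction between 0 and 1. The function names (edit_distance, similarity, sentences, find_matches) are hypothetical and are not the module's API, and the scores this sketch produces won't necessarily agree with the 80% and 74% figures above.

```perl
use strict;
use warnings;
use List::Util qw(min max);

# Dynamic-programming edit distance over characters: the number of
# insertions, deletions and replacements needed to turn $s into $t.
sub edit_distance {
    my ($s, $t) = @_;
    my @s = split //, $s;
    my @t = split //, $t;
    my @prev = (0 .. @t);    # distance from the empty prefix of $s
    for my $i (1 .. @s) {
        my @curr = ($i);
        for my $j (1 .. @t) {
            my $cost = $s[$i - 1] eq $t[$j - 1] ? 0 : 1;
            $curr[$j] = min(
                $prev[$j] + 1,            # deletion
                $curr[$j - 1] + 1,        # insertion
                $prev[$j - 1] + $cost,    # replacement (or exact match)
            );
        }
        @prev = @curr;
    }
    return $prev[-1];
}

# Assumption: case, punctuation and runs of whitespace are ignored.
sub normalize {
    my ($sentence) = @_;
    my $n = lc $sentence;
    $n =~ s/[^a-z0-9\s]//g;
    $n =~ s/\s+/ /g;
    $n =~ s/^ | $//g;
    return $n;
}

# Similarity from 0 (nothing in common) to 1 (identical once normalized).
sub similarity {
    my ($x, $y) = map { normalize($_) } @_;
    my $longest = max(length($x), length($y));
    return 1 if $longest == 0;
    return 1 - edit_distance($x, $y) / $longest;
}

# Naive splitter: a sentence ends at '.', '!' or '?' followed by whitespace.
sub sentences {
    my ($text) = @_;
    $text =~ s/\s+/ /g;
    return grep { /\S/ } split /(?<=[.!?])\s+/, $text;
}

# Returns an arrayref of [$sentence, $possible_match] pairs, one for each
# sentence of $original whose best candidate reaches $threshold.
sub find_matches {
    my ($original, $comparison, $threshold) = @_;
    my @candidates = sentences($comparison);
    my @matches;
    for my $sentence (sentences($original)) {
        my ($best, $best_score) = (undef, 0);
        for my $candidate (@candidates) {
            my $score = similarity($sentence, $candidate);
            ($best, $best_score) = ($candidate, $score)
                if $score > $best_score;
        }
        push @matches, [$sentence, $best]
            if defined $best && $best_score >= $threshold;
    }
    return \@matches;
}
```

Calling find_matches($text1, $text2, 0.80) on two texts returns the sentence pairs that clear the threshold; dividing the number of pairs by the number of sentences in the original gives a "percent of matching sentences" figure. Every sentence is compared against every candidate, so the cost grows quickly with the length of the texts.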
a change of name ? (Score:1)
Perhaps you're looking at only one aspect of how a module like this may be used. Yes, it can be used for detecting plagiarism, should the user choose to do so. But it can also be used as a similarity detection metric, which has uses far beyond seeing if journalists borrowed copy or if students cribbed essays.
Related articles? Contextual matching? I can think of a few more uses for this type of module. I'd actually like to see how you do it, out of academic interest.
Re:a change of name ? (Score:2)
Because of the way the code is designed, I seriously doubt that it could be used for related articles or contextual matching. It's slow, but that's because of the algorithm I chose (which, surprisingly, turned out to be faster than some of the other options I was looking at). It does a sentence-by-sentence comparison to determine "how far apart" two sentences are in terms of insertions, deletions and replacements. If they're close enough (under the user-defined threshold), then a match is reported. It's th