All the Perl that's Practical to Extract and Report

Ovid (2709)

Stuff with the Perl Foundation. A couple of patches in the Perl core. A few CPAN modules. That about sums it up.

Journal of Ovid (2709)

Sunday June 05, 2005
11:15 PM


[ #25050 ]

While doing some research a few days ago I found myself reading a paragraph that seemed very familiar. In digging around, I found the other news story I was looking for. Several sentences were duplicates and several were subtly altered, but it was the same paragraph. The stories, I might add, were over a year apart and were by different authors.

While it could very well be that this particular news source has an internal practice of allowing reporters to borrow copy from one another without attribution, I'm not aware that this is a common practice (of course, I am not a journalist, either). Further, with all of the recent high-profile plagiarism cases, it seems less likely than ever that news organizations would tolerate this practice. In trying to research whether or not the reporter in question had plagiarized any other work, I quickly found that, while it's easy to compare two paragraphs, it's not easy to compare one story to hundreds of others. Automation is the way to go.

Many of the tools I found on the CPAN seemed too low level for this type of work, so I started writing Text::Plagiarized. It's not on the CPAN, nor is it available for download. However, after a bit of research, I found it was surprisingly easy to do a basic analysis (well, the code is easy to use; I threw away three implementations before I stumbled on the "easy" one).

use Data::Dumper;

my $text = Text::Plagiarized->new;
foreach my $comparison (@comparison_texts) {
  # (handing $comparison to $text is not shown in this snippet)
  print $text->percent, $/; # percent of matching sentences
  if ($text->percent > $some_threshold) {
    # arrayref of array refs with [$sentence, $possible_match]
    print Dumper($text->matches);
  }
}

You can tweak how "sensitive" you want the matching to be, but so far, it handles fuzzy matches like those between the following two texts:

my ($text1, $text2) = (<<"END_FIRST", <<"END_SECOND");
This is some text that might be plagiarized.  Whether or not it has
been can be difficult for a simple program to detect.  The writer
may simply change a few words here and there.  He or she might add
some extra punctuation or just throw in an extra sentence or two.
However they do it, there is usually some subtle difference between
the original and the copy.
END_FIRST
This text might be plagiarized.  Whether or not it has been can be
difficult for a simple program to detect.  The writer can simply
change a few words here and there or they might add some extra
punctuation.  However they do it, there are usually subtle
differences between the original and the copy.
END_SECOND

At the default threshold (80% match), only the first sentence in those paragraphs fails to match. Merely lowering the threshold to 74% will pick up that first sentence.
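
For anyone who wants to experiment with the idea, here is a minimal sketch of sentence-level fuzzy matching. It is not the Text::Plagiarized code, which is unreleased. It assumes a character-level Levenshtein distance (insertions, deletions and replacements), a naive sentence splitter, and a similarity score of one minus the distance divided by the longer sentence's length. The function names (edit_distance, similarity, sentences, fuzzy_matches) are all made up for illustration, and its scores should not be expected to reproduce the 80% and 74% figures above.

```perl
use strict;
use warnings;
use List::Util qw(min max);

# Character-level Levenshtein distance, computed one row at a time.
sub edit_distance {
    my ($s, $t) = @_;
    my @prev = (0 .. length($t));
    for my $i (1 .. length($s)) {
        my @curr = ($i);
        for my $j (1 .. length($t)) {
            my $cost = substr($s, $i - 1, 1) eq substr($t, $j - 1, 1) ? 0 : 1;
            push @curr, min(
                $prev[$j] + 1,             # deletion
                $curr[$j - 1] + 1,         # insertion
                $prev[$j - 1] + $cost,     # replacement (or exact match)
            );
        }
        @prev = @curr;
    }
    return $prev[-1];
}

# Similarity in the range 0..1; 1 means the strings are identical.
sub similarity {
    my ($s, $t) = @_;
    my $longest = max(length($s), length($t));
    return 1 if $longest == 0;
    return 1 - edit_distance($s, $t) / $longest;
}

# Naive splitter: collapse whitespace, then break after ., ! or ?
# (an assumption; abbreviations such as "Mr." will be split wrongly).
sub sentences {
    my ($text) = @_;
    $text =~ s/\s+/ /g;
    $text =~ s/^ | $//g;
    return map { lc $_ } split /(?<=[.!?]) /, $text;
}

# For each sentence of the original, report the closest sentence in the
# candidate text if its similarity reaches the threshold (0..1).
sub fuzzy_matches {
    my ($original, $candidate, $threshold) = @_;
    my @candidates = sentences($candidate);
    my @matches;
    for my $sentence (sentences($original)) {
        my ($best, $best_score) = (undef, 0);
        for my $other (@candidates) {
            my $score = similarity($sentence, $other);
            ($best, $best_score) = ($other, $score) if $score > $best_score;
        }
        push @matches, [$sentence, $best, $best_score]
            if defined $best && $best_score >= $threshold;
    }
    return \@matches;
}

my $matches = fuzzy_matches(
    'The cat sat on the mat. Dogs bark loudly.',
    'The cat sat on a mat! Something else entirely.',
    0.8,
);
printf "%.0f%%  %s  <=>  %s\n", 100 * $_->[2], @{$_}[0, 1] for @$matches;
```

Comparing every sentence against every other sentence is quadratic in the number of sentences, and each comparison is quadratic in sentence length, so a sketch like this is only practical for a handful of stories at a time.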

For some reason I feel a bit uncomfortable about releasing this. I'm not sure why. In any event, it's not done, so I have time to think about this. I don't account for misspellings or stemming, the interface might change, and it seems fairly fragile in odd corner cases.

The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • Try turning that loose on SEC filings. They tend to be heavy on formalized boilerplate, giving you great fodder.
  • Perhaps you're looking at only one aspect of how a module like this may be used. Yes, it can be used for detecting plagiarism, should the user choose to do so. But it can also be used as a similarity detection metric, which has uses far beyond seeing if journalists borrowed copy or if students cribbed essays.

    Related articles? Contextual matching? I can think of a few more uses for this type of module. I'd actually like to see how you do it, out of academic interest.

    • Good idea. Text::Related would be one possibility. This would be perfect for an open-source Google News. I'd love to use the code, if you ever decide to release it.

      -DA

    • Because of the way the code is designed, I seriously doubt that it could be used for related articles or contextual matching. It's slow, but that's because of the algorithm I chose (which turned out to be surprisingly faster than some of the other options I was looking at.) It does a sentence by sentence comparison to determine "how far apart" two sentences are in terms of insertions, deletions and replacement. If they're close enough (under the user defined threshold), then a match is reported. It's th

  • ...those horrible "scouring your site for term papers" people would start using it within milliseconds of hitting the CPAN :)
    • Actually, the term papers people already use a fairly algorithm. I used to work at a company which did something similar. The basic algorithm was similar to rsync, using hashes for blocks of tokens.
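
The block-hashing approach mentioned in the last comment can be sketched roughly as follows. This is only an illustration of the general idea, not that company's algorithm and not rsync itself: it assumes overlapping windows of lowercased word tokens, hashes each window with MD5 rather than a rolling checksum, and reports what fraction of the suspect text's blocks also occur in the source. The names tokens, block_hashes and shared_fraction are hypothetical.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Lowercase the text and split it into simple word tokens.
sub tokens {
    my ($text) = @_;
    return grep { length $_ } split /[^a-z0-9']+/, lc $text;
}

# Hash every overlapping window of $size tokens; returns a hashref
# whose keys are the MD5 digests of the blocks.
sub block_hashes {
    my ($text, $size) = @_;
    my @tokens = tokens($text);
    my %hashes;
    for my $start (0 .. @tokens - $size) {
        my $block = join ' ', @tokens[$start .. $start + $size - 1];
        $hashes{ md5_hex($block) } = 1;
    }
    return \%hashes;
}

# Fraction (0..1) of the suspect text's blocks that also appear in the source.
sub shared_fraction {
    my ($suspect, $source, $size) = @_;
    my $suspect_hashes = block_hashes($suspect, $size);
    my $source_hashes  = block_hashes($source,  $size);
    my $total = keys %$suspect_hashes;
    return 0 unless $total;
    my $shared = grep { exists $source_hashes->{$_} } keys %$suspect_hashes;
    return $shared / $total;
}

printf "%.2f\n", shared_fraction(
    'The writer may simply change a few words here and there.',
    'The writer can simply change a few words here and there.',
    3,
);
```

Because each document is reduced to a set of hashes once, comparing one story against hundreds of others becomes a matter of hash lookups rather than pairwise sentence comparisons. The trade-off is that any changed word breaks every block that contains it.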