
All the Perl that's Practical to Extract and Report

  • Try turning that loose on SEC filings. They tend to be heavy on formalized boilerplate, giving you great fodder.
  • Perhaps you're looking at only one aspect of how a module like this may be used. Yes, it can be used for detecting plagiarism, should the user choose to do so. But it can also be used as a similarity detection metric, which has uses far beyond seeing if journalists borrowed copy or if students cribbed essays.

    Related articles? Contextual matching? I can think of a few more uses for this type of module. I'd actually like to see how you do it, out of academic interest.

    • Good idea. Text::Related would be one possibility. This would be perfect for an open-source Google News. I'd love to use the code, if you ever decide to release it.

      -DA

    • Because of the way the code is designed, I seriously doubt that it could be used for related articles or contextual matching. It's slow, but that's because of the algorithm I chose (which turned out to be surprisingly faster than some of the other options I was looking at). It does a sentence-by-sentence comparison to determine "how far apart" two sentences are in terms of insertions, deletions and replacements. If they're close enough (under the user-defined threshold), then a match is reported. It's th
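
      For anyone curious what a sentence-by-sentence comparison like that might look like, here is a rough sketch in Python. It is not the module's code, which hasn't been released. It assumes a plain Levenshtein edit distance computed over words (the comment doesn't say whether the unit is words or characters), a naive punctuation-based sentence splitter, and made-up names (split_sentences, edit_distance, find_matches, threshold).

```python
import re

def split_sentences(text):
    # Naive splitter (an assumption): break after ., ! or ? followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def edit_distance(a, b):
    # Classic Levenshtein distance: the minimum number of insertions,
    # deletions and replacements needed to turn sequence a into sequence b.
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # replace x with y (or keep)
        prev = curr
    return prev[-1]

def find_matches(text_a, text_b, threshold=3):
    # Compare every sentence of text_a with every sentence of text_b and
    # report the pairs whose word-level distance is under the threshold.
    matches = []
    for sa in split_sentences(text_a):
        words_a = sa.lower().split()
        for sb in split_sentences(text_b):
            d = edit_distance(words_a, sb.lower().split())
            if d < threshold:
                matches.append((sa, sb, d))
    return matches
```

      Comparing every sentence against every other sentence, with a quadratic distance calculation for each pair, is also a plausible reason for an approach like this to be slow on large documents.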

  • ...those horrible "scouring your site for term papers" people would start using it within milliseconds of hitting the CPAN :)
    • Actually, the term papers people already use a fairly algorithm. I used to work at a company which did something similar. The basic algorithm was similar to rsync, using hashes for blocks of tokens.
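
      A rough sketch of the "hashes for blocks of tokens" idea, in Python. This only illustrates the general rsync-like approach; the comment gives no details of what that company actually ran. The block size, the whitespace tokenizer, the use of MD5 and all function names are assumptions. For brevity it re-hashes each sliding window from scratch, where rsync would use a cheap rolling checksum.

```python
import hashlib

def tokens(text):
    # Assumed tokenizer: lowercase and split on whitespace.
    return text.lower().split()

def block_hash(block):
    # Hash one block of tokens; MD5 is only used here as a fingerprint.
    return hashlib.md5(" ".join(block).encode("utf-8")).hexdigest()

def index_blocks(text, size=4):
    # Cut the source into fixed, non-overlapping blocks of `size` tokens
    # and map each block's hash to the offsets where it starts.
    toks = tokens(text)
    index = {}
    for start in range(0, len(toks) - size + 1, size):
        index.setdefault(block_hash(toks[start:start + size]), []).append(start)
    return index

def shared_blocks(source, candidate, size=4):
    # Slide a window over the candidate one token at a time and look each
    # window's hash up in the source index, as rsync does with file blocks.
    index = index_blocks(source, size)
    toks = tokens(candidate)
    hits = []
    for start in range(len(toks) - size + 1):
        h = block_hash(toks[start:start + size])
        for src_start in index.get(h, []):
            hits.append((src_start, start))
    return hits
```

      Each hit is a pair (offset in source, offset in candidate) counted in tokens. Because the source index is a hash table, each lookup is cheap, which is what would make this kind of scheme practical across a large pile of documents.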