All the Perl that's Practical to Extract and Report

  • Personally, I think the images are a waste of time. merlyn (Randal Schwartz) did a column in Web Techniques for the same basic thing.

    Never one to resist a pointless challenge, before the article hit print, I wrote a "cracker" for it. The write-up is here [perlmonks.org], for those that may be interested.

    You're going to have to get a lot more tricky than 3 letters with a consistent font to stop a 'bot. Most of the time is invested in creating the font table, but once you've got that, the pattern matching is trivial.
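    To illustrate why the matching step is cheap once a font table exists, here is a minimal sketch. It is not the cracker from the write-up: the 3x5 glyphs, the fixed character pitch, and the names FONT_TABLE, render, split_columns and decode are all made up for illustration, and the image is assumed to be already reduced to '#' and '.' rows.

```python
# Hypothetical 3x5 glyphs; a real font table would be built from sample images.
FONT_TABLE = {
    "A": (".#.", "#.#", "###", "#.#", "#.#"),
    "B": ("##.", "#.#", "##.", "#.#", "##."),
    "C": (".##", "#..", "#..", "#..", ".##"),
}
GLYPH_WIDTH = 3   # assumption: fixed-pitch font, no spacing between glyphs
GLYPH_HEIGHT = 5


def render(text):
    """Build a test bitmap by placing the table's glyphs side by side."""
    return ["".join(FONT_TABLE[ch][r] for ch in text) for r in range(GLYPH_HEIGHT)]


def split_columns(rows, width=GLYPH_WIDTH):
    """Cut a fixed-pitch bitmap into one tuple of row strings per glyph."""
    return [
        tuple(row[i:i + width] for row in rows)
        for i in range(0, len(rows[0]), width)
    ]


def decode(rows):
    """Exact-match each glyph against the font table; '?' if unknown."""
    lookup = {glyph: ch for ch, glyph in FONT_TABLE.items()}
    return "".join(lookup.get(glyph, "?") for glyph in split_columns(rows))


if __name__ == "__main__":
    print(decode(render("CAB")))
```

    Exact matching like this only holds up while the font, size and position stay constant, which is the weakness being pointed out.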

    • You've probably noticed there is noise in the background of the Slash images. Any thoughts on how difficult that makes the problem?

      If misregistration, dithering, etc. would make things harder to crack, the Slash team can do those things too. In this case, the arms race advantage goes to the server side. Tweaking text to make it less computer-readable is easy; recoding OCR algorithms is comparatively extremely difficult. The Slash code doesn't have such things yet, but it would be a matter of minutes to add.

      • Perhaps it's time, then. I wrote a small utility to take the images and extract the characters. Out of the 24 images or so I pulled, I was able to decode them 100%.

        Mind you, all this program does is take the image, convert it to a bitmap, and run a simple threshold comparison: if the RGB value is less than a certain value, the pixel is black; otherwise it's white. I output this as an ASCII image composed of '#' and '.' in a 24 x 19 array.
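        A rough sketch of that threshold step, for anyone who wants to try it. This is not the utility itself: it assumes the image is already decoded into rows of (r, g, b) tuples, it takes "RGB value" to mean the average of the three channels, and the cutoff of 128 and the name to_ascii are arbitrary choices for illustration.

```python
def to_ascii(pixels, threshold=128):
    """Turn rows of (r, g, b) pixels into rows of '#' (dark) and '.' (light).

    Assumption: brightness is the plain average of the three channels,
    and anything strictly below `threshold` counts as black.
    """
    lines = []
    for row in pixels:
        lines.append(
            "".join("#" if sum(rgb) / 3 < threshold else "." for rgb in row)
        )
    return lines


if __name__ == "__main__":
    # Tiny made-up 2x3 sample: dark text pixels on light, noisy background.
    sample = [
        [(0, 0, 0), (255, 255, 255), (180, 180, 180)],
        [(200, 200, 200), (10, 20, 30), (90, 90, 90)],
    ]
    for line in to_ascii(sample):
        print(line)
```

        Light background noise falls above the cutoff and drops out, which is why faint noise alone does little against this approach.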

        All the images I tested were perfectly legible, which means they can be OCR'ed at this point. I tried to add a sample of the output, but the SlashCode decided it was smarter than me, and that I didn't really need that extra white space. In spite of enclosing it in <code> tags, no less.

        If you're really going to pursue this idea, I would recommend looking at AltaVista's method. They use different fonts, rotate the images, etc. Harder to match, but I'm thinking some neural network software might be smart enough to extract the images.

        Oh, and in addition to precluding the lynx/links users, you're also going to be cutting out the people who use PDAs with less than VGA resolution. Maybe not many now, but it's a growing userbase.

        --jcwren

        • "All the images I tested were perfectly legible, which means they can be OCR'ed at this point."

          I agree with part 1 and part 2 of the above statement, but not the "which means" part that bridges them -- there exist some legible images which can't be OCR'd.

          Example: http://www.captcha.net/cgi-bin/ez-gimpy [captcha.net]

          The Slash plugin could move to using something like this, if there's a need for it, without too much trouble. The current model is really just infrastructure to allow things like that to happen, plus o