
jarich (4909)

http://www.perltraining.com.au/

I run Perl Training Australia [perltraining.com.au] with pjf [perl.org] and do a lot of the course writing and maintenance. I also organise the courses we run, so if you want one, just ask. I hang around a bit on Perlmonks [perlmonks.org] and also help run Melbourne Perl Mongers [pm.org].

Journal of jarich (4909)

Monday August 27, 2007
06:00 AM

PHP oddity

[ #34234 ]

The PerlNet Wiki uses MediaWiki, which is written in PHP, as are the plugin scripts we use to manage it. As the wiki is open to anonymous edits, we have a fun arms race with spammers. Most of the time, they get caught by our blacklist and their edits don't even get saved. Recently, however, one spammer has done something clever...

When I first spotted this afternoon's spam I checked it for common URLs and found that most links pointed to ifrance.com. I would have added it to our blacklist but it was already there! So I checked that the anti-spam bot was still active, and it was. Why wasn't this page being picked up? I took a copy of the cleanup script, printed out the regex and ran that against some of the page text: yup, it matched!

I printed out what the script was seeing as the text, and it matched the page content. I printed that to a file and ran a simple regex over it: yup, it matched. I ran the bigger regex over it: no match.

I looked at the data again. I couldn't see anything special about it except that all of it was on one line. Well... surely it couldn't be a memory issue. I was reading the whole text into memory before performing the regex, so how would line boundaries make a difference? I wasted time looking into other possibilities.

I eventually came back to the fact that it was a _very_ long single line... that a simple regex could match. Could that be it anyway? I removed a few thousand characters and wow! It started matching again.

I eventually found that (for the size of regular expression we're using) strings of 13808 characters or fewer would match, but any more and the match would fail... silently. I did this with the following code:

<?php
# defines $test
include( '/tmp/spam.txt' );

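# Shorten the process a little....
# ($length must be set first; it was previously used before being defined)
$length = strlen($test);
$test = substr($test, 0, $length-1160);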

# $re removed for brevity
# while I can't match, shorten the string
while(! preg_match($re, $test, $match)) {

        $length = strlen($test);

        print "$length\n";

        $test = substr($test, 0, $length-1);
}
# Yay I matched!
print "match\n";

My string started with 14979 characters!
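
The loop above drops one character per iteration, which is fine for a thousand or so steps but slow for bigger gaps. Here is a minimal sketch of a binary search for the same threshold. It assumes that every prefix shorter than the threshold matches and every longer one fails, which is what the experiment suggests but is not guaranteed. largest_matching_length() and the stand-in predicate are hypothetical names, not part of my original script; in real use the predicate would wrap preg_match() with the blacklist regex. The anonymous function needs PHP 5.3 or later.

```php
<?php
# Hypothetical helper: largest prefix length of $text for which $matches
# returns true, assuming all shorter prefixes match too.
function largest_matching_length($text, $matches) {
    $lo = 0;               # a result of 0 means no prefix matched
    $hi = strlen($text);   # longest candidate prefix
    while ($lo < $hi) {
        # round up so the loop always makes progress
        $mid = (int) (($lo + $hi + 1) / 2);
        if ($matches(substr($text, 0, $mid))) {
            $lo = $mid;        # this prefix matches, try longer
        } else {
            $hi = $mid - 1;    # this prefix fails, try shorter
        }
    }
    return $lo;
}

# Stand-in predicate (an assumption for illustration): pretends that
# matching fails above 13808 characters, like the behaviour described.
$fake_match = function ($s) { return strlen($s) <= 13808; };

print largest_matching_length(str_repeat('x', 14979), $fake_match) . "\n";
```

With the stand-in predicate this prints 13808 after about 14 probes instead of over a thousand.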

I wondered how much of this was because it was a very long _line_ as opposed to a very long string. So I edited the data file to add a newline after each URL. It matched immediately!
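
Since breaking up the line made the match succeed, one possible workaround is to test each whitespace-separated chunk of the page on its own, so that no single subject string is very long. This is only a sketch under that assumption: any_chunk_matches() is a hypothetical helper, the short regex stands in for the real blacklist regex, and a pattern that is meant to span whitespace would be missed by this approach.

```php
<?php
# Hypothetical helper: true if any whitespace-separated chunk of $text
# matches $re. Each chunk is matched separately.
function any_chunk_matches($re, $text) {
    $chunks = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);
    foreach ($chunks as $chunk) {
        if (preg_match($re, $chunk) === 1) {
            return true;
        }
    }
    return false;
}

# Stand-in regex; the real blacklist regex is much longer.
$re = '/ifrance\.com/';
$page = "buy now http://example.ifrance.com/pills more text";
print any_chunk_matches($re, $page) ? "spam\n" : "clean\n";
```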

I thought about the length of the regular expression (it's 2584 characters). The simple regular expression ifrance\.com had worked, so I wondered if the failure was due to alternation or capturing. I added in a small hunk of the real regex, about 30 characters (4 alternations), and it still matched. Removing a third of the real regex length (string length, not necessarily alternation opportunities) resulted in matching the string one character earlier, but that was it.

Odd.
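
Part of what makes a failure like this silent is that `! preg_match(...)` treats "no match" and "something went wrong" the same way. In PHP 5.2 and later, preg_last_error() reports whether the last PCRE call hit an internal limit (for example PREG_BACKTRACK_LIMIT_ERROR); it isn't available in the PHP 4.3 I'm using. Whether my failure really is such a limit is an assumption, not something I've established. A minimal sketch, with match_status() as a hypothetical name:

```php
<?php
# Hypothetical wrapper: tells "no match" apart from a PCRE error.
# Requires PHP 5.2+ for preg_last_error().
function match_status($re, $subject) {
    $result = preg_match($re, $subject);
    # Check both signals: a false return value and a recorded error code.
    if ($result === false || preg_last_error() !== PREG_NO_ERROR) {
        return 'error (code ' . preg_last_error() . ')';
    }
    return $result === 1 ? 'match' : 'no match';
}

# Stand-in regex; the real blacklist regex is much longer.
print match_status('/ifrance\.com/', 'http://spam.ifrance.com/') . "\n";
print match_status('/ifrance\.com/', 'http://example.org/') . "\n";
```

On those versions the limits themselves are controlled by the pcre.backtrack_limit and pcre.recursion_limit ini settings.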

  • This seems like a bug and needs to be reported. Can you supply the offending string and regex? It may be a bug in PCRE [wikipedia.org], which is the regex engine that PHP uses for this. Trying the same with an equivalent ANSI C program using PCRE may be instructive in pinpointing the problem.

    Regards, Shlomi Fish.

    • That's what I think too. Mind you, I'm using PHP 4.3.10-22 so maybe it's already been fixed. I mentioned it only because it seemed so unlikely.

      If you have a more up-to-date version of PHP or just want to have a play, then you can download my test script and two test files (both of which fail at first and then match as they lose length) from http://perltraining.com.au/~jarich/php-pcre.tgz [perltraining.com.au].

      The two files are clean.txt and dirty.txt. dirty.txt is one page of the spam that was successfully tricking the re

      • Running a slightly modified test script against php-cli-5.2.3-10mdv2008.0 with libpcre0-7.2-1mdv2008.0, I'm getting:

        746
        745
        744
        743
        742
        741
        740
        739
        738
        737
        Matched spam.

        So it got worse. I can later try it with grep -P or with pcregrep.

  • I brought this up with Ben Balbo (an excellent PHP programmer) and he mentioned that there are two similar bugs which have been submitted in the past.

    As he says:

    The first suggests it's a limitation of PCRE, and the second simply dismisses it as not implying a bug in PHP itself.

    As the PCRE website [pcre.org] appears to be having problems, I'm at a loss how to get this issue fixed. I've worked around it, but really I'd just rather