The PerlNet Wiki uses MediaWiki, which is written in PHP, and so are the plugin scripts we use to manage it. As the wiki is open to anonymous edits, we have a fun arms race with spammers. Most of the time, they get caught by our blacklist and their edits don't even get saved. Recently, however, one spammer has done something clever...
When I first spotted this afternoon's spam I checked it for common URLs and found that most links pointed to ifrance.com. I would have added it to our blacklist but it was already there! So I checked that the anti-spam bot was still active, and it was. Why wasn't this page being picked up? I took a copy of the cleanup script, printed out the regex and ran that against some of the page text: yup, it matched!
I printed out what the script was seeing as the text and it matched the page content. I printed that to a file and ran a simple regex over it: yup, it matched. I ran the bigger regex over it: no match.
I looked at the data again. I couldn't see anything special about it except that all of it was on one line. Well... surely it couldn't be a memory issue. I was reading the whole text into memory before performing the regex, so how would line boundaries make a difference? I wasted time looking into other possibilities.
I eventually came back to the fact that it was a _very_ long single line... that a simple regex could match. Could that be it anyway? I removed a few thousand characters and wow! It started matching again.
I eventually found that (for the size of regular expression we're using) strings of 13808 characters or fewer would match, but anything longer and the match would fail... silently. I did this with the following code:
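<?php
# /tmp/spam.txt is a PHP file that defines $test (the page text)
include( '/tmp/spam.txt' );
# Shorten the process a little: drop the last 1160 characters.
# ($length was not yet set at this point in the original script, so the
# third argument evaluated to -1160, which has the same effect.)
$test = substr($test, 0, -1160);
# $re removed for brevity
# while I can't match, shorten the string
while(! preg_match($re, $test, $match)) {
    $length = strlen($test);
    print "$length\n";
    $test = substr($test, 0, $length-1);
}
# Yay I matched!
print "match\n";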
My string started with 14979 characters!
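The loop above removes one character per pass, which is fine for a thousand-odd characters but slow for bigger inputs. A bisection finds the same boundary in about 14 tries for a string this size. This is only a sketch: `longest_matching_prefix` and its callback argument are names made up for illustration, and it assumes that once a prefix is short enough to match, every shorter prefix it probes also matches.

```php
<?php
// Hypothetical helper (not part of the original script).
// Returns the largest prefix length of $text for which $matches(prefix)
// is true, ASSUMING the answer flips from true to false exactly once
// as the prefix grows.
function longest_matching_prefix($text, $matches)
{
    $lo = 0;              // longest length treated as matching so far
    $hi = strlen($text);  // longest length that might still match
    while ($lo < $hi) {
        // Round up so the loop always makes progress when $hi == $lo + 1.
        $mid = (int) (($lo + $hi + 1) / 2);
        if (call_user_func($matches, substr($text, 0, $mid))) {
            $lo = $mid;       // a prefix of $mid chars matches: look longer
        } else {
            $hi = $mid - 1;   // too long: look shorter
        }
    }
    return $lo;
}
```

With the real data the callback would wrap the real regex, for example `function ($s) use ($re) { return preg_match($re, $s) === 1; }` (anonymous functions need PHP 5.3 or later, so not the PHP 4 install this was found on).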
I wondered how much of this was because it was a very long _line_ as opposed to a very long string. So I edited the data file to add a newline after each URL. It matched immediately!
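Since breaking the text up made the big regex behave, one possible workaround is to hand the regex small pieces instead of the whole page. The sketch below is an assumption-laden illustration, not what the cleanup script actually does: `matches_any_chunk` and the 1000-byte cap are invented here, it relies on simple patterns still working on the long text (as the simple regex did above), and whether it avoids the failure on any particular PHP/PCRE version is untested.

```php
<?php
// Hypothetical workaround sketch (names and the 1000-byte cap are
// assumptions). Runs $re against small pieces of $text rather than
// against the whole string at once.
function matches_any_chunk($re, $text, $maxLen = 1000)
{
    // Split on runs of whitespace, dropping empty pieces.
    $pieces = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);
    foreach ($pieces as $piece) {
        // Cap each piece so one giant unbroken token cannot recreate
        // the original problem. str_split() needs PHP 5 or later.
        foreach (str_split($piece, $maxLen) as $chunk) {
            if (preg_match($re, $chunk) === 1) {
                return true;
            }
        }
    }
    return false;
}
```

The obvious cost is that a blacklisted URL which happens to straddle a chunk boundary would be missed, so the cap should be much longer than any single blacklist entry.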
I thought about the length of the regular expression (it's 2584 characters). The simple regular expression ifrance\.com had worked, so I wondered if the failure was due to alternation or capturing. I added in a small hunk of the real regex for about 30 characters (4 alternations) and it still matched. Removing a third of the real regex length (string length, not necessarily alternation opportunities) resulted in matching the string one character earlier but that was it.
Odd.
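The worst part of this was that the failure was silent. On PHP 5.2.0 or later (so not the PHP 4 this was found on), `preg_last_error()` reports whether the last PCRE call hit an internal error, which gives a way to tell "no match" apart from "the engine gave up". A minimal sketch, assuming a made-up wrapper name `strict_match`; it does not say anything about what caused the failure described here:

```php
<?php
// Hypothetical wrapper (the name strict_match is an assumption).
// Returns true for a match and false for a genuine non-match, and
// throws if PCRE reported an error instead of a result.
// Requires PHP 5.2.0+ for preg_last_error().
function strict_match($re, $text)
{
    $result = preg_match($re, $text);
    // preg_match() returns 1 on match, 0 on no match, false on failure.
    // Checking preg_last_error() as well covers cases where the error
    // is only reported through the error code.
    if ($result === false || preg_last_error() !== PREG_NO_ERROR) {
        throw new RuntimeException(
            'preg_match failed, PCRE error code ' . preg_last_error()
        );
    }
    return $result === 1;
}
```

A spam filter built on this could refuse the edit (or flag it for review) when the exception is thrown, rather than treating an engine failure as a clean page.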
Seems like a bug (Score:2)
This seems like a bug, and needs to be reported. Can you supply the offending string and regex? It may be a bug in PCRE [wikipedia.org], which is the regex engine that PHP uses for this. Trying the same with an equivalent ANSI C program using PCRE may be instructive in pinpointing the problem.
Regards, Shlomi Fish.
Re: (Score:1)
That's what I think too. Mind you, I'm using PHP 4.3.10-22 so maybe it's already been fixed. I mentioned it only because it seemed so unlikely.
If you have a more up-to-date version of PHP or just want to have a play, then you can download my test script and two test files (both of which fail at first and then match as they lose length) from http://perltraining.com.au/~jarich/php-pcre.tgz [perltraining.com.au].
The two files are clean.txt and dirty.txt. dirty.txt is one page of the spam that was successfully tricking the re
Re: (Score:2)
Running a slightly modified test script against php-cli-5.2.3-10mdv2008.0 with libpcre0-7.2-1mdv2008.0, I'm getting:
So it got worse. I can later try it with grep -P or with pcregrep.
Kind of a known bug.... sort of (Score:1)
I brought this up with Ben Balbo (an excellent PHP programmer) and he mentioned that there are two similar bugs which have been submitted in the past.
As he says:
As the PCRE website [pcre.org] appears to be having problems, I'm at a loss how to get this issue fixed. I've worked around it, but really I'd just rather