I wrote a script that takes "regexp:string" pairs (one per line) and converts the regexps to re2c format [*] and builds an XS module out of them. I managed to match 10k regexps against 10k strings in 0.3s with it, which I think is fairly good.
Here's the code. Please feel free to play with it and let me know if it's useful - particularly if you can patch it to support more of the Perl regexp format.
You need a recent re2c to make it work. Don't expect your XS module to compile quickly if you have a lot of regexps: the 10k regexp test took over 3 hours to compile on my 2GHz Core Duo.
[*] re2c uses an entirely different regexp format from Perl's, so the core of re2xs is a regexp parser that converts to this other format.
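To give a feel for the kind of translation involved, here is a rough sketch in Python. It is not the re2xs parser: `to_re2c` and `TOKEN` are hypothetical names, and the sketch handles only a toy subset (plain literals, simple character classes, grouping, alternation and quantifiers). It assumes re2c writes literal text as double-quoted strings while classes and quantifiers keep roughly their Perl spelling; escapes such as `\d` are rejected rather than translated.

```python
import re

# Toy tokenizer for a tiny regexp subset (hypothetical, not from re2xs).
TOKEN = re.compile(r"""
    (?P<cls>\[[^\]\\]+\])         # simple character class such as [0-9]
  | (?P<quant>\{\d+(?:,\d*)?\})   # counted quantifier such as {2,4}
  | (?P<op>[()|*+?.])             # operators passed through unchanged
  | (?P<lit>[^\\])                # any other single character is a literal
""", re.VERBOSE)


def quote(text):
    """Wrap literal text in double quotes, escaping embedded quotes."""
    return '"' + text.replace('"', '\\"') + '"'


def to_re2c(pattern):
    """Convert a Perl-style pattern from the toy subset to re2c-style text."""
    parts = []    # finished output tokens
    literal = []  # run of literal characters not yet emitted

    def flush():
        if literal:
            parts.append(quote("".join(literal)))
            literal.clear()

    pos = 0
    while pos < len(pattern):
        m = TOKEN.match(pattern, pos)
        if m is None:
            raise ValueError("unsupported syntax at offset %d" % pos)
        kind, tok = m.lastgroup, m.group()
        if kind == "lit":
            literal.append(tok)
        elif kind == "quant" or (kind == "op" and tok in "*+?"):
            # A quantifier binds to the last character only, so split it
            # off the literal run before quoting.
            if literal:
                last = literal.pop()
                flush()
                parts.append(quote(last))
            if not parts:
                raise ValueError("quantifier with nothing to repeat")
            parts[-1] += tok
        else:
            flush()
            parts.append(tok)
        pos = m.end()
    flush()
    return " ".join(parts)


print(to_re2c("viagra[0-9]+"))
```

The one subtle step is the quantifier case: in Perl `ab+` repeats only the `b`, so the sketch emits the `a` and the `b` as separate quoted strings before attaching the `+`.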
coooool (Score:1)
Example please? (Score:2)
but I still can't figure out
Could you please provide a 3 (or so) line data file that we could just c
Re:Example please? (Score:2)
Input is a file that looks like this:
i.e. "regexp:reason"
Then to run, it's just:
Re:Example please? (Score:2)
I'm somewhat disappointed with what the module can do. I was hoping to have a basis to reimplement URI::Find [cpan.org], thus: something that can find matches anywhere in a random text. There are two major reasons why it can't do that. First: it really is a lexer: it can only match prefixes in a string. To use your examp
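The prefix-only limitation can be illustrated with a small Python sketch. This does not use re2c or the generated XS module; the `URL` pattern and the function names `lexer_style` and `finder_style` are made up for the illustration. Python's `re.match` is anchored at the start of the string, which is roughly how a lexer rule behaves, while `finditer` scans the whole text the way a URI::Find-style tool would need to.

```python
import re

# Hypothetical, deliberately simplistic URL pattern for illustration only.
URL = re.compile(r"https?://[^\s]+")


def lexer_style(text):
    """Anchored match: succeeds only if the pattern starts at offset 0."""
    m = URL.match(text)
    return m.group() if m else None


def finder_style(text):
    """Unanchored scan: report every match together with its offset."""
    return [(m.start(), m.group()) for m in URL.finditer(text)]


text = "see http://example.com/ now"
print(lexer_style(text))   # the URL is not at the start of the string
print(finder_style(text))  # the scan finds it at its offset
```

A lexer-style matcher could be made to find matches anywhere by retrying at every offset, but that costs one matcher invocation per character of input.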
Re:Example please? (Score:1)
Matt, does it really support [classes], (alt|er|nations), and {quantifiers}? wow, I wasn't even expecting that!! holy crap.
Re:Example please? (Score:2)
The limitation on just returning the reason is arbitrary - that's all I needed for the given problem domain, but you can definitely return "what matched" and "where in the string?". That should be a simple matter of programming.
(the long compile times are for when you have LOTS of regexps - I compile over 15k into one module).
update (Score:1)
Basically, I investigated using re2c/re2xs as a speed-up engine for SpamAssassin, without much luck. The problem is that re2c can only track one regexp's state at a time, so overlapping regexps are handled inconsistently; users calling the code need to know in advance if one regexp is subsumed by another, otherwise the subsumed regexp will never match when the s
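A rough Python model of the overlap problem, for readers who have not hit it. This is not re2c's generated code: `RULES`, `single_winner` and `all_hits` are hypothetical names, and the model assumes the matcher reports a single rule per input, preferring the longest match with ties going to the earlier rule.

```python
import re

# Hypothetical rule set: the second pattern overlaps with the first.
RULES = [
    ("cheap pills", "PILLS"),
    ("cheap [a-z]+", "CHEAP_ANYTHING"),
]


def single_winner(text, rules):
    """Lexer-style: report one rule only (longest match, earlier rule on ties)."""
    best = None
    for pattern, reason in rules:
        m = re.match(pattern, text)
        if m and (best is None or m.end() > best[0]):
            best = (m.end(), reason)
    return best[1] if best else None


def all_hits(text, rules):
    """Independent matching: report every rule that matches the prefix."""
    return [reason for pattern, reason in rules if re.match(pattern, text)]


print(single_winner("cheap pills", RULES))
print(all_hits("cheap pills", RULES))
```

Under these assumptions, any input that the first rule matches never reports the second one, even though the second pattern matches as well. A scoring system that wants every hit counted would have to know about such overlaps ahead of time.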
SA? (Score:1)
btw, I forgot to ask -- are you OK with licensing that script's code under the ASL for inclusion in SpamAssassin?