All the Perl that's Practical to Extract and Report

  • Thanks Matt -- I'm pretty excited about this code ;)
  • I stared at this piece of self-documenting code for a while:

            my ($regexp, $reason) = /^(.*):(.*)$/;
            $regexp =~ s/^\.\+//;
            print $re "\t", fixup_re($regexp), "               {return \"$reason\";}\n";

    but I still can't figure out

    1. what the "reason" is for
    2. how to use the compiled resulting XS module from a Perl script

    Could you please provide a 3 (or so) line data file that we could just c

    • Yeah sorry - it's a hack that I didn't have any time to document.

      Input is a file that looks like this:

      (duc|lgh|sw)[0-9]+[ab]*\.old\.fagotten\.ac:generic
      [0-9a-f]+\.myntet\.ac:generic
      [0-9]+[a-z]\.old\.myntet\.ac:generic
      lgh[0-9]+\-p[0-9]+\.nejlikan\.ac:edu

      i.e. "regexp:reason"
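For readers following along, here is a minimal sketch of how one such line splits into its two parts, reusing the two regexp lines quoted from the generator earlier in the thread. The helper name parse_rule is hypothetical, and the generator's fixup_re step is left out because its body is not shown here.

```perl
use strict;
use warnings;

# parse_rule is a hypothetical name; the two regexps inside it are the ones
# quoted from the generator script. Because the first (.*) is greedy, the
# LAST colon on the line separates the regexp from the reason.
sub parse_rule {
    my ($line) = @_;
    chomp $line;
    my ($regexp, $reason) = $line =~ /^(.*):(.*)$/
        or return;              # no colon: not a rule line
    $regexp =~ s/^\.\+//;       # drop a leading ".+", as the generator does
    return ($regexp, $reason);
}

my @lines = (
    '(duc|lgh|sw)[0-9]+[ab]*\.old\.fagotten\.ac:generic',
    'lgh[0-9]+\-p[0-9]+\.nejlikan\.ac:edu',
);
for my $line (@lines) {
    my ($regexp, $reason) = parse_rule($line);
    print "regexp=$regexp reason=$reason\n";
}
```

The reason is simply the string the generated scanner hands back when that regexp matches.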

      Then to run, it's just:

      use MyModule;
       
      if (my $reason = MyModule::scan($string)) {
          print "Matched: $reason\n";
      }
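To try a rule list before building the XS module, a pure-Perl stand-in along these lines may help. It is only a sketch: scan_pp is a hypothetical name, it treats each rule as a Perl regexp anchored at the start of the string, and it returns the reason of the first rule that matches. The generated lexer may pick between overlapping rules differently, and re2c syntax is not identical to Perl's.

```perl
use strict;
use warnings;

# Hypothetical pure-Perl stand-in for MyModule::scan.
# Assumption: rules are tried in order as Perl regexps anchored at the
# start of the string, and the first one that matches wins.
sub scan_pp {
    my ($string, @rules) = @_;
    for my $rule (@rules) {
        my ($regexp, $reason) = @$rule;
        return $reason if $string =~ /\A(?:$regexp)/;
    }
    return;    # no rule matched
}

my @rules = (
    [ '[0-9a-f]+\.myntet\.ac',            'generic' ],
    [ 'lgh[0-9]+\-p[0-9]+\.nejlikan\.ac', 'edu'     ],
);

if (my $reason = scan_pp('lgh12-p3.nejlikan.ac', @rules)) {
    print "Matched: $reason\n";
}
```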

      • OK, I tried it... and it works. Kind of: I had to delete the "time" in the system() call, as Windows doesn't support it. Anyway, the sample compiled fast. Very fast. You scared me for no reason. :)

        I'm somewhat disappointed with what the module can do. I was hoping to have a basis to reimplement URI::Find [cpan.org], thus: something that can find matches anywhere in a random text. There are two major reasons why it can't do that. First: it really is a lexer: it can only match prefixes in a string. To use your examp
        • you might want to keep an eye on http://svn.apache.org/viewvc/spamassassin/branches/jm_re2c_hacks/rule2xs/ [apache.org] -- I'm hacking away on it for SpamAssassin, and I think with work you could probably find those features working there.

          Matt, does it really support [classes], (alt|er|nations), and {quantifiers}? wow, I wasn't even expecting that!! holy crap.
        • To match anywhere in the string, prefix the regexp with ".*". I haven't quite figured out how to tie it to the end of the string yet, but it should be doable with what the code generates.

          The limitation on just returning the reason is arbitrary - that's all I needed for the given problem domain, but you can definitely return "what matched" and "where in the string?". That should be a simple matter of programming.

          (the long compile times are for when you have LOTS of regexps - I compile over 15k into one module).
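In plain Perl regexp terms, the ".*" prefix and the "what matched and where" idea could look roughly like the sketch below. It is not the generated lexer: find_anywhere is a hypothetical helper, and it uses a non-greedy ".*?" so that the leftmost match is the one reported.

```perl
use strict;
use warnings;

# Hypothetical helper using ordinary Perl regexps, not the re2c output.
sub find_anywhere {
    my ($string, $regexp) = @_;
    # A leading ".*?" lets an anchored match start anywhere; the outer
    # capture group records what the rule itself matched.
    return unless $string =~ /\A.*?($regexp)/s;
    # $-[1] and $+[1] are the start and end offsets of capture group 1.
    return { text => $1, start => $-[1], end => $+[1] };
}

my $hit = find_anywhere('host lgh12-p3.nejlikan.ac is up',
                        'lgh[0-9]+\-p[0-9]+\.nejlikan\.ac');
printf "%s at %d..%d\n", @{$hit}{qw(text start end)} if $hit;
```

Because the outer parentheses open first, they stay group 1 even when the rule has its own groups, such as (duc|lgh|sw).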
  • Matt, you've seen this already -- but for random googlers who come across this page, http://taint.org/2006/08/17/125452a.html [taint.org] may be worth reading.

    Basically, I investigated using re2c/re2xs as a speed-up engine for SpamAssassin, without much luck. The problem is that re2c can only track one regexp's state at a time, so overlapping regexps are handled inconsistently; users calling the code need to know in advance if one regexp is subsumed by another, otherwise the subsumed regexp will never match when the s
  • well, I spoke too soon -- it looks like it's pretty fast nowadays ;) consistently provides about 20% speedup for me, which is nice.

    btw, I forgot to ask -- are you OK with licensing that script's code under the ASL for inclusion in SpamAssassin?