
All the Perl that's Practical to Extract and Report

  • I stared at this piece of self-documenting code for a while:

            my ($regexp, $reason) = /^(.*):(.*)$/;
            $regexp =~ s/^\.\+//;
            print $re "\t", fixup_re($regexp), "               {return \"$reason\";}\n";

    but I still can't figure out

    1. what the "reason" is for
    2. how to use the compiled resulting XS module from a Perl script

    Could you please provide a 3 (or so) line data file that we could just c

    • Yeah sorry - it's a hack that I didn't have any time to document.

      Input is a file that looks like this:

      (duc|lgh|sw)[0-9]+[ab]*\.old\.fagotten\.ac:generic
      [0-9a-f]+\.myntet\.ac:generic
      [0-9]+[a-z]\.old\.myntet\.ac:generic
      lgh[0-9]+\-p[0-9]+\.nejlikan\.ac:edu

      i.e. "regexp:reason"
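
As a rough illustration of that input format, here is a minimal sketch that splits "regexp:reason" lines with the same pattern the generator script uses. The helper name parse_rule is made up for this example and is not part of the original code.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original script): split one
# "regexp:reason" line into its two fields.
sub parse_rule {
    my ($line) = @_;
    chomp $line;
    # The first (.*) is greedy, so the split happens at the LAST colon:
    # the regexp may contain colons, the reason may not.
    my ($regexp, $reason) = $line =~ /^(.*):(.*)$/
        or return;
    return ($regexp, $reason);
}

# Two lines taken from the sample data file above.
my @rules = map { [ parse_rule($_) ] } (
    '(duc|lgh|sw)[0-9]+[ab]*\.old\.fagotten\.ac:generic',
    'lgh[0-9]+\-p[0-9]+\.nejlikan\.ac:edu',
);

printf "%-50s => %s\n", @$_ for @rules;
```

The reason string is simply whatever follows the last colon; it is what the generated scanner hands back when that rule matches.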

      Then to run, it's just:

      use MyModule;
       
      if (my $reason = MyModule::scan($string)) {
          print "Matched: $reason\n";
      }

      • OK, I tried it... and it works. Kind of — I had to delete the "time" in the system() call as Windows doesn't support it. Anyway, the sample compiled fast. Very fast. You scared me for no reason. :)

        I'm somewhat disappointed with what the module can do. I was hoping to have a basis to reimplement URI::Find [cpan.org], that is, something that can find matches anywhere in arbitrary text. There are two major reasons why it can't do that. First: it really is a lexer, so it can only match prefixes of a string. To use your example:
        (duc|lgh|sw)[0-9]+[ab]*\.old\.fagotten\.ac:generic
        matches both "duc123.old.fagotten.ac" and "duc123.old.fagotten.acdc", but not "viaduc123.old.fagotten.ac". Second: it doesn't return what it matched, or even the length of the match, it just returns a "reason", in this case: the string "generic". That's not very useful, even for a lexer.
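
The prefix-only behaviour can be mimicked in plain Perl by anchoring the pattern at the start of the string with \A. This is only a stand-in for illustration, assuming the generated scanner behaves as described above; it does not use re2c or the XS module, and prefix_match is a made-up name.

```perl
use strict;
use warnings;

# The first rule from the sample data file, anchored at the start of the
# string only (no end anchor), which is how a lexer rule behaves.
my $RULE = qr/\A(?:(?:duc|lgh|sw)[0-9]+[ab]*\.old\.fagotten\.ac)/;

# Hypothetical helper: true if $string starts with something the rule accepts.
sub prefix_match {
    my ($string) = @_;
    return $string =~ $RULE ? 1 : 0;
}

for my $host (qw(
    duc123.old.fagotten.ac
    duc123.old.fagotten.acdc
    viaduc123.old.fagotten.ac
)) {
    printf "%-28s %s\n", $host, prefix_match($host) ? 'matches' : 'no match';
}
```

The first two hostnames match because the rule is satisfied by their prefix; the third does not because the match cannot start at offset 0.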

        What I would really love to see is a substring search using a scheme resembling Boyer-Moore [utexas.edu] for skipping over uninteresting submatches: based on what was seen earlier, you know that some prefixes can't match, so you can skip them. Compared with a fixed search string, I'm guessing this will be anything but trivial, since what a regex matches has neither a fixed length nor fixed characters in each position. I have doubts that it's even possible in the general case.
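
For reference, this is roughly what the skipping idea looks like for a fixed search string. The sketch uses the Horspool simplification of Boyer-Moore (bad-character shifts only), not the full algorithm, and it says nothing about how to extend the idea to regexps. The function name horspool_index is made up for this example.

```perl
use strict;
use warnings;

# Boyer-Moore-Horspool search for a FIXED string. Returns the 0-based
# offset of the first occurrence of $needle in $haystack, or -1.
sub horspool_index {
    my ($haystack, $needle) = @_;
    my ($n, $m) = (length $haystack, length $needle);
    return 0  if $m == 0;
    return -1 if $m > $n;

    # Shift table: for every needle character except the last one, how far
    # the window may slide when that character sits under the window's
    # final position. Later occurrences overwrite earlier ones, giving the
    # smaller (safe) shift.
    my %shift;
    $shift{ substr($needle, $_, 1) } = $m - 1 - $_ for 0 .. $m - 2;

    my $pos = 0;
    while ($pos <= $n - $m) {
        return $pos if substr($haystack, $pos, $m) eq $needle;
        # Characters that never occur in the needle allow a full-length skip.
        my $last = substr($haystack, $pos + $m - 1, 1);
        $pos += exists $shift{$last} ? $shift{$last} : $m;
    }
    return -1;
}

print horspool_index('viaduc123.old.fagotten.ac', 'duc'), "\n";
```

The shift table works only because every position of the needle holds one known character, which is exactly the property a regexp lacks.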

        I don't expect that re2c would support anything even remotely like it.

        In summary: I find it simply amazing how easy you make it to generate an XS module from code generated by an external tool. The limitations that keep it from being really useful for the applications I'm thinking of are probably limitations in what re2c can do.
        • you might want to keep an eye on http://svn.apache.org/viewvc/spamassassin/branches/jm_re2c_hacks/rule2xs/ [apache.org] -- I'm hacking away on it for SpamAssassin, and I think with work you could probably find those features working there.

          Matt, does it really support [classes], (alt|er|nations), and {quantifiers}? wow, I wasn't even expecting that!! holy crap.
        • To match anywhere in the string, prefix the regexp with ".*". I haven't quite figured out how to anchor to the end of the string yet, but it should be doable with what the code generates.

          The limitation on just returning the reason is arbitrary - that's all I needed for the given problem domain, but you can definitely return "what matched" and "where in the string?". That should be a simple matter of programming.
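
To make the "what matched" and "where in the string?" idea concrete, here is a plain-Perl sketch of the kind of return value that would carry that information. It is not the generated XS code: the rule table and the name scan_with_position are assumptions for this example, and it relies on Perl's own regex engine and the @- and @+ offset arrays rather than re2c.

```perl
use strict;
use warnings;

# Hypothetical rule table; patterns and reasons come from the sample data file.
my @rules = (
    [ qr/(?:duc|lgh|sw)[0-9]+[ab]*\.old\.fagotten\.ac/, 'generic' ],
    [ qr/lgh[0-9]+\-p[0-9]+\.nejlikan\.ac/,             'edu'     ],
);

# Returns (reason, matched text, start offset) for the first rule in the
# table that matches anywhere in $string, or an empty list if none does.
sub scan_with_position {
    my ($string) = @_;
    for my $rule (@rules) {
        my ($re, $reason) = @$rule;
        if ($string =~ $re) {
            # $-[0] and $+[0] are the start and end offsets of the whole match.
            my ($start, $end) = ($-[0], $+[0]);
            return ($reason, substr($string, $start, $end - $start), $start);
        }
    }
    return;
}

if (my ($reason, $text, $offset) = scan_with_position('host viaduc123.old.fagotten.ac seen')) {
    print "Matched '$text' at offset $offset: $reason\n";
}
```

Rules are tried in table order, so the first rule that matches anywhere wins even if a later rule would match earlier in the string.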

          (the long compile times are for when you have LOTS of regexps - I compile over 15k into one module).