All the Perl that's Practical to Extract and Report

  • I've been told these are not just examples of the phenomenon known as "reinventing the wheel": the authors allegedly knew of Regex::PreSuf, and made improved versions. Hopefully. So, it might be worthwhile to actually compare these modules...
    • As the author of Regexp::Assemble, let me weigh in:

      Yes, I knew about Regex::PreSuf (it is referenced in the SEE ALSO section of the documentation). R::PS doesn't deal with metacharacters, so something like a\d+b and a\s+b is going to produce a\[ds]+b, which is broken: it looks for a literal [ and ] instead of digits or whitespace.

      Regexp::List, I knew about, but you'll forgive me if I can't quite recall why I discarded it when I evaluated it. I think it gets exponentially slower as the input list grows.

      Regexp::Assemble comes with a number of scripts in the eg/ directory. One of them, assemble, lets you create the regexp from the command line.

      Given a file containing the text:

      Perl is a language optimized for scanning arbitrary text files, extracting information from those text files, and printing reports based on that information. It's also a good language for many system management tasks. The language is intended to be practical (easy to use, efficient, complete) rather than beautiful (tiny, elegant, minimal).

      Then you can assemble a regular expression from the words without writing a scrap of code (apart, perhaps, from a one-liner to break the text up into words):

      perl -nle 'print lc $1 while /([a-z'"'"']+)/gi' perl.txt | assemble

      Which produces:

      (?:t(?:h(?:(?:os)?e|a[nt])|asks|ext|iny|o)|e(?:(?:fficie|lega)nt|xtracting|asy)|i(?:n(?:formation|tended)|(?:t')?s)|p(?:r(?:actical|inting)|erl)|m(?:an(?:agement|y)|inimal)|(?:complet|languag|us)e|b(?:e(?:autiful)?|ased)|a(?:rbitrary|lso|nd)?|s(?:canning|ystem)|r(?:eports|ather)|f(?:iles|rom|or)|o(?:ptimized|n)|good)

      You can also tell it to put in zero-width lookahead assertions if you think it would make the pattern match (or fail) faster. Of course, if you know your input text contains no metacharacters, Regex::PreSuf is fine.
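
      For anyone who prefers to call the module from code rather than through the assemble script, here is a minimal sketch of the metacharacter case discussed above. It assumes Regexp::Assemble is installed from CPAN and uses only its new, add and re methods; the sample inputs are made up for illustration, and the exact text of the assembled pattern is printed rather than promised here.

```perl
use strict;
use warnings;
use Regexp::Assemble;   # CPAN module, not part of core Perl

my $ra = Regexp::Assemble->new;

# Patterns go in as strings; single quotes keep the backslashes intact.
$ra->add('a\d+b', 'a\s+b');

# re() returns the assembled pattern as a compiled qr// object.
my $re = $ra->re;
print "assembled: $re\n";

# Illustrative inputs (assumptions, not from the original post).
for my $input ('a123b', "a \tb", 'a[ds]b') {
    printf "%-8s %s\n", $input, ($input =~ /^$re$/ ? 'matches' : 'no match');
}
```

      The first two inputs should match and the last should not, because \d and \s are treated as single tokens rather than split into characters.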

      • Thanks for the weigh in! This looks like the industrial strength solution I will put into production. I need to dive into tries also and get a good understanding of those. I like the as_string method for readability here.

        Another fun morning with Perl and Coffee!
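
        On the subject of tries: the basic idea behind merging a word list into one pattern can be sketched in a few lines of core Perl. This is a simplified illustration only, not how Regexp::Assemble or Regex::PreSuf are actually implemented. It shares common prefixes, treats every input as a literal string, and does none of the suffix merging or character-class folding visible in the output above. The names build_trie and trie_to_regex are hypothetical.

```perl
use strict;
use warnings;

# Insert each word into a nested-hash trie, one character per level.
# The empty-string key '' marks "a word ends at this node".
sub build_trie {
    my @words = @_;
    my %trie;
    for my $word (@words) {
        my $node = \%trie;
        $node = ($node->{$_} ||= {}) for split //, $word;
        $node->{''} = 1;
    }
    return \%trie;
}

# Walk the trie and emit a pattern that shares common prefixes.
sub trie_to_regex {
    my ($node) = @_;
    my @alts;
    for my $char (sort grep { $_ ne '' } keys %$node) {
        push @alts, quotemeta($char) . trie_to_regex($node->{$char});
    }
    return '' unless @alts;                 # leaf: nothing follows

    my $optional = exists $node->{''};      # a shorter word ends here
    my $body = (@alts == 1 && !$optional)
        ? $alts[0]                          # single path needs no group
        : '(?:' . join('|', @alts) . ')';
    $body .= '?' if $optional;
    return $body;
}

my @words   = qw(the those that than to);
my $pattern = trie_to_regex(build_trie(@words));
print "$pattern\n";    # prints t(?:h(?:a(?:n|t)|e|ose)|o)

my $re = qr/^$pattern$/;
print "$_: ", (/$re/ ? 'matches' : 'no match'), "\n" for @words, 'this';
```

        Compare the result with the t(?:h(?:(?:os)?e|a[nt])|...) branch in the assembled pattern above: the real module goes further and also folds the shared tail "e" and the single characters n and t.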