Journal of cwest (1514)

Thursday May 27, 2004
10:12 PM

Parsing Fast

[ #18977 ]

While writing Email::Address I learned something. Originally I implemented it using Parse::RecDescent. I built up a complete grammar with all sorts of actions and eventually produced a wonderful parse tree. It was correct, down to infinitely nested comments. It was also very slow. I couldn't get it on par with Mail::Address, so I had to ditch the grammar.

My next thought was a home-grown tokenizer, like Mail::Internet has, but that sounded dirty. I decided to try regexes, and that worked out. There are some limitations, such as a lack of recursion (without doing some very ugly black magic). I was worried about having to resort to some lame arbitrary level of nesting support, with no way to make it configurable. I knew there had to be a limit; I just didn't want it to be a hard-coded one. Compiled regular expressions to the rescue. Because comments nest, the comment content pattern depends on the comment pattern, and vice versa. Rebuilding both in a loop, with each pass interpolating the previously compiled comment pattern into the content pattern, adds one level of nesting per iteration. That got me a regular expression for the nested structure cheaply.

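# $ctext and $quoted_pair are defined elsewhere; both patterns start out empty.
my ($ccontent, $comment) = ('') x 2;
for (1 .. $COMMENT_NEST_LEVEL) {
   # Each pass wraps the previous $comment in one more level of parentheses.
   $ccontent       = qr/$ctext|$quoted_pair|$comment/;
   $comment        = qr/\s*\((?:\s*$ccontent)*\s*\)\s*/;
}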
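To see the depth limit in action, here is a minimal self-contained sketch of the same idea. The definitions of $ctext and $quoted_pair are simplified stand-ins of my own, not the ones Email::Address uses, and build_comment_re is a hypothetical helper name. It also skips the empty $comment alternative on the first pass, which is a small deviation from the loop above.

```perl
use strict;
use warnings;

# Simplified stand-ins (assumptions, not the module's real definitions):
my $ctext       = qr/[^()\\\s]/;   # one ordinary comment character
my $quoted_pair = qr/\\./s;        # a backslash-escaped character

# Hypothetical helper: build a pattern that accepts comments nested
# up to $depth levels of parentheses.
sub build_comment_re {
    my ($depth) = @_;
    my ($ccontent, $comment) = ('') x 2;
    for (1 .. $depth) {
        # On the first pass there is no inner comment pattern yet.
        $ccontent = length $comment
            ? qr/$ctext|$quoted_pair|$comment/
            : qr/$ctext|$quoted_pair/;
        $comment  = qr/\s*\((?:\s*$ccontent)*\s*\)\s*/;
    }
    return $comment;
}

my $two = build_comment_re(2);
for my $s ('(a)', '(a (b) c)', '(a (b (c)))') {
    # With a depth of 2, the first two strings should match
    # and the triply nested one should not.
    printf "%-14s %s\n", $s, ($s =~ /\A$two\z/ ? 'match' : 'no match');
}
```

Raising the argument to build_comment_re(3) should let the third string match as well, which is the configurable limit the loop is meant to provide.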

So, rock on. Set $COMMENT_NEST_LEVEL as high as you like for deeper
nesting. Now, an excerpt from the docs about the speed I was able to
achieve.

On my 877MHz 12" Apple PowerBook I can run the distributed benchmarks
and get results like this.

$ perl -Ilib bench/ea-vs-ma.pl bench/corpus.txt 5 
               s/iter  Mail::Address Email::Address
Mail::Address    2.44             --           -64%
Email::Address  0.884           176%             --
$ perl -Ilib bench/ea-vs-ma.pl bench/corpus.txt 25
               s/iter  Mail::Address Email::Address
Mail::Address    2.45             --           -73%
Email::Address  0.652           276%             --
$ perl -Ilib bench/ea-vs-ma.pl bench/corpus.txt 50
               s/iter  Mail::Address Email::Address
Mail::Address    2.43             --           -76%
Email::Address  0.585           316%             --
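The layout of these tables looks like the comparison chart printed by the core Benchmark module's cmpthese, so I am assuming that is what the bench script uses. Under that assumption, each percentage is the row's rate relative to the column's rate, minus one. A small sketch that recomputes the first run's percentages from its s/iter column (pct_vs is a hypothetical helper name):

```perl
use strict;
use warnings;

# s/iter values copied from the first benchmark run above.
my %secs = ('Mail::Address' => 2.44, 'Email::Address' => 0.884);

# Percentage by which $row is faster (+) or slower (-) than $col,
# computed from iteration rates (1 / seconds per iteration).
sub pct_vs {
    my ($row, $col) = @_;
    my $rate_row = 1 / $secs{$row};
    my $rate_col = 1 / $secs{$col};
    return 100 * ($rate_row / $rate_col - 1);
}

printf "Email::Address vs Mail::Address: %.0f%%\n",
    pct_vs('Email::Address', 'Mail::Address');
printf "Mail::Address vs Email::Address: %.0f%%\n",
    pct_vs('Mail::Address', 'Email::Address');
```

Because the s/iter figures in the tables are already rounded, recomputing from them can differ from the printed percentage by a point in the last digit.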

Posted from caseywest.com.