NOTE: use Perl; is on undef hiatus. You can read content, but you can't post it. More info will be forthcoming forthcomingly.

All the Perl that's Practical to Extract and Report



I work for MessageLabs [messagelabs.com] in Toronto, ON, Canada. I write spam filters, MTA software, high performance network software, string matching algorithms, and other cool stuff mostly in Perl and C.

Journal of Matts (1087)

Friday February 01, 2002
02:26 AM

PurePerl speed

[ #2574 ]

I managed to get the parse of large.xml (a 70K file) down from 9 seconds to about 7 or 8 seconds. Not a huge improvement, and I didn't feel my time was terribly well spent, at least until I tried bleadperl (the current 5.7.2): previously it had been significantly slower there (about 17 seconds) and was now down to about 8 or 9 seconds. It will always be slower under a unicode-capable perl, because it does many more unicode checks. So that's good.

Well good is perhaps an overstatement, since libxml2's "xmllint" program takes 29ms to parse the same file. Ah well, I think it's time to stop worrying about parsing performance, and start thinking about full compliance instead.

Why does perl make for such a crappy parser?

  • Because Perl s***s at XML ;--) (I will hunt you down anywhere you hide muaaaahahahaha!)
  • Matt, it may be obvious, but have you compared the algorithms used at xmllint and your Pure Perl parser?

    C allows some optimizations in places where Perl doesn't allow them to occur.
    --
    -- Godoy.
    • The problem is that Perl is just slow, and there's not really much I can do about that. Compare it to C, where you can do really nice things like char = ++*p to get the current character and move to the next byte in a string. With Perl the closest idiom is $char = substr($str, 0, 1, ''), which has a lot more overhead (the same goes for a regexp doing the same job). Character-wise coding in Perl has always been a bit of a pain.
      • Err, that should have been char = *p++.

        My C sucks ;-)
      • Maybe someone needs to write a character-array manipulation class, a la PDL for huge matrix crunching. The class would gain a lot in efficiency for trading away the many capabilities Perl ordinarily gives. This would be something gross in XS, I'm sure.

        Or maybe, if I'm thinking of writing a custom text-manipulation class for Perl, something's dreadfully wrong with the world. In much the same way that we always took XML::Parser's dependence on a C parser as an indication that something was wrong (and we

        --
        J. David works really hard, has a passion for writing good software, and knows many of the world's best Perl programmers
  • I'm presuming the answer is "Yes," but did you profile the code?

    matts: "Yes, jdavidb, I profiled the code and discovered 80% of the processing occurs in statements like $c = substr($buf, 0, 1); Get off my case! :)"

    --
    J. David works really hard, has a passion for writing good software, and knows many of the world's best Perl programmers
    • Hehe, yeah I did profile, lots. (out of interest, anyone know why "use File::Temp" causes DProf to segfault?)

      I'm going to post something to perlmonks including the profiling output and the heavy subs in question. Maybe someone there can help out.