Journal of darobin (1316)

Monday January 28, 2002
05:33 PM

Benchmarking SAX

[ #2479 ]

I've been benchmarking three PerlSAX2 parsers against one another, and the results may be useful:

Benchmark: timing 4 iterations of libxml, purepl, expat...
    libxml:   6 wallclock secs ( 5.06 usr +  0.10 sys =  5.16 CPU) @  0.78/s (n=4)
    purepl: 321 wallclock secs (289.29 usr +  6.87 sys = 296.16 CPU) @  0.01/s (n=4)
     expat:  10 wallclock secs ( 9.50 usr +  0.09 sys =  9.59 CPU) @  0.42/s (n=4)

       s/iter purepl  expat libxml
purepl   74.0     --   -97%   -98%
 expat   2.40  2988%    --    -46%
libxml   1.29  5640%    86%     --

So XML::LibXML::SAX::Parser is clearly the fastest, then XML::SAX::Expat, and finally XML::SAX::PurePerl. However, this does not mean you should jump straight to XML::LibXML::SAX::Parser and forget about the rest.
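For anyone who wants to run a comparison along these lines, here is a minimal sketch using the core Benchmark module. It is not the script that produced the numbers above: the synthetic document from `make_document` and its size are assumptions made for illustration, and any parser that is not installed is simply skipped.

```perl
use strict;
use warnings;
use Benchmark qw(timethese cmpthese);

# Labels as in the results above, mapped to the parser classes discussed.
my %candidates = (
    libxml => 'XML::LibXML::SAX::Parser',
    purepl => 'XML::SAX::PurePerl',
    expat  => 'XML::SAX::Expat',
);

# Hypothetical test input: a synthetic document, not the one behind
# the figures quoted in this entry.
sub make_document {
    my ($items) = @_;
    return '<doc>'
         . join('', map { qq{<item id="$_">text $_</item>} } 1 .. $items)
         . '</doc>';
}

my $xml = make_document(500);

my %tests;
for my $label (sort keys %candidates) {
    my $class = $candidates{$label};

    # Skip parsers that are not installed on this machine.
    next unless eval "require $class; require XML::SAX::Base; 1";

    $tests{$label} = sub {
        # A bare XML::SAX::Base handler ignores every event, so only
        # the parsing itself is being timed.
        my $parser = $class->new(Handler => XML::SAX::Base->new);
        $parser->parse_string($xml);
    };
}

# Four iterations each, then the comparison table.
cmpthese(timethese(4, \%tests)) if %tests;
```

Timing against a do-nothing handler keeps handler cost out of the picture; a real application will of course add its own.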

XML::LibXML::SAX::Parser needs to load the entire document into memory before it starts firing off SAX events, which can be extremely expensive memory-wise. The reason for this is that libxml2 does not have a real SAX interface (at the C level) and that building one on its API is a true pain. Matt mentioned trying to work on that, though it's an insanely hard task and not necessarily a worthwhile one: XML::LibXML::SAX::Parser can very well remain the fastest SAX parser for small documents.

Neither of the latter two has been optimized yet. Some profiling would certainly help (if you're interested in profiling and have some spare time, I'm sure Matt would appreciate the help with XML::SAX::PurePerl). Also, the current plan is to convert XML::SAX::Expat to XS. At present it sits on top of XML::Parser, but that was only meant as a stopgap during development. Once it is considered stable, further development will focus on interfacing it directly with the expat library, which will make it a lot faster. XML::SAX::PurePerl will probably never be ported to XS, for the very good reason that its goal is not speed but portability. The idea is that if you need an XML parser somewhere you can't compile XS modules, or have trouble doing so, you can always use XML::SAX::PurePerl. It'll be slower, but not dead slow unless your task is very XML-intensive (the above benchmark is rather intensive, on purpose).

All of this has led me to wonder whether we shouldn't have qualified module selection in XML::SAX::ParserFactory (the module that returns a SAX2 parser based on which ones are available and which features you request from it). By that I mean having a way to tell it "I want the fastest, no matter what", or "I need to save memory", or "I need portability". Right now the situation is clear, but things may change between versions, and new parsers may appear. Obviously, this wouldn't affect non-XML SAX parsers. The problem I see with this is that we need the parsers to be categorized somehow, which is probably a pain to figure out. Ideas welcome ;-)
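To make the idea a bit more concrete, here is a rough sketch of what such a selection could look like. None of this exists in XML::SAX::ParserFactory: `pick_parser`, `is_installed` and the `%PREFERENCE` table are hypothetical names. The speed ranking follows the benchmark above and the other two put libxml last for memory and PurePerl first for portability, as discussed; the remaining ordering is a guess.

```perl
use strict;
use warnings;

# Hypothetical preference table, NOT part of XML::SAX::ParserFactory.
# Only the 'speed' ordering is backed by the benchmark; the rest is
# partly an assumption.
my %PREFERENCE = (
    speed       => [qw(XML::LibXML::SAX::Parser XML::SAX::Expat XML::SAX::PurePerl)],
    memory      => [qw(XML::SAX::Expat XML::SAX::PurePerl XML::LibXML::SAX::Parser)],
    portability => [qw(XML::SAX::PurePerl XML::SAX::Expat XML::LibXML::SAX::Parser)],
);

# True if the given class can be loaded on this machine.
sub is_installed {
    my ($class) = @_;
    (my $file = "$class.pm") =~ s{::}{/}g;
    return eval { require $file; 1 } ? 1 : 0;
}

# Return the best-ranked parser class for a goal, or undef if none of
# the ranked parsers is available. The availability check can be
# overridden, which is mostly useful for testing.
sub pick_parser {
    my ($goal, $available) = @_;
    $available ||= \&is_installed;

    my $ranked = $PREFERENCE{$goal}
        or die "unknown goal '$goal'\n";

    for my $class (@$ranked) {
        return $class if $available->($class);
    }
    return;
}
```

The class name that comes back could then be instantiated directly, or (in releases of XML::SAX that support it, as far as I know) assigned to `$XML::SAX::ParserPackage` before asking the factory for a parser. The hard part remains the table itself: somebody has to maintain those rankings.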

The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • Err, we do have qualified module selection. It's probably not quite fine-grained enough yet (i.e. you can ask for a Namespace-compatible parser, or a Validating parser, but you can't ask for a Fast one, or a low-memory one), but those are just further features that we can define.

    Thanks for doing the benchmark though.
    • Yes I know we have qualified module selection, but not in the way that I think is compatible with what I was describing. Right now you can require a Feature, but it'll blow up if no parser has it.

      Besides, this relies on the fact that module authors declare their parsers to be fast, etc... I don't know if everyone will want to say "My parser is a slow memory hog" ;-)


      -- Robin Berjon

      • But we can use the benchmark to convince the authors... which in any case is going to be Matt most of the time.
        Seriously, I think that the memory/speed characteristics of each module are quite well known.
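
As a footnote to the thread above: the "blows up if no parser has it" behaviour can be softened on the caller's side. The sketch below uses the factory's `require_feature` and `parser` methods; the wrapper `parser_with_fallback` is a hypothetical helper, not something XML::SAX provides, and it assumes XML::SAX is installed when it is called.

```perl
use strict;
use warnings;

# Hypothetical helper: ask for a set of features, but fall back to the
# default parser instead of dying when nothing can provide them.
# Returns the parser plus a flag saying whether the features were met.
sub parser_with_fallback {
    my ($features, %args) = @_;
    require XML::SAX::ParserFactory;    # loaded lazily; needs XML::SAX

    my $factory = XML::SAX::ParserFactory->new;
    $factory->require_feature($_) for @$features;

    # parser() throws when no known parser offers every required feature.
    my $parser = eval { $factory->parser(%args) };
    return ($parser, 1) if $parser;

    warn "no parser with all requested features, using the default one\n";
    return (XML::SAX::ParserFactory->parser(%args), 0);
}
```

A call such as `parser_with_fallback(['http://xml.org/sax/features/validation'], Handler => $handler)` would then hand back a validating parser if there is one, and some other parser plus a false flag if there isn't. Whether silently degrading is what you want depends on the feature: for validation it probably isn't, for a "fast" hint it probably is.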