This appeared in the discussion board attached to this journal:
Looks like your work on xml parser is involving you in some of the details of LL parsing. You might want to check out www.antlr.org Despite it being a Java tool, it is a nice tool for LL(k) [IMHO, Way better than yacc and the associated lalr grammars]
Actually I find this kind of parsing really easy. The good thing about XML is its 99% suitable for "character at a time" parsing. So the tokeniser is just "give me a character". There aren't many multi-character tokens in XML, the few exceptions being the bits in the DTD (like DOCTYPE, ELEMENT, ENTITY etc). So all I do for that is to read in a character at a time and if it's not what I expect, I buffer what I've read into the Reader class and backup. Actually the reader class does all that for me - I have match() and match_str() methods which do all the work.
So, take a section of the grammar:
document
In perl code this becomes:
sub document {
my ($self, $reader) = @_;
$self->prolog($reader);
$self->element($reader);
while(!$reader->eof) {
$self->Misc($reader);
}
}
Misc is defined as:
Misc
So in perl code this becomes:
sub Misc {
my ($self, $reader) = @_;
if ($self->Comment($reader)) {
return 1;
}
elsif ($self->PI($reader)) {
return 1;
}
elsif ($self->skip_whitespace($reader)) {
return 1;
}
return 0;
}
Now this is actually slightly different from the grammar, in that S is one whitespace character, whereas skip_whitespace consumes all whitespace. But analysis of the grammar shows that Misc never comes in isolation (i.e. it's always Misc*, or (something | else | Misc)*, so consuming all WS is safe.
Finally, the actual parsing (all you really see above is expansion of the grammar into perl code - it's missing the actual parsing of tokens bit, which happens at the leaves of the code).
sub Comment {
my ($self, $reader) = @_;
if ($reader->match_string('match('-')) {
$comment_str
$reader->consume($CharMinusDash) ||
$self->parser_error(...);
$comment_str
}
elsif ($reader->consume($CharMinusDash)) {
$comment_str
}
else {
last;
}
}
$reader->match_str('-->') ||
$self->parser_error(...);
$self->lexhandler_method('comment',
{ Data => $comment_str });
return 1;
}
return 0;
}
I hope that makes sense to people
Anyway, the upshot is that I find this a lot easier to work with than grammar "toolkits" like yacc (or antlr), simply because I'm intimately familiar with Perl. I know how to debug it. Perl is my friend
Unfortunately the XML grammar is a bitch. It's not complete. The XML spec authors decided to punt on a lot of grammar issues and put those bits of the spec in the verbiage, rather than making the grammar complete (because this would confuse people reading the spec, apparently - I fail to see how that's really true!). So now I'm left trying to figure out how you might go about parsing PEReferences (those %foo; things that allow you to modularise DTDs) in the internal and external subset, since the grammar doesn't specify that properly. Bah! Maybe I'll skip that part for now