tinman spent a few years mucking around in industry before going back to school for a Master's. Currently not enjoying the weather in northern England.
He wrote Perl that looked suspiciously like C code in 1998, while working as an intern, and has been trying to cure that bad habit ever since.
So I thought I knew XML. Bah. Fooling around with the TREC competition data told me otherwise.
The problem was pretty simple, or so I thought. Just parse the TREC sample data (all 3GB of it), index it, and then build ever more "intelligent" query parsing functionality on top. The first snag in that grand plan was... the TREC XML data fails to parse!
For one, there is no XML header. But more importantly, the entities used in the documents are defined in an external DTD that the files never reference. If I just throw a document at the parser, it barfs because it can't resolve those entities! Begging and pleading with the Xerces parser didn't help. Nor did using EntityResolver.
So, I cursed Java and its XML parsers and came back to my trusty Perl roots. The standard XML::Parser said the same thing! (On the Java side, methods on the parser instance had assured me that external entities CAN be safely ignored, but the parser doesn't seem to want to do that.) Then I looked at other Perl-based parsers and found XML::LibXML. It specifically has a method that says don't resolve "external_entities". Umm... that didn't seem to work either?
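For anyone who wants to see the complaint first-hand, here is a minimal sketch that reproduces the general failure with XML::Parser. The sample string and the `&hyph;` entity are made up for illustration (they are not copied from the TREC files), and `try_parse` is just a hypothetical helper name.

```perl
use strict;
use warnings;
use XML::Parser;

# Made-up fragment: no XML declaration, no DOCTYPE, and an entity
# (&hyph;) that the document itself never declares.
my $broken = '<DOC><TEXT>first&hyph;rate</TEXT></DOC>';

# Hypothetical helper: returns the parser's error message,
# or an empty string if the document parsed cleanly.
sub try_parse {
    my ($doc) = @_;
    my $parser = XML::Parser->new;
    my $ok = eval { $parser->parse($doc); 1 };
    return $ok ? '' : $@;
}

my $error = try_parse($broken);
print "parse failed: $error\n" if $error;
```

The same fragment with the entity removed should parse without complaint, which points the finger at the undeclared entity rather than the missing header.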
I didn't really want to use an ugly hand-rolled parser (because that is going to break at some point, sooner rather than later). So the only remaining option seemed to be to use HTML::TokeParser and find the tags myself. *sigh* And that, sadly, is the only solution that seems to work.
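Roughly, the tag-hunting approach looks like the sketch below. This is not the exact code I ran: the sample data is made up, the `DOC`/`DOCNO`/`TEXT` tag names are assumed to match the layout of the collection, and `extract_docs` is a hypothetical helper name.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Made-up sample: no XML declaration, several top-level <DOC> elements,
# and an entity (&hyph;) that is never declared in the file.
my $sample = <<'END';
<DOC>
<DOCNO> SAMPLE-0001 </DOCNO>
<TEXT>
A first&hyph;rate document.
</TEXT>
</DOC>
<DOC>
<DOCNO> SAMPLE-0002 </DOCNO>
<TEXT>
Another one.
</TEXT>
</DOC>
END

# Hypothetical helper: returns a list of [docno, text] pairs.
sub extract_docs {
    my ($content) = @_;
    my $p = HTML::TokeParser->new(\$content)
        or die "Cannot create parser";
    my @docs;
    # Tag names are given in lower case; the parser lowercases them.
    while ($p->get_tag('doc')) {
        $p->get_tag('docno') or last;
        my $docno = $p->get_trimmed_text('/docno');
        $p->get_tag('text') or last;
        my $text = $p->get_trimmed_text('/text');
        push @docs, [$docno, $text];
    }
    return @docs;
}

for my $doc (extract_docs($sample)) {
    printf "%s: %s\n", @$doc;
}
```

HTML::TokeParser is a tokenizer rather than a validating parser, so it does not care about the missing header or the missing DTD. For the real files, the constructor also accepts a filename or an open filehandle instead of a string reference, which avoids slurping gigabytes into memory.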
Some days (weeks) it just doesn't seem to pay to get out of bed. With strong winds in York (and ultra cold too *shiver*), this seems to be one of them.