Friday March 17, 2006
10:09 PM
Things Not To Do at 4 pm on Friday on the Internet
I'm not a full-time developer anymore, but I still know a few things.
- Don't ever deploy new code on a production system after lunch.
- ... and never, ever do this on a Friday afternoon.
- Don't parse XML with regular expressions.
- Don't treat newlines as significant in XHTML, except within attribute
values, where the specification says not to use them.
- Don't double-encode text and markup.
- Don't add extra formatting to properly-formatted documents.
- Don't deploy new code without testing it first on actual data.
- Don't tell your users that they have to change the way they publish
information in your system because you just made a change at 4 pm on a Friday
afternoon that treats newlines as significant in XHTML though nothing else on
the Internet does so, tell your users to fix all of their existing data, and
then go home before they can say "Uh... that's broken. Why did you do
this?"
- Don't remove a feature that lets people who know what they're doing
disable all of this "helpful" magic, and don't turn on the magic by default
if it changes their existing data.
- If you ignore everything else, don't claim "Oh, you can still write valid
XHTML. Just don't use paragraph tags and let the system put them in
automatically."
I really don't have words for this except "BY DEFAULT?", "4 PM?!", and
"FRIDAY AFTERNOON?!?!"
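To make the double-encoding item in the list concrete, here is a minimal Python sketch. The helper name `encode_once` and the sample string are made up for illustration; nothing here is taken from the system being complained about. It only shows what happens when already-escaped text gets escaped a second time.

```python
import html

def encode_once(text: str) -> str:
    """Escape &, < and > for inclusion in markup (hypothetical helper)."""
    return html.escape(text, quote=False)

raw = "Fish & chips <today>"
once = encode_once(raw)
twice = encode_once(once)  # the bug: escaping text that is already escaped

print(once)   # Fish &amp; chips &lt;today&gt;
print(twice)  # Fish &amp;amp; chips &amp;lt;today&amp;gt;

# Decoding the singly-encoded text gets the original back;
# decoding the doubly-encoded text once only gets you to `once`.
print(html.unescape(once) == raw)    # True
print(html.unescape(twice) == once)  # True
```

A reader's browser decodes exactly once, so the doubly-encoded version shows up on the page as literal `&amp;` and `&lt;` instead of `&` and `<`.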
XML Regexes (Score:2)
A very good list of truisms. However, there is the subtlest of subtle flaws in this list -- all categorical statements (including this one) are false. For example:
Sometimes you do want to parse XML with regexes, but only in the most controlled of circumstances. Usually this involves munging huge quantities of data that are very rigidly formatted. If you can fully control the structure of XML inputs, and you tend to be reading inputs line-by-line (or bloc
Re:XML Regexes (Score:1)
In the case to which I allude, I assume that the code (I have not seen it) processes the XHTML line-by-line. This is a problem because the XML specification allows newline characters as valid whitespace characters within tags. This is a big problem because the input comes from arbitrary sources.
Parsing this XHTML without a stack or state machine somewhere is problematic.
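A small Python sketch of the failure mode, under the same assumption as above that the code works line by line (I still have not seen it). The input, the pattern, and the function names are all invented for illustration. The start tag is split across two lines, which XML allows, so a per-line regex never sees a complete tag, while a real parser that keeps state across lines does.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical input: the <a> start tag spans two lines. XML permits
# newlines as whitespace between the tag name and its attributes.
doc = '<p>See <a\n   href="http://example.com/">this link</a> for details.</p>'

# Naive approach: look for a complete <a ...> start tag within one line.
START_A = re.compile(r'<a\s+[^>]*>')

def links_line_by_line(text):
    """Return <a> start tags found when each line is scanned on its own."""
    found = []
    for line in text.splitlines():
        found.extend(START_A.findall(line))
    return found

def links_with_parser(text):
    """Return href values using an XML parser, which tracks state across lines."""
    root = ET.fromstring(text)
    return [a.get("href") for a in root.iter("a")]

print(links_line_by_line(doc))  # []
print(links_with_parser(doc))   # ['http://example.com/']
```

The per-line scan finds nothing because neither line contains a whole start tag on its own, even though the document is perfectly well-formed.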
Re:XML Regexes (Score:2)
Yep. No wiggle room on that. That's as hard and fast a rule as don't divide an integer by zero.
Re:XML Regexes (Score:1)
perl -e 'print "Just another Perl ${\(trickster and hacker)},";'
The Sidhekin proves that Sidhe did it!
Re:XML Regexes (Score:2)
Re:XML Regexes (Score:1)
There is a paper, http://www.cs.sfu.ca/~cameron/REX.html, which develops the regex for parsing XML.
Re:XML Regexes (Score:1)
Note that those patterns parse simple XML, not XML with namespaces. Parsing XML with namespaces purely using pattern matching is probably possible too, but it’d be a whole hell of a lot harder, and the patterns would be far nastier monstrosities than the manageable beasts from that paper.
Re:XML Regexes (Score:1)
I suspect they aren't suitable for doing interesting operations. They could be used for stuff that works on the chunks, like removing comments.
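As a rough Python sketch of that kind of chunk-level operation: the pattern below is a deliberately simplified stand-in, NOT the patterns from the paper. It ignores CDATA sections, processing instructions, and `>` inside attribute values, and the names `CHUNK`, `chunks`, and `strip_comments` are made up here.

```python
import re

# Simplified chunk pattern (an assumption, not the paper's regex):
# a comment, or any other markup between < and >, or a run of text.
# The comment alternative comes first so "<!-- a > b -->" stays one chunk.
CHUNK = re.compile(r'<!--.*?-->|<[^>]*>|[^<]+', re.DOTALL)

def chunks(xml_text):
    """Split the input into markup and text chunks, in document order."""
    return CHUNK.findall(xml_text)

def strip_comments(xml_text):
    """Drop comment chunks and glue everything else back together."""
    return "".join(c for c in chunks(xml_text) if not c.startswith("<!--"))

sample = '<p>one<!-- hidden\nnote --> two <br/>three</p>'
print(chunks(sample))
# ['<p>', 'one', '<!-- hidden\nnote -->', ' two ', '<br/>', 'three', '</p>']
print(strip_comments(sample))
# <p>one two <br/>three</p>
```

Nothing here knows about nesting or well-formedness, which is the point: it works on the chunks and nothing more.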
Re:XML Regexes (Score:1)
Well, or building a full-fledged parser on top. That’s not a very large step from there.