
Journal of chromatic (983)

Friday March 17, 2006
10:09 PM

Things Not To Do at 4 pm on Friday on the Internet

[ #29031 ]

I'm not a full-time developer anymore, but I still know a few things.

  • Don't ever deploy new code on a production system after lunch.
  • ... and never, ever do this on a Friday afternoon.
  • Don't parse XML with regular expressions.
  • Don't treat newlines as significant in XHTML, except within attribute values, where the specification says not to use them.
  • Don't double-encode text and markup.
  • Don't add extra formatting to properly-formatted documents.
  • Don't deploy new code without testing it first on actual data.
  • Don't tell your users that they have to change the way they publish information in your system because you just made a change at 4 pm on a Friday afternoon that treats newlines as significant in XHTML though nothing else on the Internet does so, tell your users to fix all of their existing data, and then go home before they can say "Uh... that's broken. Why did you do this?"
  • Don't remove a feature that lets people who know what they're doing disable all of this "helpful" magic and turn on the magic by default if it changes their existing data.
  • If you ignore everything else, don't claim "Oh, you can still write valid XHTML. Just don't use paragraph tags and let the system put them in automatically."

I really don't have words for this except "BY DEFAULT?", "4 PM?!", and "FRIDAY AFTERNOON?!?!"
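To make the newline point concrete, here is a minimal sketch in Python (chosen only for illustration). The sample document and the function names are hypothetical, and the line-by-line scanner is a guess at the kind of processing being complained about, not the actual code. It shows a start tag split across two lines, which is legal XML, being missed by a per-line regex and handled without fuss by a real parser.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample: a start tag whose attribute sits on the next line.
# Whitespace inside a tag, including a newline, is legal XML.
DOC = '<div>\n<p\n  class="intro">Hello</p>\n<p>World</p>\n</div>'

# Naive approach: look for a complete <p ...> start tag on a single line.
START_P = re.compile(r'<p(\s[^>]*)?>')


def count_line_by_line(text):
    """Count lines that contain a complete <p> start tag."""
    return sum(1 for line in text.splitlines() if START_P.search(line))


def count_with_parser(text):
    """Count <p> elements using a real XML parser."""
    return len(ET.fromstring(text).findall('.//p'))


if __name__ == '__main__':
    print(count_line_by_line(DOC))  # the split tag is invisible per line
    print(count_with_parser(DOC))   # the parser sees both paragraphs
```

The per-line scan finds only the paragraph whose start tag happens to fit on one line; the parser does not care where the newlines fall.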

  • A very good list of truisms. However, there is the subtlest of subtle flaws in this list -- all categorical statements (including this one) are false. For example:

    Don't parse XML with regular expressions.

    Sometimes you do want to parse XML with regexes, but only in the most controlled of circumstances. Usually this involves munging huge quantities of data that are very rigidly formatted. If you can fully control the structure of XML inputs, and you tend to be reading inputs line-by-line (or bloc

    • In the case to which I allude, I assume that the code (I have not seen it) processes the XHTML line-by-line. This is a problem because the XML specification allows newline characters as valid whitespace characters within tags. This is a big problem because the input comes from arbitrary sources.

      Parsing this XHTML without a stack or state machine somewhere is problematic.

      • So the real dictum is "Don't parse arbitrary XML with regular expressions."

        Yep. No wiggle room on that. That's as hard and fast a rule as "don't divide an integer by zero." :-)
        • Shouldn't that be "don't divide an integer by an arbitrary number"?
          perl -e 'print "Just another Perl ${\(trickster and hacker)},";'
          The Sidhekin proves that Sidhe did it!
    • It is possible to parse arbitrary XML with regular expressions. However, it can't be done line-by-line because tags can contain newlines. It must be done on the whole file (or have some smart buffering).

      There is a paper, [], which develops the regex for parsing XML.

      • Note that those patterns parse simple XML, not XML with namespaces. Parsing XML with namespaces purely using pattern matching is probably possible too, but it’d be a whole hell of a lot harder, and the patterns would be nasty monstrosities far more so than the manageable beasts from that paper.

        • They will parse XML with namespaces. But they only break the XML into pieces; tags, comments, text, etc. They don't handle the pieces like breaking tags into names and attribute values. They don't handle resolving namespace prefixes into canonical names.

          I suspect they aren't suitable for doing interesting operations. They could be used for stuff that works on the chunks, like removing comments.
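The "break it into pieces" idea from this thread can be sketched roughly as follows, in Python. This is a simplified illustration, not the patterns from the paper mentioned above: the `TOKEN` regex and the `tokenize` and `strip_comments` helpers are hypothetical names, and the tag pattern assumes well-formed input with no literal `>` inside attribute values and no DOCTYPE internal subset. The point is only that matching against the whole document, rather than line by line, lets tags and comments span newlines.

```python
import re

# Simplified shallow tokenizer (an assumption for illustration only).
# re.DOTALL lets "." cross newlines, so comments may span several lines.
TOKEN = re.compile(
    r'(?P<comment><!--.*?-->)'
    r'|(?P<cdata><!\[CDATA\[.*?\]\]>)'
    r'|(?P<pi><\?.*?\?>)'
    r'|(?P<tag><[^>]*>)'
    r'|(?P<text>[^<]+)',
    re.DOTALL,
)


def tokenize(xml_text):
    """Split the whole document into (kind, raw_text) chunks."""
    return [(m.lastgroup, m.group()) for m in TOKEN.finditer(xml_text)]


def strip_comments(xml_text):
    """Rebuild the document from every chunk except comments."""
    return ''.join(tok for kind, tok in tokenize(xml_text) if kind != 'comment')


# Hypothetical input: both the start tag and the comment contain newlines.
SAMPLE = '<a\n  href="x"><!-- multi\nline --><b>hi</b></a>'

if __name__ == '__main__':
    print([kind for kind, _ in tokenize(SAMPLE)])
    print(strip_comments(SAMPLE))
```

As noted above, this only yields chunks. It does not split tags into names and attributes or resolve namespace prefixes, so anything beyond chunk-level work such as dropping comments still calls for a real parser.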