
All the Perl that's Practical to Extract and Report

  • Reading your post, I had an "I wonder if..." thought and checked the bug report. Yup. Eleven months ago I found and reported the same bug, noting that bug 27 had the fix, and adding a pointer to the W3C recommendation. Sad that the (one-line) fix hasn't been applied, since URI is part of the core distribution.
    • URI is part of the core distribution

      It's not part of the core perl distribution. It may be part of Activestate's distribution, but that's not the same thing.

      • What comes with the ActiveState distribution is "what comes with Perl" for a very large swath of the user community. But you're right, it isn't core. I stand corrected.
    • I looked at the fix to the URI module, and after about an hour stopped working on it. There are several problems with the one-character patch:

      * It only breaks apart URIs, it doesn't put them back together

      * The parser needs to break on either a ; or a &, not both of them at the same time. Although there shouldn't be both, I'm painfully aware that "shouldn't be" means "is".

      * There is no way for the programmer to tell URI which delimiter to use. This is the rather troublesome part because it has implications
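To make the three points above concrete, here is a minimal sketch in Python rather than Perl. It is not the URI module's API; `parse_query` and `build_query` are hypothetical names made up for illustration. The parser splits on either delimiter, and the builder lets the caller say which delimiter to emit.

```python
import re
from urllib.parse import quote_plus, unquote_plus


def parse_query(query):
    """Split a query string on either ';' or '&' and decode each pair."""
    pairs = []
    for field in re.split(r"[;&]", query):
        if not field:
            continue  # skip empty fields such as the gap in "a=1&;b=2"
        key, _, value = field.partition("=")
        pairs.append((unquote_plus(key), unquote_plus(value)))
    return pairs


def build_query(pairs, sep="&"):
    """Join key/value pairs using the delimiter the caller picks."""
    if sep not in ("&", ";"):
        raise ValueError("separator must be '&' or ';'")
    # quote_plus percent-encodes literal ';' and '&' inside keys and values,
    # so the chosen separator stays unambiguous when parsing again.
    return sep.join(f"{quote_plus(k)}={quote_plus(v)}" for k, v in pairs)
```

Because the parser accepts both delimiters, a string that mixes them (the "shouldn't be" case) still comes apart cleanly, while generation stays an explicit choice made by the caller.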
      • We had need of scanning URLs, not generating them. So, I'm embarrassed to admit, I completely ignored the generation issue when figuring out a one-line patch and generating the bug report. Generating URLs is more complicated, because you'd need a way to specify whether you're going to emit them into HTML or XHTML. And the W3C recommendation isn't crystal clear on what the rules are. Oh yeah, and tests.


        • Generating URLs is more complicated, because you'd need a way to specify whether you're going to emit them into HTML or XHTML.

          That appears to make no sense whatsoever.

          • More context. If you're generating a URL to go into HTML, you typically use & to separate parameters. For XHTML, if you're playing by the rules, you have the option of using & or ;

            Surprised me, too, but it's in the W3C recommendation.

            • But that rule applies to any content you put in XHTML or HTML documents. The fact that it’s a URI is a red herring.

              Putting entity escaping into the URI processing code is bad distribution of responsibilities. It is the caller’s job to put the URI through entity escaping when the output necessitates it.
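The split of responsibilities described here can be sketched roughly as follows, in Python for illustration. The `link` helper and the example.com URL are hypothetical; the point is only that the URL is built without any knowledge of HTML, and escaping happens once, at the point where the markup is written.

```python
from html import escape
from urllib.parse import urlencode


def link(url, text):
    """Render an anchor tag, escaping both the attribute value and the text."""
    return f'<a href="{escape(url, quote=True)}">{escape(text)}</a>'


# The URL layer knows nothing about HTML: it emits a plain "&".
url = "http://example.com/search?" + urlencode([("q", "perl"), ("page", "2")])

# The output layer escapes whatever it is handed, URL or not.
markup = link(url, "R&D results")
```

The same `escape` call handles the link text and the URL alike, which is the sense in which the URI is nothing special from the markup's point of view.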

              • HTML doesn't require that the ampersand (when used to separate key/value pairs) be escaped in hrefs; XHTML does.

                See []

                • Arg. That was supposed to be a preview, and not a post. I'd intended to add that while ensuring correct escaping for XHTML is the programmer's responsibility, adding another layer of escaping into the pipeline after URI, rather than having a flavor of URI that knows how to escape for XHTML, seems to me as though it's putting the burden in the wrong place.

                  • That makes the least sense yet. If the program is outputting URIs in HTML, it is outputting HTML, and so it has to deal with properly escaping content in other contexts anyway. What differentiates URIs from other content such that piercing the separation of concerns is sensible in their case?

                  • Except that the ampersand does need to be escaped in some cases. "&foo=bar" is interpreted correctly only when the "foo" entity does not exist. This is a fallback. But "&gt=bar" is interpreted as ">=bar". This creates subtle bugs when HTML entity names are used. XHTML is more strict and does not have the entity fallback or the minimal format.

                    The advantage of using semicolon is that a properly encoded URI is always valid HTML or XHTML and doesn't need to be escaped. The downside is that s
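A rough illustration of the entity pitfall, using Python's `html.unescape` as a stand-in for a lenient entity decoder. That is an assumption made for the sake of the example: real browsers follow their own rules, and they may treat text inside attribute values differently from what this decoder does.

```python
from html import escape, unescape

# "&foo" matches no known entity name, so the decoder leaves it alone.
# This is the fallback behaviour.
fallback = unescape("?a=1&foo=bar")

# "&gt" is a known entity name, so this decoder consumes it even though
# the terminating ";" is missing.
mangled = unescape("?a=1&gt=bar")

# With ";" as the separator there is no "&" for HTML escaping to touch,
# so the query string passes through unchanged.
semicolon_query = "?a=1;gt=bar"
unchanged = escape(semicolon_query, quote=True)
```

Here `fallback` comes back untouched, `mangled` has silently lost its `gt` parameter name, and `unchanged` is identical to the semicolon-separated input.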

                • It should be escaped in HTML too. The fact that it mostly works if you don't escape it is thanks to browsers that try to accept anything people throw at them and make sense of it.

                  But, like someone else said, the HTML escaping has nothing to do with the fact that it's a URL. Any attribute of an HTML tag ought to be escaped. It is an extra layer on top of the content, but it is not part of the content itself. For example, the content of the attribute bar in the tag <foo bar="a&amp;b"> is "a