
All the Perl that's Practical to Extract and Report

use Perl

brianiac (4158)


Journal of brianiac (4158)

Thursday July 07, 2005
12:24 AM

Entify Your HTML!

[ #25558 ]

Updated. Reposted from my other journal:

To the embarrassingly uninformed third-party vendors of web-based applications, I present a quick look at HTML entities. This is Chapter One stuff in even the most basic HTML book, but I still get puzzled, dismissive, and even indignant replies when I request fixes for simple HTML bugs.

Three important characters: < > &

These characters are special to HTML for processing. In the text or attribute values of a page, you must use the entities that stand for them: &lt; &gt; &amp; (respectively). In attributes, " should also be replaced with &quot; (you can also use &quot; in text, but it isn't a requirement).

The Web Is A Big Place

If you forget to entify your special characters, some browsers will sometimes let you get away with it. If you intend to produce code for the widest possible audience (which is the whole point of the Internet, after all), it is best not to assume your indiscretions will always go unnoticed; better to do it right to start with, so you won't have to double-check every support call ($$$) to see if unentified HTML is part of the problem.

Unentified HTML Is Insecure HTML

Most Cross-Site Scripting (XSS) attacks exploit unentified HTML, and those can be prevented using entities. The liability of such an attack, though potentially considerable, is nothing compared to the loss of client trust.
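
As a rough sketch of the problem (in Python, with hypothetical render_comment_* helpers and a made-up input, not taken from any vendor's code), compare pasting user input into a page raw with entifying it first via the standard library's html.escape:

```python
from html import escape


def render_comment_unsafe(comment: str) -> str:
    # Hypothetical template: user input is pasted into the page as-is.
    return "<p>" + comment + "</p>"


def render_comment_safe(comment: str) -> str:
    # Same template, but the input is entified before it reaches the page.
    return "<p>" + escape(comment) + "</p>"


# Made-up hostile input of the kind an attacker might submit in a form field.
attack = "<script>alert(document.cookie)</script>"

unsafe_page = render_comment_unsafe(attack)
safe_page = render_comment_safe(attack)

print(unsafe_page)  # the browser sees a real <script> element
print(safe_page)    # the browser sees harmless text
```

In the unsafe version the visitor's browser runs the attacker's script; in the safe version it just displays the text.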

It's Easy

Nearly every web development language has a single function you can call to entify the contents of string or text variables (numeric and date/time variables do not typically require escaping), e.g. Server.HTMLEncode() in Active Server Pages or htmlentities() in PHP. In cases where the language does not provide such a function, writing one is trivial: four search-and-replace calls (do the ampersand first).
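
For the do-it-yourself case, the four search-and-replace calls might look something like this in Python. The name entify is just an illustration, and Python itself already ships html.escape for this job:

```python
def entify(text: str) -> str:
    # Ampersand first: otherwise the "&" in the entities produced below
    # would itself be escaped, turning "&lt;" into "&amp;lt;".
    text = text.replace("&", "&amp;")
    text = text.replace("<", "&lt;")
    text = text.replace(">", "&gt;")
    text = text.replace('"', "&quot;")
    return text


def entify_wrong_order(text: str) -> str:
    # Same replacements with the ampersand last, to show what goes wrong.
    text = text.replace("<", "&lt;")
    text = text.replace(">", "&gt;")
    text = text.replace('"', "&quot;")
    text = text.replace("&", "&amp;")
    return text


sample = '<a href="x?a=1&b=2">Q&A</a>'
print(entify(sample))
print(entify_wrong_order("<"))  # double-escaped: &amp;lt;
```

The second function is there only to show why the order matters: escape the ampersand last and every entity you just produced gets mangled.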

It just kills me how often I see unencoded HTML (of the severity that actually breaks things), and how defensive companies get when it's pointed out. As if it were a lengthy or difficult fix.

The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • Are you saying ' and " must not be single octets anywhere in the text of HTML, but rather, that &apos; and &quot; must be used? What part of the HTML spec did you get that from? In attribute values, yes, but in the text, there's no such requirement.
    • These are subtleties that I see no need to try to explain to people that resist even encoding . This is not a normative reference work, but a rule of thumb intended for an audience with a poor track record of understanding and accepting standards.
      • "...even encoding < and > ."

        Apparently this is some strange new usage of "Plain Old Text" that I was not previously aware of.

        • Plain Old Text still allows HTML, but it takes care of newlines, and it's worked that way for years in Slash without too many complaints.

          We'd get a lot more if it worked as you wanted it to. You wanted Extrans. Which has a lame name. But oh well.
      • It's not a subtlety: there is simply no need to encode ' and " anywhere but in attributes.

        Also, &apos; is illegal in HTML. It's only a named entity in XHTML.
        • But not even all attributes require &quot;, only attributes that use " as a delimiter. My original intent was a single clear rule, without dithering or qualifying, that could not be ignored on the excuse of complexity.

          I work with several third-party vendors in the financial sector, who have a great deal to learn about web development. This is meant to be simple enough to remember for those writing home banking, bill payment, loan application, and other web apps, but also apps fr

          • I'd say a better rule is to not delimit attributes with ', and use only ", as God intended. :-)

            As to why &apos; is missing, no idea, but yeah, it seems lame. But as everyone seems to accept it (except for proper strict validators that know about entities), it's not something I'd personally care too much about. Just offering it as a footnote.
      • So people are resisting what? A rule of thumb?