petdance (2468)

petdance
  andy@petdance.com
http://www.perlbuzz.com/
AOL IM: petdance
Yahoo! ID: petdance
Jabber: petdance@gmail.com

I'm Andy Lester, and I like to test stuff. I also write for the Perl Journal, and do tech edits on books. Sometimes I write code, too.

Journal of petdance (2468)

Thursday June 12, 2003
10:36 AM

Automatic HTML validity checking

[ #12768 ]
I don't mean to toot my own horn, but I've gotta spread the gospel that HTML::Lint and its corresponding weblint wrapper are pretty darn useful. Every so often someone will ask me, "Hey, can you look at my site and make sure that it's OK?" The first thing I do is run weblint on it to check that the HTML is reasonably clean.

As an example, I ran it on Randal's website:

$ weblint http://www.stonehenge.com/merlyn
http://www.stonehenge.com/merlyn (213:5) <IMG> tag has no HEIGHT and WIDTH attributes.
http://www.stonehenge.com/merlyn (290:279) <IMG> tag has no HEIGHT and WIDTH attributes.
http://www.stonehenge.com/merlyn (293:1) <td> at (292:6) is never closed

Nothing very serious, since most browsers will handle the unclosed TD tag, and the IMG HEIGHT & WIDTH are just rendering helpers. Still, they're worth fixing.
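
If you'd rather call the module than the command-line wrapper, here's a minimal sketch of the same check from Perl. The lint_messages helper and the sample markup are made up for illustration; the HTML::Lint calls (new, parse, eof, errors, as_string) are the module's own as I understand them, and the exact messages may vary between versions.

```perl
use strict;
use warnings;
use HTML::Lint;

# Hypothetical helper: return one readable message per problem found in $html.
sub lint_messages {
    my ($html) = @_;

    my $lint = HTML::Lint->new;
    $lint->parse( $html );
    $lint->eof;    # tell the parser the document is complete

    return map { $_->as_string } $lint->errors;
}

# Made-up sample page: the <b> is never closed and the <img> has no size.
my $sample = <<'HTML';
<html><head><title>Sample</title></head>
<body><div><b>Welcome<img src="logo.png" alt="Logo"></div></body>
</html>
HTML

print "$_\n" for lint_messages( $sample );
```

Each message carries the same (line:column) position that weblint prints, so the two are interchangeable for spot checks.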

Here's another example from someone still fixing up the pages for his upcoming book:

http://site/ (5:79) <link> is not a container -- </link> is not allowed
http://site/ (208:1) <form> at (198:1) is never closed
http://site/ (210:1) </form> with no opening <form>
http://site/ (225:6) <td> at (39:1) is never closed
http://site/ (225:6) <div> at (40:1) is never closed
http://site/ (225:6) <table> at (35:1) is never closed

Here, the problems could cause real rendering issues. Older Netscapes would just freak out on unclosed tables and refuse to draw. The pair of FORM tag mismatches is probably a nesting issue.

Finally, here's a .t file for those of you with automated test suites to make sure that all the HTML files in a project have valid HTML. This is invaluable to me during the day job, because even the WYSIWYG tools that the guys up in Marketing use don't always turn out compliant HTML. If someone puts in a bad HTML file, the hourly smokebot will notice it and fire off an email to me.

#!/usr/bin/perl -w

use strict;
use Test::More;
use Test::HTML::Lint;
use File::Spec;
use File::Find::Rule;

# Find every *.html file under the start path, skipping CVS directories.
my $startpath = '.';
my $rule = File::Find::Rule->new;
$rule->or( $rule->new->directory->name('CVS')->prune->discard,
           $rule->new->file->name('*.html') );
my @html = $rule->in( $startpath );

my $nfiles = scalar @html;

# One test per HTML file.
plan( tests => $nfiles );

for my $filename ( @html ) {
    # An unreadable file counts as a failed test, not a crash.
    open( my $fh, '<', $filename ) or do {
        fail( "Couldn't open $filename: $!" );
        next;
    };

    # Slurp the whole file in one read.
    local $/ = undef;
    my $text = <$fh>;
    close $fh;

    html_ok( $text, $filename );
}
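
If the missing HEIGHT and WIDTH warnings shouldn't break the build, one option is to hand html_ok a lint object that only reports structural errors. This is a sketch under assumptions: it relies on html_ok accepting an HTML::Lint object as an optional first argument and on the only_types constructor parameter, both as I understand the modules' documentation, and the sample page is invented.

```perl
use strict;
use warnings;
use Test::More;
use HTML::Lint;
use Test::HTML::Lint;

plan( tests => 1 );

# Only structural problems (unclosed tags, bad nesting) should fail the test.
my $lint = HTML::Lint->new( only_types => HTML::Lint::Error::STRUCTURE );

# Made-up page: well-formed, but the <img> has no HEIGHT or WIDTH.
my $html = <<'HTML';
<html><head><title>Sample</title></head>
<body><p><img src="logo.png" alt="Logo"></p></body>
</html>
HTML

html_ok( $lint, $html, 'sample page is structurally sound' );
```

In the loop above, the same $lint object could be passed on every html_ok call, so Marketing's pages fail only for the problems that can actually break rendering.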

  • by vsergu (505) on 2003.06.12 10:43 (#21033)

    ... even the WYSIWYG tools that the guys up in Marketing use don't always turn out compliant HTML.

    And even steakhouses don't always have good vegetarian food.

  • One of the really cunning ideas that somebody here came up with is automatic XHTML validation. In development mode, our top-level autohandler (we use Mason for our sites) has a filter section which passes the entire page through nsgmls. If there are any errors, it inserts them back into the page with a quick regex.

    This has been a real boon for developing a correctly validating site. Otherwise, we'd have to wait for our web designer to run the page through the validator later on and then bitch at us to fix our code. Instant feedback rocks.

    -Dom

    • Wow, what a great idea. I'm going to hack that into my current project right now.

      -sam

      • Actually, that's what Apache::Lint [cpan.org] is intended to do. It kinda works but I'm having problems with the Apache::Filter chains eating HTTP response codes. On very simple stuff, though, it seems to work OK.
        --
        xoa

        • I worked out a pretty slick usage with CGI::Application's new cgiapp_postrun() method. If HTML::Lint detects errors, then I put some JavaScript into the outgoing page to pop open a small new window with the error text nicely formatted. I also tried it as a JavaScript alert, but for more than a few errors that gets hard to read.

          Now, let's see if the HTML dudes even want it! Even if they don't, I might keep it around to help me find their mistakes more easily.

          Thanks!
          -sam

          • Does it show context? weblint does, because that makes it a LOT easier to find the problems in big HTML files. Should HTML::Lint keep context as a convenience?
            --
            xoa

          • JavaScript! New window! Fab idea! This fixes one of the problems that I have with the present system: the line numbers are off because we've inserted a bunch of extra lines at the beginning. We worked around this by writing the source file out to /tmp so we can go in and look at it, but it's not ideal...

            -Dom

        • The only reason that I mentioned nsgmls rather than HTML::Lint is that our site is meant to be XHTML-compliant, and we're used to using nsgmls. It looked like HTML::Lint only did HTML4. I should probably send you a patch...

          Not only that, but I've realised that XML::LibXML does DTD validation now; it should be able to do the same checking in memory rather than expensively spawning a copy of nsgmls.

          Hmmm... There is much work to do today!

          -Dom

  • I added the following to SciTE (on XP):

        command.name.0.$(file.patterns.html)=Weblint
        command.0.$(file.patterns.html)=weblint.bat \ $(FileNameExt)

    Now I can validate my HTML on the fly!!!
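
The "show the errors in the page while developing" idea from the thread above can be sketched without tying it to Mason or CGI::Application. This is only an illustration, not anyone's actual filter: annotate_with_lint_errors and the lint-errors class name are hypothetical, and a real filter or postrun hook would call something like it on the outgoing page.

```perl
use strict;
use warnings;
use HTML::Lint;

# Hypothetical helper: lint $html and, if problems turn up, splice an escaped
# report in just before </body> so it is visible during development.
sub annotate_with_lint_errors {
    my ($html) = @_;

    my $lint = HTML::Lint->new;
    $lint->parse( $html );
    $lint->eof;

    my @errors = map { $_->as_string } $lint->errors;
    return $html unless @errors;

    # Escape the messages, since they quote tags such as <b>.
    for ( @errors ) {
        s/&/&amp;/g;
        s/</&lt;/g;
        s/>/&gt;/g;
    }
    my $report = "<pre class=\"lint-errors\">\n"
               . join( "\n", @errors )
               . "\n</pre>\n";

    # Insert before the closing body tag, or append if there isn't one.
    $html =~ s{(</body>)}{$report$1}i or $html .= $report;
    return $html;
}
```

Because the report goes in at the end of the page, the line numbers in the messages still match the source that was linted.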