In the WWW::Mechanize talk I gave at this year's YAPC::Europe and NPW, I described how I pass invalid HTML through the command-line tool tidy before handing it to XML::LibXML to process.
I mention that I don't use HTML::Tidy because it doesn't actually clean the HTML, it just checks for warnings. At least, that's what I thought.
Robbie, who I work with, has just shown me some code where he calls HTML::Tidy's clean method to do exactly this. In my defence, the documentation confused me by saying this method returns true, whereas it actually returns the cleaned content, which happens to evaluate to true. I should get into the habit of reading documentation on AnnoCPAN, which mentions this.
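For the record, here is roughly what doing it in-process looks like. This is a minimal sketch rather than Robbie's actual code: the sample markup is made up, and I'm assuming a version of XML::LibXML recent enough to provide load_html (older versions would use the parser's parse_html_string instead).

```perl
use strict;
use warnings;
use HTML::Tidy;
use XML::LibXML;

# Made-up invalid input: the <p> and <b> elements are never closed
my $html = '<html><head><title>Test</title></head>'
         . '<body><p>Hello <b>world</body></html>';

my $tidy = HTML::Tidy->new;

# clean() returns the tidied document as a string,
# not merely a true value
my $cleaned = $tidy->clean($html);

# Hand the cleaned markup to XML::LibXML's HTML parser
my $doc = XML::LibXML->load_html( string => $cleaned );

# Query the resulting DOM with XPath as usual
print $doc->findvalue('//b'), "\n";
```

The point is simply that the return value of clean is the thing to keep, so there is no need to shell out to tidy and read its output back in.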
I hope I haven't encouraged too many people to use a separate process to do something a CPAN module already does. The module's name makes its purpose clear enough.