
Ovid (2709)


Stuff with the Perl Foundation. A couple of patches in the Perl core. A few CPAN modules. That about sums it up.

Journal of Ovid (2709)

Thursday December 19, 2002
07:24 PM

HTML regex engine

[ #9561 ]

Well, no, I don't have an HTML regex engine, but I needed to do this:


Any programmer who's worked with HTML for more than about three seconds has quickly discovered that this is not a viable option. Is the HTML well-formed? Does it have extra whitespace? Did they quote their attributes properly? Is the attribute case consistent? It's a frustration.

Today, after a fair amount of searching and asking questions in the Perlmonks chatterbox, I discovered that I simply couldn't find an adequate tool to do this. Some of the treebuilder tools looked interesting, but they didn't offer the fine-grained control I needed, so I built my own. It's not done, but so far I can tell it to match one document structure against another and, if they match, replace the target HTML with new HTML. If I wish, I can ignore attributes entirely, require them to appear in a given order, or compare them without regard to order. Next I'm going to start on the text-matching portion. It's been fun. Perhaps this is a CPAN module in the works?
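
The three attribute options mentioned above (ignore attributes, require a fixed order, or ignore order) can be sketched roughly as follows. This is not the tool's code, which the post does not show. It is a small illustration in Python, with the standard library's html.parser standing in for a Perl tokenizer. The names StartTagCollector, first_tag, tags_match and attr_mode are all made up for the example.

```python
from collections import Counter
from html.parser import HTMLParser


class StartTagCollector(HTMLParser):
    """Collect (tag, attrs) for every start tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names and strips quotes,
        # which takes care of inconsistent case and quoting.
        self.tags.append((tag, attrs))


def first_tag(html):
    """Return the first start tag in html as (tag, [(name, value), ...])."""
    parser = StartTagCollector()
    parser.feed(html)
    parser.close()
    return parser.tags[0]


def tags_match(a, b, attr_mode="unordered"):
    """Compare two (tag, attrs) pairs.

    attr_mode is an assumption of this sketch:
      "ignore"    - tag names only
      "ordered"   - attributes must match in the same order
      "unordered" - attributes must match in any order
    """
    (tag_a, attrs_a), (tag_b, attrs_b) = a, b
    if tag_a != tag_b:
        return False
    if attr_mode == "ignore":
        return True
    if attr_mode == "ordered":
        return attrs_a == attrs_b
    if attr_mode == "unordered":
        # Counter compares as a multiset, so order does not matter.
        return Counter(attrs_a) == Counter(attrs_b)
    raise ValueError(f"unknown attr_mode: {attr_mode!r}")


sample = first_tag('<A HREF="x.html" class=nav>')
target = first_tag("<a class='nav' href='x.html'>")
print(tags_match(sample, target, "unordered"))  # True
print(tags_match(sample, target, "ordered"))    # False
```

Letting the parser normalize case and quoting first means the comparison itself only has to decide how strict to be about attributes.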

Of course, if my past experiences with use.perl are any indication, someone's going to say "here's what you were looking for". I'd welcome that as I'm curious to see how my tool stacks up.

The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • As I stated on Perlmonks today ;--): Did you try tidy?
    • I didn't see your reply on Perlmonks! In any event, I'm using HTML::TokeParser::Simple to get around the problems with bad HTML, and it's worked fine. My basic method is to parse my sample HTML into a bunch of tokens that I store in an array. Then I do the same thing with the target HTML and if, at any point, I have a matching token stream, I do the replacement. So far, it's worked out much better than I thought.
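
The token-stream method described in the reply above might look something like the sketch below. It is an illustration only, not Ovid's code: it is written in Python with the standard library's html.parser in place of HTML::TokeParser::Simple, and TokenStream, find_match and replace_first are hypothetical names. It is assumed here that attributes are ignored, that whitespace-only text and comments never take part in a match, and that only the first match is replaced.

```python
from html.parser import HTMLParser


class TokenStream(HTMLParser):
    """Tokenize HTML into (key, start_offset) pairs.

    key is a normalized value used for matching, or None for tokens that
    are skipped when matching (whitespace-only text, comments).
    """

    def __init__(self, html):
        super().__init__()
        # Offset of the first character of each line, to convert getpos().
        self.line_starts = [0]
        for i, ch in enumerate(html):
            if ch == "\n":
                self.line_starts.append(i + 1)
        self.tokens = []
        self.feed(html)
        self.close()

    def _add(self, key):
        line, col = self.getpos()  # 1-based line, 0-based column
        self.tokens.append((key, self.line_starts[line - 1] + col))

    def handle_starttag(self, tag, attrs):
        self._add(("start", tag))  # attributes ignored in this sketch

    def handle_endtag(self, tag):
        self._add(("end", tag))

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        self._add(("text", text) if text else None)

    def handle_comment(self, data):
        self._add(None)


def find_match(target, pattern_keys):
    """Return (first, last) token indexes of the first match, or None."""
    significant = [(i, key) for i, (key, _) in enumerate(target)
                   if key is not None]
    n = len(pattern_keys)
    for s in range(len(significant) - n + 1):
        window = significant[s:s + n]
        if [key for _, key in window] == pattern_keys:
            return window[0][0], window[-1][0]
    return None


def replace_first(target_html, sample_html, new_html):
    """Replace the first stretch of target_html whose tokens match sample_html."""
    target = TokenStream(target_html).tokens
    pattern = [k for k, _ in TokenStream(sample_html).tokens if k is not None]
    hit = find_match(target, pattern) if pattern else None
    if hit is None:
        return target_html
    first, last = hit
    start = target[first][1]
    # The match ends where the next token begins (or at end of input).
    end = target[last + 1][1] if last + 1 < len(target) else len(target_html)
    return target_html[:start] + new_html + target_html[end:]


page = "<p>Intro</p>\n<B>\n  <I>old</I>\n</B>\n<p>End</p>"
print(replace_first(page, "<b><i>old</i></b>", "<em>new</em>"))
```

Recording where each token starts in the original string lets the replacement cut out exactly the matched stretch and leave the rest of the page untouched, messy formatting included.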