
All the Perl that's Practical to Extract and Report


davorg (18)

davorg
  dave@dave.org.uk
http://dave.org.uk/

Hacker, author, trainer


Journal of davorg (18)

Friday July 12, 2002
03:25 PM

HTML::TreeBuilder

[ #6311 ]

HTML::TreeBuilder rocks.

Twice in the last week, I've wanted to create a table of contents for an HTML document. In both cases they've been documents that I've written, so the structure has been pretty well defined and I started to wonder whether I could automate the process.

I decided that HTML::TreeBuilder would be the right tool and started to take a closer look at the module. It's the first time I've ever really used it for anything useful.

The result (after about an hour of hacking) is toc.pl. Currently it's pretty closely tied to the structure of my documents (it relies on the existence of <div> sections called "front" and "body"), but it can very easily be used as a basis for other similar scripts.
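For anyone who wants a feel for the approach, here is a minimal sketch of pulling headings out of a "body" section with HTML::TreeBuilder. It is not toc.pl itself. The sample document, the build_toc name, and the reading of a div "called body" as one with id="body" are all assumptions made for illustration.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical sample input, standing in for a real document.
my $sample = <<'END_HTML';
<html><body>
<div id="front"><h1>My Document</h1></div>
<div id="body">
<h2>Introduction</h2><p>Some text.</p>
<h3>Background</h3><p>More text.</p>
<h2>Details</h2><p>Even more text.</p>
</div>
</body></html>
END_HTML

# Return one { level, title } record per h2..h6 heading in the "body" div.
sub build_toc {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);

    # Assumption: a div "called body" is one with id="body".
    my $body = $tree->look_down(_tag => 'div', id => 'body');

    my @toc;
    if ($body) {
        # In list context look_down returns every match, in document order.
        my @headings = $body->look_down(sub { $_[0]->tag =~ /^h[2-6]$/ });
        for my $h (@headings) {
            push @toc, {
                level => substr($h->tag, 1),    # "h3" gives 3
                title => $h->as_trimmed_text,
            };
        }
    }

    $tree->delete;    # release the parse tree explicitly
    return @toc;
}

# Print the entries, indented by heading depth.
for my $entry (build_toc($sample)) {
    print '  ' x ($entry->{level} - 2), $entry->{title}, "\n";
}
```

A fuller script would probably also give each heading an id and emit links to it, but the look_down calls are the core of the idea.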

Share and enjoy :)

  • I did not know that module, but I was really impressed when I read about it in Sean's (excellent) "Perl & LWP".
    So impressed in fact that I stole a couple of its methods and added them to XML::Twig ;--)

    --
    mirod
    • TreeBuilder is great, as is its close relative XML::TreeBuilder [cpan.org]. However, they both use HTML::Entities as found in HTML::Parser [cpan.org], and can't deal with some valid UTF-8 HTML entities because of underlying problems with Perl (or so I'm told). When 5.8 goes final, I for one will upgrade just to get the maximum out of these great modules.

      Both make good efforts at dealing with dirty HTML and XML, where faster tools die...

      Neither tool is perfect, but both are pretty good, and Sean answers his email.

      --
      -- "It's not magic, it's work..."