rats (5689)

Journal of rats (5689)

Monday February 21, 2005
09:31 PM

Scraping HTML with XML::LibXML

[ #23303 ]
Writing a test script to hit a webpage and scrape out enough from the HTML response to verify it is correct...

First test is to (stop and) start my fake xmlrpc server with the response file I want and confirm it's alive. Hmmm. RPC::XML t/* tests do lots of that so let's steal/borrow some code. Hmmm. Net is down (firewall machine again probably). minicpan to the rescue. Minicpan has saved my bacon so many times I've lost count...

Well, that was relatively painless. Randy J Ray writes nice, clean, intelligible Perl code. My script loads an RPC::XML::Server with the canned methods, forks it to a background process, then gets a page from my web app to confirm the xmlrpc server is running correctly.
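For illustration, here is a minimal sketch of that fork-and-check step. It is not the actual test script: the port number and the `test.ping` method are made-up stand-ins for the real canned methods, and the liveness check uses RPC::XML::Client directly rather than going through the web app.

```perl
use strict;
use warnings;
use RPC::XML;
use RPC::XML::Server;
use RPC::XML::Client;

my $port = 9000;    # hypothetical port, pick any free one

my $pid = fork();
die "fork failed: $!" unless defined $pid;

if ($pid == 0) {
    # Child: run the fake server until the parent kills it.
    my $srv = RPC::XML::Server->new(port => $port);
    die "server failed to start: $srv" unless ref $srv;   # new() returns an error string on failure

    $srv->add_method({
        name      => 'test.ping',              # hypothetical canned method
        signature => ['string'],               # returns a string, takes no arguments
        code      => sub { return 'pong' },
    });
    $srv->server_loop;
    exit 0;
}

# Parent: poll until the server answers, then shut it down.
my $client = RPC::XML::Client->new("http://localhost:$port/RPC2");
my $reply;
for my $attempt (1 .. 10) {
    $reply = $client->simple_request('test.ping');   # undef on failure
    last if defined $reply;
    sleep 1;
}

kill 'TERM', $pid;
waitpid $pid, 0;

print defined $reply ? "server alive: $reply\n" : "server did not respond\n";
```

In a real test script the canned methods would return the contents of the response file, and the parent would leave the server running until all the page tests have finished.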

Now comes the fun. I hate HTML scraping but if I have to do it, I really like to use XML::LibXML. Aside from being very fast at parsing (which isn't important for this app), I can use XPath notation to navigate the DOM tree and, even better, there's xsh to let me try out my XPaths interactively. Yes, it's possible to read the HTML code and keep track by hand of how many levels of table/tr/td you are down, but why waste hours when with xsh you can do this in minutes?

Ouch! A small problem. LibXML expects xhtml and crashes all over the place when I ask xsh to parse the HTML output of my webapp. Lucky(!) for me (another reason for choosing CGI::Application) I have moved all the HTML from the old webapp into HTML::Template templates. So it's really easy to rewrite it as xhtml using Vim. (I discovered after rewriting by hand that one of the options for tidy is -asxhtml. It outputs HTML as xhtml. Double d'oh!)

So now that I've got clean xhtml output, I can use xsh to navigate through the parsed tree and find the fields I expect to see in the page if the webapp is working correctly. The first one I want has an XPath of

/html/body/table/tr[2]/td[2]/form/a[6]

thanks to xsh. Glad I didn't have to work that one out by hand. It's a link to expand the tree. So I'll use WWW::Mechanize to click the link and grab the response and verify it returned the required number of tables in the correct order with the correct contents. And then on to the next test...
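Once xsh has produced the XPath, using it from the test script is short. The sketch below runs that XPath against a small made-up page (the markup and the href values are invented for illustration, not taken from the real webapp) and counts the tables, which is the kind of check the follow-up tests need.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical stand-in for the webapp's xhtml output.
my $xhtml = <<'XHTML';
<html>
  <body>
    <table>
      <tr><td>header</td></tr>
      <tr>
        <td>menu</td>
        <td>
          <form action="app.cgi">
            <a href="?n=1">one</a>
            <a href="?n=2">two</a>
            <a href="?n=3">three</a>
            <a href="?n=4">four</a>
            <a href="?n=5">five</a>
            <a href="?n=6">expand</a>
          </form>
        </td>
      </tr>
    </table>
  </body>
</html>
XHTML

# parse_string uses the strict XML parser, so the input must be well formed.
my $doc = XML::LibXML->new->parse_string($xhtml);

# The XPath found interactively with xsh.
my ($link) = $doc->findnodes('/html/body/table/tr[2]/td[2]/form/a[6]');
die "expand link not found" unless $link;

my $href   = $link->getAttribute('href');
my @tables = $doc->findnodes('//table');

print "link target: $href\n";
print "tables on page: ", scalar(@tables), "\n";
```

One caveat: if the page declares the XHTML namespace on its root element, an unprefixed path like this one matches nothing, and the lookup has to go through XML::LibXML::XPathContext with a registered prefix instead.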

  • LibXML expects xhtml and crashes all over the place when I ask xsh to parse the HTML output of my webapp

    You know you can use parse_html* methods and set ->recover(1) to parse poorly formed HTML, right? I don't know if xsh supports this but if not, it should be easy to hack in.

    • xsh does indeed work fine with LibXML's recover mode. Type help recovering and help open in the shell for details.
      • 'Crash' was being too harsh. xsh (i.e. LibXML) actually spits out a warning for each error in the HTML/XML with recover on. I *want* those errors to display because I am able to fix them in the production templates.

        But thank you (and grantm) for the heads up.
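For anyone who wants to try the recover approach suggested in the comments above from a script rather than from xsh, a minimal sketch follows. The HTML fragment is made up for illustration. With recover(1) the parser still reports each problem on STDERR but returns a usable document.

```perl
use strict;
use warnings;
use XML::LibXML;

# Deliberately sloppy, made-up HTML: a stray </b> and an unclosed <p>.
my $html = <<'HTML';
<html>
<body>
<table>
<tr><td>one</td><td>two</td></tr>
</table>
</b>
<p>unclosed paragraph
</body>
</html>
HTML

my $parser = XML::LibXML->new;
$parser->recover(1);    # carry on past errors instead of dying

# parse_html_string uses libxml2's HTML parser, not the strict XML one.
my $doc = $parser->parse_html_string($html);

my @cells = $doc->findnodes('//td');
print "found ", scalar(@cells), " cells\n";
```

The warnings are the useful part here: each one points at markup that can be fixed in the production templates.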