
All the Perl that's Practical to Extract and Report



I work for MessageLabs [messagelabs.com] in Toronto, ON, Canada. I write spam filters, MTA software, high performance network software, string matching algorithms, and other cool stuff mostly in Perl and C.

Journal of Matts (1087)

Wednesday April 24, 2002
08:23 AM

Who needs SOAP!

[ #4422 ]

XPath is just great at screen scraping, especially when combined with libxml2's xmllint tool for turning HTML into XML...

Here's the current temperature in London:

$ xmllint --html --format http://www.bbc.co.uk/weather/5day.shtml?world=0008 |
  xpath 'normalize-space(string((//tr[starts-with(normalize-space(.), "Temperature")])[2]))'

(The above finds all the <tr>s whose text content starts with "Temperature" (there are two on that page), takes the second of those (which is the current temperature), and then applies normalize-space to its string value, which basically strips the tags and collapses the whitespace.)
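The same trick works without shelling out: XML::LibXML (Perl bindings to the same libxml2) can parse the tag soup and evaluate the identical XPath expression. A minimal sketch — the inline markup below is a made-up stand-in for the BBC page (the real markup will differ), and the module needs to be installed from CPAN:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical snippet mimicking the shape of the BBC weather page:
# two rows starting with "Temperature", the second being current.
my $html = <<'HTML';
<html><body>
  <table>
    <tr> <td>Temperature (max)</td> <td>14 C</td> </tr>
  </table>
  <table>
    <tr> <td>Temperature</td> <td><b>11 C</b></td> </tr>
  </table>
</body></html>
HTML

my $parser = XML::LibXML->new;
$parser->recover(1);    # tolerate real-world tag soup
my $doc = $parser->parse_html_string($html);

# Same expression as the command line: the second <tr> whose text
# starts with "Temperature", reduced to a whitespace-normalized string.
my $temp = $doc->findvalue(
    'normalize-space(string((//tr[starts-with(normalize-space(.), "Temperature")])[2]))'
);
print "$temp\n";    # prints "Temperature 11 C"
```

Swapping the heredoc for an LWP fetch of the real page gives you the whole scrape in one process.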

I personally think using XPath for screen scraping is a bit easier than the other ways of doing the same thing, and possibly safer too. Plus you can quite nicely apply this technique to all sorts of useful systems.

The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • I know you're playing Devil's Advocate here, but when someone puts up a web page with data in it, they don't promise that the interface will never change. In fact they usually change it quite often, as you have to keep web designers busy ;--( OTOH if they set up a SOAP server you might have a better chance of the interface being more stable, or at least of your script complaining about a change instead of breaking silently. Upgrading should be easier too.

    Not that I like SOAP myself, mind you...

    --
    mirod
    • The problem is SOAP doesn't always exist.
    • I know you're playing Devil's Advocate here, but when someone puts up a web page with data in it, they don't promise that the interface will never change.

      I'm not so sure Matt is playing devil's advocate. I think he's got his pragmatist hat placed squarely upon his head.

      It would be nice from an ideological point of view if this information were in a constant format (XML-RPC, SOAP, plain XML or even a reasonably static XHTML layout). Realistically, that's not going to happen on a large scale any time soon.

      • Hmmmm, this whole thing is starting to make me wonder if it's not time that I should grab my old XML+CSS bat out of the cupboard and start practising a few swings... wouldn't it indeed be cool if REST style services were both human and computer readable?

        Who knows, maybe this time around it won't be just Simon St. Laurent and/or me vs. xml-dev...

        --
        Robin Berjon [berjon.com]

  • but this is what I get

    http://www.bbc.co.uk/weather/5day.shtml?world=0008:115: error: htmlParseEntityRef: no name
    Helvetica" SIZE="2"><a href="/weather/sports/index.shtml" class="index">Sport
                                                                                  ^
    http://www.bbc.co.uk/weather/5day.sht
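    (That "htmlParseEntityRef: no name" is libxml2 flagging a bare "&" in the BBC page's markup. In --html mode the parse usually still succeeds despite the complaints, so sending stderr to /dev/null (2>/dev/null) hides the noise; from Perl, XML::LibXML's recover mode does the same. A minimal sketch on an inline snippet containing exactly that kind of stray ampersand:)

    ```perl
    use strict;
    use warnings;
    use XML::LibXML;

    # A bare '&' is what triggers "htmlParseEntityRef: no name".
    my $soup = '<html><body><p>fish & chips</p></body></html>';

    my $parser = XML::LibXML->new;
    $parser->recover(1);    # keep going instead of bailing out
    my $doc  = $parser->parse_html_string($soup);
    my $text = $doc->findvalue('normalize-space(//p)');
    print "$text\n";        # the '&' survives as literal text
    ```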

  • hrgab [pault.com] - that's the way I read the internet ;-)

    XSLScript (xpath) + Chunks + SQL + perl

    XPath is good for trees, but it sucks with 'flat' things (such as mixed content). It can be improved (see BiXpath, which is kinda 'derived' from Perl regexps).

    Overall, I agree that XPath is the best existing thing.