XPath is just great at screen scraping, especially when combined with libxml2's xmllint tool for turning HTML into XML...
Here's the current temperature in London:
$ xmllint --html --format 'http://www.bbc.co.uk/weather/5day.shtml?world=0008' |
xpath 'normalize-space(string((//tr[starts-with(normalize-space(.), "Temperature")])[2]))'
(The above finds all the <tr>s whose text content starts with "Temperature" (of which there are two on that page), then takes the second of those (which is the current temperature), and then does a normalize-space on its string value (which basically means stripping all the tags and collapsing the whitespace).)
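For anyone who wants to see what that expression is doing step by step, here is a rough Python sketch of the same selection logic. It is only an approximation: the standard library's ElementTree supports just a subset of XPath (no starts-with() or normalize-space()), so those two functions are mimicked by hand, and the SAMPLE markup and helper names are made up for illustration rather than taken from the BBC page.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the XML that xmllint --html would produce.
SAMPLE = """<html><body><table>
<tr><td>Temperature</td><td><b>12</b> C (forecast)</td></tr>
<tr><td>Wind</td><td>SW 10 mph</td></tr>
<tr><td> Temperature </td><td><b>14</b>
   C (current)</td></tr>
</table></body></html>"""


def normalize_space(text):
    """Approximate XPath normalize-space(): trim and collapse whitespace."""
    return " ".join(text.split())


def string_value(elem):
    """Approximate XPath string(): join all descendant text, dropping tags."""
    return "".join(elem.itertext())


def nth_row_starting_with(root, prefix, n):
    """Return the normalized text of the n-th (1-based) <tr> whose text starts with prefix."""
    rows = [
        tr for tr in root.iter("tr")
        if normalize_space(string_value(tr)).startswith(prefix)
    ]
    return normalize_space(string_value(rows[n - 1]))


root = ET.fromstring(SAMPLE)
result = nth_row_starting_with(root, "Temperature", 2)
print(result)
```

The filter on the list of rows plays the role of the [starts-with(...)] predicate, and rows[n - 1] plays the role of the outer [2], since XPath positions count from 1.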
I personally think using XPath for screen scraping is a bit easier than other methods of doing the same, and possibly safer too. Plus you can quite nicely apply this technique to all sorts of useful systems.
HTML is not an interface! (Score:2)
I know you're playing Devil's Advocate here, but when someone puts up a web page with data in it, they don't promise that the interface will never change. In fact they usually change it quite often, as you have to keep web designers busy ;--( OTOH if they set up a SOAP server you might have a better chance of the interface staying stable, or at least of your script complaining about a change instead of breaking silently. Upgrading should be easier too.
Not that I like SOAP myself, mind you...
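To make the "complain instead of breaking silently" point concrete, a scraper can at least check its own assumptions about the page. The following is a minimal sketch under stated assumptions: the two sample pages, the LayoutChanged exception and the scrape_temperature function are all hypothetical, and the rule "exactly two rows start with Temperature" is simply the assumption the original one-liner relies on.

```python
import xml.etree.ElementTree as ET

# Hypothetical pages: the layout the scraper was written for, and a redesign.
OLD_PAGE = ("<table><tr><td>Temperature</td> <td>12 C</td></tr>"
            "<tr><td>Temperature</td> <td>14 C</td></tr></table>")
NEW_PAGE = "<div><p>Temperature: 14 C</p></div>"

EXPECTED_ROWS = 2  # assumption baked into the scraper


class LayoutChanged(Exception):
    """Raised when the page no longer matches what the scraper expects."""


def scrape_temperature(xml_text):
    root = ET.fromstring(xml_text)
    # Text of every <tr>, with tags dropped and whitespace collapsed.
    rows = [" ".join("".join(tr.itertext()).split()) for tr in root.iter("tr")]
    matches = [row for row in rows if row.startswith("Temperature")]
    if len(matches) != EXPECTED_ROWS:
        # Fail loudly rather than return the wrong row or an empty string.
        raise LayoutChanged(
            f"expected {EXPECTED_ROWS} 'Temperature' rows, found {len(matches)}"
        )
    return matches[1]


print(scrape_temperature(OLD_PAGE))
try:
    scrape_temperature(NEW_PAGE)
except LayoutChanged as err:
    print("page layout changed:", err)
```

This doesn't make HTML an interface, of course; it only turns a silent wrong answer into an error you get to see.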
mirod
Re:HTML is not an interface! (Score:2)
I'm not so sure Matt is playing devil's advocate. I think he's got his pragmatist hat placed squarely upon his head.
It would be nice from an ideological point of view if this information were in a constant format (XML-RPC, SOAP, plain XML or even a reasonably static XHTML layout). Realistically, that's not going to happen on a large scale any time
Re:HTML is not an interface! (Score:2)
Hmmmm, this whole thing is starting to make me wonder if it's not time that I should grab my old XML+CSS bat out of the cupboard and start practising a few swings... wouldn't it indeed be cool if REST style services were both human and computer readable?
Who knows, maybe this time around it won't be just Simon St. Laurent and/or me vs. xml-dev...
-- Robin Berjon [berjon.com]
I'm not sure what this was meant to do (Score:1)
Re:I'm not sure what this was meant to do (Score:2)
Yes.
I'm not sure how to turn off the other stuff in xmllint. Perhaps 2>/dev/null would do it.
Re:I'm not sure what this was meant to do (Score:2)
xpath... (Score:1)
XSLScript (xpath) + Chunks + SQL + perl
XPath is good for trees, but it sucks with 'flat' things (such as mixed content). It can be improved (see BiXpath, which is kind of 'derived' from Perl regexps).
Overall, I agree that XPath is the best existing thing.