I'm just starting some scraping work, and W::S has worked great so far. The docs are lacking, though; the examples you posted in past journal entries helped! I have a few questions:
The example from the docs has: process "h3.ens>a",
where the ens seems to be doing wildcard matching, i.e. matching any class name that contains ens.
The HTML page contains UTF-8 characters such as è, which made HTML::Parser complain: "Parsing of undecoded UTF-8 will give garbage when decoding entities".
HTML::Parser mentions encoding the data.
1. If you want wildcard matching, you can change the selector expression to something like ".ens>a".
2. Web::Scraper does what it can to decode UTF-8 bytes back to Unicode characters, as long as you pass a URI object and the HTML page has a correct Content-Type header. Otherwise, you need to fetch the page into a variable and call Encode::decode to get the Unicode characters back.
3. The result keyword specifies which stash variable you want to get back as the result. You can omit it if you want th
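A minimal sketch of the fallback in point 2, assuming the page has already been fetched into a variable as raw UTF-8 bytes. The HTML snippet and the variable names are made up for illustration; only Encode::decode comes from the reply above.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical raw bytes, as they would arrive from an HTTP fetch.
# \xC3\xA8 is the two-byte UTF-8 encoding of the character è.
my $bytes = qq{<h3 class="ens"><a href="/x">caf\xC3\xA8</a></h3>};

# Turn the UTF-8 bytes into Perl characters before any HTML parsing happens.
my $html = decode('UTF-8', $bytes);

# The decoded string is one unit shorter: two bytes became one character.
printf "bytes: %d, characters: %d\n", length($bytes), length($html);

# $html is what would then be handed to the scraper, e.g.
#   my $res = $scraper->scrape($html);
```

Decoding first means HTML::Parser sees characters rather than undecoded bytes, which is the situation its warning is about.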
No, ".ens>a" does an exact match; in other words, it matches the exact class name. If you want to match partial class names, you might need to do a[@class=~"ens"] or something like that. Read the CSS Selector spec [w3.org] for details.
No, actually, “.ens > a” matches an “a” element that is a child of an element of any name with class “ens”, whereas “a[class*="ens"]” wants to see the class on the “a” element itself. The partial-match version would actually be “*[class*="ens"] > a”.
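For reference, the CSS selector spec treats ".ens" and [class~="ens"] as matching one whole whitespace-separated class name, while [class*="ens"] is a substring match. Here is a small sketch of the difference with Web::Scraper, run on an inline HTML string rather than a URI. The markup and the "links" key are invented for illustration, and it assumes an installed HTML::Selector::XPath recent enough to understand *=.

```perl
use strict;
use warnings;
use Web::Scraper;

# Made-up markup: one exact class, one element with two classes,
# and one class that merely contains the letters "ens".
my $html = <<'HTML';
<html><body>
<h3 class="ens"><a href="/a">exact</a></h3>
<h3 class="ens first"><a href="/b">multi</a></h3>
<h3 class="tens"><a href="/c">substring</a></h3>
</body></html>
HTML

# Class selector: the h3 must carry "ens" as a whole class name.
my $by_class = scraper {
    process 'h3.ens > a', 'links[]' => 'TEXT';
};

# Attribute substring selector: "ens" may appear anywhere in the class attribute.
my $by_substr = scraper {
    process 'h3[class*="ens"] > a', 'links[]' => 'TEXT';
};

my @class_hits  = @{ $by_class->scrape($html)->{links}  || [] };
my @substr_hits = @{ $by_substr->scrape($html)->{links} || [] };

print "class selector:  @class_hits\n";
print "substring match: @substr_hits\n";
```

The second h3 shows why class="ens first" still matches ".ens": the attribute holds two class names, and the selector only needs one of them.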
Does ".ens>a" match any class name containing the string 'ens'? What is the syntax for exact matching on a class name, then?
Er, my bad. I thought class="listing first" is one class name. It is 'listing' and 'first'.
Great module, thanks!