Stories
Slash Boxes
Comments
NOTE: use Perl; is on undef hiatus. You can read content, but you can't post it. More info will be forthcoming forthcomingly.

All the Perl that's Practical to Extract and Report

use Perl Log In

Log In

[ Create a new account ]

agent (5836)

agent
  agentzh@yahoo.cn
http://agentzh.spaces.live.com/

Agent Zhang (章亦春) is a happy Yahoo! China guy who loves Perl more than anything else.

Journal of agent (5836)

Saturday November 29, 2008
08:48 AM

Q4 is crazy!

[ #37972 ]

Yeah, Q4 is really crazy! I've been hacking on several company projects in parallel over the last few weeks. Fortunately they're all very interesting stuffs.

We've just kicked OpenResty 0.5.2 out of the door and I'm preparing for the 0.5.3 release right now. My teammate xunxin++ has quickly implemented the YLogin handler for OpenResty, via which the users can use Yahoo! ID to login their own applications on OpenResty. Our Yahoo! registeration team helpfully worked out a sane design to allow us to reuse the Yahoo! Login system, which effectively turned Yahoo! ID into something like a passport, at least from the perspective of OpenResty users :) Big moment! Lots of company products using Yahoo! IDs could be rewritten in 100% JavaScript! Actually our team is already rewriting the Search DIY product using all the goodies offered by OpenResty.

Meanwhile, some guys from Sina.com are doing their personal projects in OpenResty. They said they really appreciated the great opportunities provided by the OpenResty architecture since various kinds of clients (e.g. web sites, cellphones, desktop apps, and etc.) could share the same set of API via OpenResty's web services). They also sent a handful of useful feedbacks and suggestions regarding OpenResty's design and implementation.

I've also been working on an intelligent crawler cluster based on Firefox, Apache mod_proxy/mod_cache, and OpenResty. The crawler itself is a plain Firefox extension named List Hunter:

    http://agentzh.org/misc/listhunter.xpi

It's an enhanced version of the Haiway List Recognization Engine used by my SearchAll extension and also built by my XUL::App framework. You can install it to your Firefox and play with it if you like ;) What this extension does is very simple: recognizing "list regions" and "text regions" in an arbitrary web page and further deciding automatically whether it's a "list page" or a "text page". The latter functionality may sound a bit weird: why is it useful to categorize web pages that way? Anyway, our PM (Product Manager) has crazy ideas about that categorization in our Live Search project and knows better than us ;)

Turning such a Firefox extension into tens or even hundreds of Firefox crawlers running on a bunch of production machines requires a lot of work. I devised a prefetching system which prefetches HTML pages and CSS files included in them, and caches the headers and contents for a fixed amount of time in such a way that Firefox crawlers can later load pages and CSS stuffs directly from the same cache in our local network, thus significantly reducing the page loading time in Gecko. The cache is a heavily patched version of Apache2's mod_cache with mod_disk_cache as the backend storage. The way prefetchers and crawlers interact with the Internet and the cache is via HTTP proxies based on Apache2's mod_proxy. Pipeling the prefetching and crawling processes requires OpenResty with PgQ enabled. Well, I'm still working on this cluster and my goal is 2 pages/sec for every single Firefox process. Firefox 3.1's amazing performance boost (more than 30% faster according to my own benchmark) makes me very confident in abusing Gecko to build efficient crawlers that takes advantage of the rich rendering information.

Another Firefox crawler project haunting my head is a similar one that automatically recognizes and extracts user comments from arbatrary web pages (if any comments appear, of course). Such tasks would be hard if my code has to run without the geometric informations of every DOM nodes provided by the browser rendering engine (in the form of offsetWidth, offsetHeight, offsetTop, and offsetLeft attributes of DOM elements). Some other collegues in our Alibaba's Search Tech Center are putting their head around Cobra, a pure Java HTML renderer. But I'm doubting that it would run more correctly or more efficiently than Gecko. Oh well, I'm not a Java guy anyway...

Finally, just a short note: I had a wonderful time with clkao and Jesse Vincent at Beijing Perl Workshop 2008. I learned pretty a lot about the Prophet internals during the hackation after the conference, and Jesse quickly hacked out a stub OpenResty model API for Prophet. Then we went to the Great Wall the next day. I was amazed to find Jesse hacking crazily on the Great Wall and enjoying the sunshines alone...Wow.

Enough blogging...back to hacking ;)

P.S. This journal was originally posted to my own blog site as http://blog.agentzh.org/#post-97

The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
 Full
 Abbreviated
 Hidden
More | Login | Reply
Loading... please wait.