
All the Perl that's Practical to Extract and Report


agent (5836)

agent
  agentzh@yahoo.cn
http://agentzh.spaces.live.com/

Agent Zhang (章亦春) is a happy Yahoo! China guy who loves Perl more than anything else.

Journal of agent (5836)

Wednesday April 22, 2009
09:56 PM

SSH::Batch: Treating clusters as maths sets and intervals

System administration is also part of my $work. Playing with a (big) bunch of  machines without a handy tool is painful. So I refactored some of our old scripts and released SSH::Batch, a collection of useful parallel ssh scripts, to CPAN:

    http://search.cpan.org/dist/SSH-Batch/

SSH::Batch allows you to name your clusters using variables and interval/set syntax in your ~/.fornodesrc config file. For instance:

    $ cat ~/.fornodesrc
    A=foo[01-03].com bar.org
    B=bar.org baz[a-b,d,e-g].cn foo02.com
    C={A} * {B}
    D={A} - {B}

where cluster C is the intersection of clusters A and B, while D contains those machines in A but not in B.

And then you can query the host list of a cluster using SSH::Batch's fornodes script:

   $ fornodes '{C}'
   bar.org
   foo02.com

   $ fornodes '{D}'
   foo01.com
   foo03.com
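
The set semantics above can be sketched in a few lines of JavaScript. This is only an illustration under stated assumptions, not SSH::Batch's actual (Perl) implementation: the helper names `expandPattern`, `expandCluster`, `intersection` and `difference` are hypothetical, and only a single bracket group with numeric or single-letter intervals is handled.

```javascript
// Hypothetical helpers; SSH::Batch's real parser supports more syntax.

// Expand one host pattern such as "foo[01-03].com" or "baz[a-b,d,e-g].cn".
function expandPattern(pattern) {
  const m = /^(.*?)\[([^\]]+)\](.*)$/.exec(pattern);
  if (!m) return [pattern]; // no bracket group: a plain host name
  const [, prefix, body, suffix] = m;
  const hosts = [];
  for (const part of body.split(",")) {
    const [lo, hi = lo] = part.split("-"); // "d" is treated as "d-d"
    if (/^\d+$/.test(lo)) {
      // Numeric interval: keep the zero padding of the lower bound.
      for (let n = Number(lo); n <= Number(hi); n++) {
        hosts.push(prefix + String(n).padStart(lo.length, "0") + suffix);
      }
    } else {
      // Single-letter interval such as "e-g".
      for (let c = lo.charCodeAt(0); c <= hi.charCodeAt(0); c++) {
        hosts.push(prefix + String.fromCharCode(c) + suffix);
      }
    }
  }
  return hosts;
}

// Expand a whitespace-separated cluster definition into a Set of hosts.
function expandCluster(definition) {
  return new Set(definition.trim().split(/\s+/).flatMap(expandPattern));
}

const intersection = (a, b) => new Set([...a].filter((h) => b.has(h)));
const difference = (a, b) => new Set([...a].filter((h) => !b.has(h)));

const A = expandCluster("foo[01-03].com bar.org");
const B = expandCluster("bar.org baz[a-b,d,e-g].cn foo02.com");

console.log([...intersection(A, B)].sort()); // [ 'bar.org', 'foo02.com' ]
console.log([...difference(A, B)].sort());   // [ 'foo01.com', 'foo03.com' ]
```

The two printed lists correspond to the fornodes output for '{C}' and '{D}' shown above.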

Furthermore, to run a command on a cluster with a concurrency of 6:

   atnodes 'ls -lh' '{A} + {B}' my.more.com -c 6

Or upload a local file to the remote cluster:

  tonodes ~/my.tar.gz '{A} / {B}' :/tmp/

There's also a key2nodes script to push SSH public keys to remote machines ;)

A colleague in Alibaba B2B is already using it. And one of my teammates is going to use it to operate on those thousands of machines in our instance of the YST (Yahoo! Search Technology) cluster and I'm ready to receive more feedback from him ;)

Have fun :)

P.S. This entry was originally posted to my own blog site as http://blog.agentzh.org/#post-105

Thursday April 09, 2009
11:01 PM

My VDOM.pm & WebKit Cluster Talk at the April Meeting of Bei

Last night I gave a talk to our PerlChina folks at the April meeting in the Flow Bar. Here are the slides that I used:

The XUL format is the best among the three ;)

Just as the topic of the talk suggests, we're migrating from Firefox clusters to WebKit ones. I'll post more details here in the near future.

Enjoy!

P.S. This entry was originally posted to my own blog site as http://blog.agentzh.org/#post-104

Friday February 13, 2009
12:13 AM

The slides for my talk on Firefox cluster & vision-based web

I gave a talk at the Beijing Perl Mongers' Feb Meeting last night. It was about my Firefox cluster and vision-based web page extraction technology. I had not expected to see so many people there. Wow. The talk was well received and people asked lots of interesting questions :)

The slides can be freely downloaded from my site (open the ffcluster.xul file in the tarball via Firefox):

    http://agentzh.org/misc/slides/BJPW200902.tar.gz

or browse them directly online with Firefox:

    http://agentzh.org/misc/slides/BJPW200902/ffcluster.xul

Because it has many big pictures in it, it's recommended to download it to your local machine first and view it offline :)

I'll also give this presentation again to those Ruby/Python/Java/C++ guys at Beijing OpenParty's Fox meeting:

    http://www.beijing-open-party.org/index.php/2009/02/beijing-open-party-2009-02-fox-event-begin.html

As a side note: recently I've been intrigued by Apache C hacking. My mod_libmemcached_cache is my first Apache module. And I'd love to see more in the near future, such as mod_openresty ;)

Have fun!

P.S. This entry was originally posted to my personal blog site: http://blog.agentzh.org/#post-102

Saturday November 29, 2008
08:48 AM

Q4 is crazy!

Yeah, Q4 is really crazy! I've been hacking on several company projects in parallel over the last few weeks. Fortunately they're all very interesting stuff.

We've just kicked OpenResty 0.5.2 out of the door and I'm preparing for the 0.5.3 release right now. My teammate xunxin++ has quickly implemented the YLogin handler for OpenResty, via which users can use their Yahoo! IDs to log in to their own applications on OpenResty. Our Yahoo! registration team helpfully worked out a sane design to allow us to reuse the Yahoo! Login system, which effectively turned Yahoo! ID into something like a passport, at least from the perspective of OpenResty users :) Big moment! Lots of company products using Yahoo! IDs could be rewritten in 100% JavaScript! Actually our team is already rewriting the Search DIY product using all the goodies offered by OpenResty.

Meanwhile, some guys from Sina.com are doing their personal projects in OpenResty. They said they really appreciated the great opportunities provided by the OpenResty architecture, since various kinds of clients (e.g. web sites, cellphones, desktop apps, etc.) could share the same set of APIs via OpenResty's web services. They also sent useful feedback and a handful of suggestions regarding OpenResty's design and implementation.

I've also been working on an intelligent crawler cluster based on Firefox, Apache mod_proxy/mod_cache, and OpenResty. The crawler itself is a plain Firefox extension named List Hunter:

    http://agentzh.org/misc/listhunter.xpi

It's an enhanced version of the Haiway List Recognization Engine used by my SearchAll extension and is also built with my XUL::App framework. You can install it in your Firefox and play with it if you like ;) What this extension does is very simple: it recognizes "list regions" and "text regions" in an arbitrary web page and then decides automatically whether it's a "list page" or a "text page". The latter functionality may sound a bit weird: why is it useful to categorize web pages that way? Anyway, our PM (Product Manager) has crazy ideas about that categorization in our Live Search project and knows better than us ;)

Turning such a Firefox extension into tens or even hundreds of Firefox crawlers running on a bunch of production machines requires a lot of work. I devised a prefetching system which prefetches HTML pages and the CSS files included in them, and caches the headers and contents for a fixed amount of time, so that Firefox crawlers can later load pages and CSS files directly from the same cache in our local network, thus significantly reducing the page loading time in Gecko. The cache is a heavily patched version of Apache2's mod_cache with mod_disk_cache as the backend storage. The prefetchers and crawlers interact with the Internet and the cache via HTTP proxies based on Apache2's mod_proxy. Pipelining the prefetching and crawling processes requires OpenResty with PgQ enabled. Well, I'm still working on this cluster and my goal is 2 pages/sec for every single Firefox process. Firefox 3.1's amazing performance boost (more than 30% faster according to my own benchmark) makes me very confident in abusing Gecko to build efficient crawlers that take advantage of the rich rendering information.
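
The "cache for a fixed amount of time" idea can be sketched as a tiny in-memory structure. This is only an illustration: the real cache described above is a patched Apache2 mod_cache with disk storage, and the class name `FixedTtlCache`, its methods and the injectable clock are all hypothetical.

```javascript
// Hypothetical in-memory stand-in for a shared prefetch cache.
class FixedTtlCache {
  // ttlMs: lifetime of every entry; now: clock function (injectable for tests).
  constructor(ttlMs, now = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
    this.entries = new Map();
  }

  // Called by a prefetcher after it has downloaded a page or a CSS file.
  put(url, headers, body) {
    this.entries.set(url, { headers, body, storedAt: this.now() });
  }

  // Called by a crawler; returns undefined on a miss or an expired entry.
  get(url) {
    const entry = this.entries.get(url);
    if (!entry) return undefined;
    if (this.now() - entry.storedAt >= this.ttlMs) {
      this.entries.delete(url); // expired: drop it so it can be prefetched again
      return undefined;
    }
    return entry;
  }
}

// Usage with a fake clock so that expiry is easy to see.
let clock = 0;
const cache = new FixedTtlCache(60000, () => clock);
cache.put("http://example.com/", { "content-type": "text/html" }, "<html></html>");
clock = 59999;
console.log(cache.get("http://example.com/") !== undefined); // true
clock = 60000;
console.log(cache.get("http://example.com/") !== undefined); // false
```

Keeping the lifetime fixed, rather than honoring each site's own cache headers, means a crawler that runs shortly after the prefetcher always gets a hit.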

Another Firefox crawler project haunting my head is a similar one that automatically recognizes and extracts user comments from arbitrary web pages (if any comments appear, of course). Such tasks would be hard if my code had to run without the geometric information of every DOM node provided by the browser rendering engine (in the form of the offsetWidth, offsetHeight, offsetTop, and offsetLeft attributes of DOM elements). Some other colleagues in our Alibaba's Search Tech Center are wrapping their heads around Cobra, a pure Java HTML renderer. But I doubt that it would run more correctly or more efficiently than Gecko. Oh well, I'm not a Java guy anyway...
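
To make the role of the offset* attributes concrete, here is a minimal sketch of how rendered geometry could feed a layout heuristic. It is not the List Hunter algorithm: `absoluteBox` and `looksLikeList` are hypothetical names, the heuristic is invented for illustration, and plain objects stand in for DOM elements so that the code also runs outside a browser.

```javascript
// Sum offsetLeft/offsetTop up the offsetParent chain to get page coordinates.
// Simplification: ignores borders, scrolling and CSS transforms.
function absoluteBox(el) {
  let left = 0;
  let top = 0;
  for (let node = el; node; node = node.offsetParent) {
    left += node.offsetLeft;
    top += node.offsetTop;
  }
  return { left, top, width: el.offsetWidth, height: el.offsetHeight };
}

// Hypothetical heuristic: at least three boxes that share a left edge and a
// width (within a tolerance) and are stacked top to bottom look like a list.
function looksLikeList(boxes, tolerance = 2) {
  if (boxes.length < 3) return false;
  const first = boxes[0];
  return boxes.every((box, i) =>
    Math.abs(box.left - first.left) <= tolerance &&
    Math.abs(box.width - first.width) <= tolerance &&
    (i === 0 || box.top >= boxes[i - 1].top + boxes[i - 1].height));
}

// Plain objects standing in for a <body>, a list container and three items.
const body = { offsetLeft: 0, offsetTop: 0, offsetWidth: 800, offsetHeight: 600, offsetParent: null };
const list = { offsetLeft: 20, offsetTop: 100, offsetWidth: 300, offsetHeight: 90, offsetParent: body };
const items = [0, 1, 2].map((i) => ({
  offsetLeft: 10, offsetTop: i * 30, offsetWidth: 280, offsetHeight: 30, offsetParent: list,
}));

const boxes = items.map(absoluteBox);
console.log(boxes[1]);             // { left: 30, top: 130, width: 280, height: 30 }
console.log(looksLikeList(boxes)); // true
```

Without a rendering engine none of these numbers exist, which is why working from the HTML source alone makes this kind of recognition much harder.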

Finally, just a short note: I had a wonderful time with clkao and Jesse Vincent at Beijing Perl Workshop 2008. I learned quite a lot about the Prophet internals during the hackathon after the conference, and Jesse quickly hacked out a stub OpenResty model API for Prophet. Then we went to the Great Wall the next day. I was amazed to find Jesse hacking crazily on the Great Wall and enjoying the sunshine alone... Wow.

Enough blogging...back to hacking ;)

P.S. This journal was originally posted to my own blog site as http://blog.agentzh.org/#post-97

Sunday September 28, 2008
05:36 AM

Now we have Actions!

On behalf of the OpenResty team, I'm happy to announce that OpenResty 0.5.0 has been released to CPAN, which means OpenResty has hit its 5th milestone indicated by a working Action API.

I've found Actions very useful in grouping together concurrent AJAX requests, which makes web pages load much faster. Our blog sites are already taking full advantage of this trick:

    http://blog.agentzh.org
    http://www.eeeeworks.org

Also, Actions ensure cascaded requests run in exactly the expected order and the REST interfaces are called (mostly) in the expected way (e.g. from the end users' web browser). There used to be a serious security hole in the above blog sites because, before we had Actions, I had to expose PUT /=/model/Post/~/~ to the Public role for updating the "comments" field in the Post model.

The main server for OpenResty, api.openresty.org, has already been upgraded to 0.5.0. If you want to play with OpenResty directly on our servers, feel free to write to me (agentzh at yahoo dot cn) and get an account for free!

Enjoy!

Saturday September 27, 2008
10:23 PM

pod2html.js: Some JavaScript love for POD in a browser

It's fun to do POD (Plain Old Documentation) in a web browser and I've hacked up a JavaScript implementation of the pod2html utility (actually the output is more like Pod::Simple::HTML's).

The pod2html.js script is in OpenResty's SVN repository:

   http://svn.openfoundry.org/openapi/trunk/demo/Onccf/js/pod2html.js

The API is straightforward, for instance,

   var pod = "=head1 Blah\n\nI<Hello>, C<world>!\n";
   var html = pod2html(pod);

The following web site is already making use of it:

   http://agentzh.org/misc/onccf/out/

By sniffing the background AJAX requests (e.g. using Firebug), you can see raw POD is retrieved from the OpenResty server and converted to HTML on-the-fly in your browser.

It's worth mentioning that I had a lot of fun combining Test::Base and JavaScript::SpiderMonkey to test this piece of JavaScript code in pure Perl. You can check out the test script here:

   http://svn.openfoundry.org/openapi/trunk/demo/Onccf/t/01-pod2html.t

By looking at the (declarative) test cases, it's trivial to see what it can do (and hopefully what it can't) :)

For the record, as of this writing, the following POD directives are supported:

  =headN, =over, =item *, =item NUM., =item TEXT, =back, =begin html, =end html, =begin ANY, =end ANY, =cut (it's a no-op), =encoding ANY (it's a no-op)

and the following POD markups are implemented:

   C<...>, I<...>, B<...>, L<...>, F<...>

I've also implemented the (non-standard) =image directive for convenience. For example,

   =image gate.jpg

will be converted to

   <p><img src="gate.jpg"/></p>
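
To give a feel for what such a conversion involves, here is a deliberately tiny sketch covering only =headN, =image, ordinary paragraphs and non-nested C<...>, I<...>, B<...>. It is not the pod2html.js from the repository above: `miniPod2Html`, `inlineToHtml` and `escapeHtml` are hypothetical names, and the choice of output tags (code, i, b) is an assumption.

```javascript
// Hypothetical mini-converter for illustration; NOT the real pod2html.js.
function escapeHtml(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Assumed mapping from POD formatting codes to HTML tags.
const TAGS = { C: "code", I: "i", B: "b" };

// Convert non-nested C<...>, I<...> and B<...>; escape everything else.
function inlineToHtml(text) {
  const re = /([CIB])<([^<>]*)>/g;
  let out = "";
  let last = 0;
  let m;
  while ((m = re.exec(text)) !== null) {
    out += escapeHtml(text.slice(last, m.index));
    const tag = TAGS[m[1]];
    out += `<${tag}>${escapeHtml(m[2])}</${tag}>`;
    last = re.lastIndex;
  }
  return out + escapeHtml(text.slice(last));
}

function miniPod2Html(pod) {
  return pod
    .split(/\n{2,}/) // POD paragraphs are separated by blank lines
    .map((p) => p.trim())
    .filter((p) => p !== "")
    .map((p) => {
      let m;
      if ((m = /^=head([1-6])\s+(.*)$/s.exec(p))) {
        return `<h${m[1]}>${inlineToHtml(m[2])}</h${m[1]}>`;
      }
      if ((m = /^=image\s+(\S+)$/.exec(p))) {
        return `<p><img src="${escapeHtml(m[1])}"/></p>`;
      }
      return `<p>${inlineToHtml(p)}</p>`;
    })
    .join("\n");
}

console.log(miniPod2Html("=head1 Blah\n\nI<Hello>, C<world>!\n"));
// <h1>Blah</h1>
// <p><i>Hello</i>, <code>world</code>!</p>
```

Lists (=over/=item/=back), L<...> links, =begin/=end blocks and nested formatting codes are left out on purpose; they are what make a real converter interesting.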

Have fun!

P.S. This journal was originally posted to my personal blog site: http://blog.agentzh.org/#post-93

Tuesday August 05, 2008
05:27 AM

Filter::QuasiQuote 0.01 is now on CPAN

After reading Audrey's blog post mentioning GHC's upcoming quasiquoting feature (as well as that quasiquoting paper), I quickly hacked up a (simple) quasiquoting mechanism for Perl, hence the Filter::QuasiQuote module already on CPAN:

http://search.cpan.org/perldoc?Filter::QuasiQuote

I'm looking forward to using sensible filters in my production code (e.g. OpenResty) and eliminating ugly Perl code with embedded DSLs. For example, instead of writing

    my $sql = "alter table " . quote_identifier($table) . " drop column " . quote($column) . ";";

I can simply write

    use OpenResty::QuasiQuote::SQL;
    my $sql = [:sql| alter table $table drop column $column; |];

Also, a JSON-like DSL can be used to describe valid Perl data structures and to generate the Perl code doing validation.

Filter::QuasiQuote supports subclassing, so the OpenResty::QuasiQuote::SQL module mentioned above could be derived from it. Also, multiple concrete filter classes could be composed in a single Perl source file. Just put a series of use statements together:

    use MyQuote1;
    use MyQuote2;

and it should work. Because it's required that filters always return Perl source aligned in a single line, line numbers won't get corrupted.
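
The line-number argument can be illustrated with a small sketch. It is written in JavaScript purely for illustration and is not how Filter::QuasiQuote is implemented: `applyQuasiQuotes` and the `sql` expander are hypothetical, and padding the expansion with the newlines consumed by a multi-line quotation is an assumption about one way to keep later lines in place.

```javascript
// Hypothetical filter: expands [:name| body |] blocks in Perl-like source text.
// expanders is a Map from quoter name to a function (body) => generated code.
function applyQuasiQuotes(source, expanders) {
  return source.replace(/\[:(\w+)\|([\s\S]*?)\|\]/g, (whole, name, body) => {
    const expand = expanders.get(name);
    if (!expand) return whole; // unknown quoter: leave the text untouched
    // Force the generated code onto a single line...
    const oneLine = expand(body.trim()).replace(/\s*\n\s*/g, " ");
    // ...then pad with the newlines the quotation spanned (an assumption),
    // so that every following line keeps its original line number.
    const lineBreaks = (whole.match(/\n/g) || []).length;
    return oneLine + "\n".repeat(lineBreaks);
  });
}

// A toy "sql" expander: turns $vars into quote_identifier() calls.
// Sketch only: it does not escape double quotes inside the body.
const expanders = new Map([
  ["sql", (body) =>
    '"' + body.replace(/\$(\w+)/g, (_, v) => '" . quote_identifier($' + v + ') . "') + '"'],
]);

const source = [
  "my $sql = [:sql|",
  "  alter table $table",
  "  drop column $column;",
  "|];",
  "print $sql;",
  "",
].join("\n");

const filtered = applyQuasiQuotes(source, expanders);
console.log(filtered.split("\n")[0]);
console.log(filtered.split("\n").length === source.split("\n").length); // true
```

Because "print $sql;" stays on the same line before and after filtering, an error reported for that statement still points at the right place in the original file.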

Of course, lots of nice consequences of the Haskell quasiquotations will be lost in my implementation, such as type safety. But the Perl version is much more flexible and powerful (by some definition) ;)

It's still in alpha and could be buggy. Feel free to report bugs or send wishlist items to the CPAN RT site or directly to me ;)

Enjoy!

Thursday June 19, 2008
09:26 PM

UML::Class::Simple 0.10 released

I've just uploaded UML::Class::Simple 0.10 to CPAN; the highlight of this release is XMI format support. It will appear on a CPAN mirror near you in the next few hours.

Thanks to Maxim Zenin for contributing this feature :) A Japanese user was requesting this in his blog as well. If you're an XMI fanboy, feel free to try it out.

Thursday September 20, 2007
04:47 AM

The SearchAll Firefox Plugin and XUL::App framework

My first $job project is now opensourced. It's a Firefox extension named SearchAll.

SearchAll is a simple side-by-side search engine comparing tool which allows you to search at most 3 different search engines simultaneously and benchmark their performance in the status bar.

With this extension, you can compare 2 or 3 search engines at a time. There's a long list of default search engines that you can choose from (including search.cpan.org!). And you can also enter the URLs of search engines that don't appear in the default list yourself.

Currently only the sites' raw HTML pages are shown to the user. We'll add more comprehensive and more intuitive views and graphics for the search results in the near future. Please stay tuned!

This project was initiated and has been regulated by the Yahoo! China ( http://cn.yahoo.com ) company and opensourced under the MIT license.

One of our buzzwords (for extension developers) is that there are 0 lines of XUL/RDF/XML in our project's source tree. The GUI stuff is totally scripted in Perl, thanks to Jesse Vincent's Template::Declare module on CPAN.

You can always get the latest source code of this project from the following SVN repository:

   http://svn.openfoundry.org/searchall/

If you'd like to help, please let us know. We're very willing to hand out commit bits like the Pugs team ;)

The XPI file that can be installed directly into Firefox can also be found here:

   http://svn.openfoundry.org/searchall/trunk/searchall.xpi

There's a XUL application framework named XUL::App sitting in the same repos and SearchAll is already using it. I'd expect to move XUL::App to a separate repos and rename it to a cooler name (maybe Xifty or Xufty?).

Sorry for the lack of documentation. Please see README for some general ideas :)

I've already submitted this extension to addons.mozilla.org and am waiting for the editor's approval.

Enjoy!

Monday October 30, 2006
08:59 AM

Notes for this fortnight (2006-10-18 ~ 2006-10-30)

Oct 18 (to Jack Shen~)

I wrote a UML class diagram generator based on GraphViz. it can parse arbitrary perl OO modules and obtain the inheritance relationships and method/attribute list automatically. it's called UML::Class::Simple. And it's much easier to use than StarUML. you know, dragging the mouse to draw diagrams is really painful. yay for automatic image generation!

(Here is one of the sample outputs: http://svn.berlios.de/svnroot/repos/unisimu/fast.png.)

Oct 18 (to Sal Zhong~)

i'm planning to upload UML::Class::Simple to cpan once it's mature enough. will you test it for me? bug reports and patches are most welcome. :)

it's still undecided how to differentiate perl classes' properties from other ordinary methods. i'm also pondering the idea of adding relationships other than inheritance. i'll be delighted if you have some ideas on these matters.

Note that i'm ignoring the Autodia module on CPAN since i'm not in favor of XML and a quite different approach has been taken in my project. anyway, i have to admit it's wise to talk to Autodia's author and merge these efforts. at last, i must thank Alias for creating PPI and suggesting the use of Class::Inspector. they're invaluable when one wants to extract meta info from the perl world.

Oct 19 (to Jack Shen~)

I've only just finished the slides for the recap. there are already 44 of them and the number is still growing. alas, still wondering what to say in the next talk on the design of methods and subroutines. :(

Oct 19 (to Cherry Chu~)

Thanks. the talk went pretty well. it's interesting to see that i had the feeling just before the talk that you would not come. so i was not very surprised by your absence. no problem, there's always ``the next time''. :)

i've been busy making slides for tomorrow's talk. they're still not finished yet. sigh. have to make more slides during the daytime tomorrow. producing so many slides is quickly getting tedious. hehe, you know that feeling, right? ;-)

Oct 22 (to He Shan~)

> hi! I've found a book. IT is so nice that i have been
> reading about it all the afternoon. it is great, just
> like an extended version of "The Practice of
> Programming". it's named "Code Complete".

I've got the feeling that you are currently on the *right* way. you'll definitely become a good hacker if you keep going. hmm, hopefully you'll join us perl camels soon. ;)

Oct 22 (to Jack Shen~)

...LOL. apparently you are not a VB guy. inserting images into ppt slides is straightforward once you know how to record VBA macros in the PowerPoint environment and browse the generated code in its VB IDE. Another way to get an answer is searching the web. iirc, the method should be AddPicture or something like that. not sure though, computers are out of my reach right now. :(

...Python is even more powerful than MATLAB, Maple, and Haskell? i doubt that. :)

...I was exclusively hacking on the new tokenizer for Makefile::Parser and completely forgot that i had C# classes tonight. anyway, the next major release of M::P takes precedence over any other things. :)

Oct 23 (to Sal Zhong~)

I've just started to rewrite M::P's codebase (which will hopefully be released as M::P 1.00 soon). Yes, it's long overdue. I've had a pretty good plan for a scalable and extensible gmake implementation based on M::P for a long time.

The new M::P API will offer parsing results at two different levels:

  • Makefile DOM tree

    It's a syntax-oriented data structure which preserves every single bit of info in the original makefile (including whitespace and comments). So one can modify some part of the DOM tree, and write the updated makefile back to disk. I think it's useful to some GUI apps which want to edit makefiles via menus and is also beneficial to the gmake => PBS translator.

  • Makefile AST

    The AST desugars the handwaving parts of the DOM tree down to a semantic-oriented data structure for make-like tools to ``run'' it or for some visualizer (e.g. my Makefile::Graphviz) to depict the underlying dependency relations. For the PBS emitter, I think we should work out a special AST for it since the desugaring must be lossless, much like a program correctness proving system.
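
The difference between the two levels can be sketched as follows. This is an illustration only, not Makefile::Parser's design: the token types, `tokenize` and `toRules` are hypothetical, and the grammar is reduced to comments, commands, simple separators and words. The first function is lossless (joining the tokens gives back the input), while the second throws away layout and keeps only what a make-like tool needs.

```javascript
// "DOM" level: split a makefile into tokens without losing a single character.
function tokenize(makefile) {
  // Order matters: commands and comments swallow the rest of their line.
  const re = /(?<command>^\t[^\n]*)|(?<comment>#[^\n]*)|(?<newline>\n)|(?<space>[^\S\n]+)|(?<separator>:=|::|:|=)|(?<word>[^\s:=#]+)/gm;
  const tokens = [];
  for (const m of makefile.matchAll(re)) {
    const type = Object.keys(m.groups).find((k) => m.groups[k] !== undefined);
    tokens.push({ type, text: m[0] });
  }
  return tokens;
}

// "AST" level: keep only rules (targets, prerequisites, commands).
function toRules(tokens) {
  const rules = [];
  let line = [];
  // A trailing newline token flushes a final line that lacks one.
  for (const t of [...tokens, { type: "newline", text: "\n" }]) {
    if (t.type === "command") {
      // A command belongs to the most recent rule; drop the leading tab.
      if (rules.length > 0) rules[rules.length - 1].commands.push(t.text.slice(1));
    } else if (t.type === "newline") {
      const colon = line.findIndex((x) => x.type === "separator" && x.text === ":");
      if (colon > 0) {
        rules.push({
          targets: line.slice(0, colon).map((x) => x.text),
          prerequisites: line.slice(colon + 1).map((x) => x.text),
          commands: [],
        });
      }
      line = [];
    } else if (t.type !== "space" && t.type !== "comment") {
      line.push(t); // layout and comments do not survive desugaring
    }
  }
  return rules;
}

const makefile = "# build\nall: foo.o  bar.o # link\n\tcc -o all foo.o bar.o\n\nCC := gcc\n";
const tokens = tokenize(makefile);
console.log(tokens.map((t) => t.text).join("") === makefile); // true: lossless
console.log(JSON.stringify(toRules(tokens)));
```

An editor would work on the token list and write it back unchanged except for the edited nodes, while a make-like tool or a dependency visualizer only ever needs the output of the second function.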

I'm currently working on the M::P tokenizer and will finish the DOM tree constructor these days. The process should be going pretty fast since it is mostly test-driven.

The first goal is to implement the new M::P APIs and get my pgmake utility to pass most of the gmake tests so that I can kick M::P 1.00 out of the door.

I'm stealing a lot of source code and pod from Alias's PPI module. I've noticed that the basic structure of PDOM trees can also fit my needs very well. it's called MDOM in my M::P though. ;-)

Oct 24 (to Sun Xin~)

Take care. translating may drive you mad some day. just have an appropriate amount of fun, dude!

Oct 26 (to Jack Shen and Sal Zhong~)

my gnu Makefile DOM builder now supports most kinds of rules, 2 flavors of variable assignments, macro interpolations, and various command and comment syntax. Now it's trivial to add new node types and extend the DOM parser.

i'll add support for double-colon rules, the define/vpath/include/ifeq/ifneq/ifdef/ifndef/... directives, and other missing structures tomorrow. After these additions, the DOM parser will be quite complete and will serve as the solid ground that we keep standing on. constructing the Makefile AST will be much easier if we keep a DOM tree handy.

yay for test-driven development! without TDD or Alias' PPI, i wouldn't have progressed so rapidly. ;-)

Oct 29 (to Sal Zhong~)

When and where shall we take the Java exam?

...Oops, it seems impossible to release UML::Class::Simple tonight. still have several missing features to implement and the pod needs love too. hmm, christopher may be unhappy since i earlier made the promise to him that i would make the release by *this* weekend. sigh. hopefully i'll get some cycles tomorrow.

...nod nod. but i also gotta review the data mining textbooks for the coming exam. furthermore, i'm planning to hack on two expert systems in the next week. i'll be programming in Prolog, CLIPS, and Perl simultaneously, which must be a lot of fun! yay! :D

Oct 30 (to Sal Zhong~)

I've just talked to Alias, the author of PPI, on #perl. he said that i could borrow as much source code from PPI as i wanted for my Makefile::DOM module. PPI::Element, PPI::Node, PPI::Token, and PPI::Dumper can be reused by my MDOM directly without many changes. i also briefly introduced the two-level ASTs to him and expressed my appreciation of PPI. It has given me plenty of inspiration on how to push my Makefile::Parser further.

This journal was originally posted as http://agentzh.spaces.live.com/blog/cns!FF3A735632E41548!128.entry