miyagawa's Journal http://use.perl.org/~miyagawa/journal/ miyagawa's use Perl Journal en-us use Perl; is Copyright 1998-2006, Chris Nandor. Stories, comments, journals, and other submissions posted on use Perl; are Copyright their respective owners. 2012-01-25T02:03:59+00:00 pudge pudge@perl.org Technology hourly 1 1970-01-01T00:00+00:00 miyagawa's Journal http://use.perl.org/images/topics/useperl.gif http://use.perl.org/~miyagawa/journal/ DBD::SQLite and Unicode http://use.perl.org/~miyagawa/journal/38770?from=rss <p>Attention, anyone using DBD::SQLite with the $dbh-&gt;{unicode} attribute set to 1.</p><p>This module has <a href="http://rt.cpan.org/Public/Bug/Display.html?id=25371">a long-standing bug</a>: it assumes a passed string's internal encoding is UTF-8 when inserting values into the database, and I'm trying to fix it.</p><p><code><br>use DBI;<br>use Encode;</code></p><p><code>my $utf8_string = "This is \x{30c6}\x{30b9}\x{30c8}"; # "Test" in Japanese<br>my $utf8_bytes = encode_utf8($utf8_string);<br>my $lat1_string = "H\xe9llo World"; # H&#233;llo</code></p><p><code>my $dbh = DBI-&gt;connect("DBI:SQLite:...", ...);<br>$dbh-&gt;{unicode} = 1;</code></p><p><code>my $sth = $dbh-&gt;prepare("INSERT INTO foo (bar) VALUES (?)");</code></p><p><code>$sth-&gt;execute($utf8_string); # (1) Good<br>$sth-&gt;execute($utf8_bytes); # (2) ???<br>$sth-&gt;execute($lat1_string); # (3) ???<br></code></p><p>Current versions of DBD::SQLite (prior to 1.21) assume the given string's INTERNAL encoding is UTF-8 and store the octet stream into the database without calling encode_utf8 or utf8::upgrade, which makes #2 PASS and #3 FAIL (invalid UTF-8 octets in the database). That is not correct.</p><p>My patch solves this: #2 ($utf8_bytes) will now be double-encoded and FAIL, while #3 will PASS with a correct UTF-8 octet stream.</p><p>That #2 FAIL might break your (potentially already broken) app when you try to save UTF-8 encoded strings into the database under the 'unicode' option,
but I believe making it FAIL is the right fix.</p><p><a href="http://svn.ali.as/cpan/trunk/DBD-SQLite/t/rt_25371_asymmetric_unicode.t">http://svn.ali.as/cpan/trunk/DBD-SQLite/t/rt_25371_asymmetric_unicode.t</a> is a failing test by Juerd, and <a href="http://fisheye2.atlassian.com/changelog/cpan/trunk/DBD-SQLite?cs=6077">http://fisheye2.atlassian.com/changelog/cpan/trunk/DBD-SQLite?cs=6077</a> is my patch to fix it. This patch still passes all tests, including 12_unicode.t and 20_blobs.t, and it makes DBD::SQLite's unicode option compatible with what DBD::mysql's mysql_enable_utf8 option does, etc.</p><p>Note that if you REALLY want to save the octet bytes without having them encoded into UTF-8, you can still define the table with a BLOB column type and <a href="http://search.cpan.org/~adamk/DBD-SQLite-1.20/lib/DBD/SQLite.pm#Database_Handle_Attributes">use the 3-arg bind_param as explained in the DBD::SQLite POD</a>. That 'unicode' section remains entirely correct with this patch.</p><p>Give me your input in #dbd-sqlite on irc.perl.org. Testing your app with my patch and reporting back would be highly appreciated too.</p> miyagawa 2009-04-07T23:51:22+00:00 journal CPAN Timeline http://use.perl.org/~miyagawa/journal/38296?from=rss Have a Google account and Perl hacker friends? Try <a href="http://cpan-timeline.bulknews.net/">CPAN Timeline</a> and see what your friends are hacking on. This could be a proof-of-concept app for making CPAN a more social place. Another approach, using an existing social network, is <a href="http://www.facebook.com/apps/application.php?id=5786910451">Leon's Facebook CPAN app</a>.
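Back to the DBD::SQLite post above: the three cases can be reproduced with core Encode alone, no database needed. This is a minimal sketch added for illustration (the variable names mirror the post, and the encode/decode round trip only stands in for roughly what the patched driver's encoding step does):

```perl
use strict;
use warnings;
use Encode qw(encode_utf8 decode_utf8);

# The three cases from the post, using core Encode only.
my $utf8_string = "This is \x{30c6}\x{30b9}\x{30c8}";  # character string ("Test" in Japanese)
my $utf8_bytes  = encode_utf8($utf8_string);           # UTF-8 encoded octets
my $lat1_string = "H\xe9llo World";                    # latin-1 range characters, flag usually off

# (1) characters: encoding yields valid UTF-8 octets that round-trip.
my $stored1 = encode_utf8($utf8_string);

# (2) already octets: encoding again double-encodes, so decoding the
# result does NOT give back the original characters.
my $stored2 = encode_utf8($utf8_bytes);

# (3) characters too, despite the UTF-8 flag being off;
# encode_utf8 upgrades it correctly to UTF-8 octets.
my $stored3 = encode_utf8($lat1_string);

print decode_utf8($stored1) eq $utf8_string ? "1 round-trips\n" : "1 breaks\n";  # round-trips
print decode_utf8($stored2) eq $utf8_string ? "2 round-trips\n" : "2 breaks\n";  # breaks (double encoded)
print decode_utf8($stored3) eq $lat1_string ? "3 round-trips\n" : "3 breaks\n";  # round-trips
```

This is why the patched behavior makes #2 FAIL: once a string already holds octets, encoding it a second time is always wrong.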
miyagawa 2009-01-15T17:57:25+00:00 journal Shibuya.pm Tech Meeting #10 http://use.perl.org/~miyagawa/journal/37942?from=rss <p>Shibuya Perl Mongers tech meeting #10 is packed with <a href="http://shibuya.pm.org/blosxom/techtalks/2008011.html">interesting talks</a> and is scheduled for <a href="http://permatime.com/Asia/Tokyo/2008-11-27/18:30">Thursday, November 27th at 18:30 Japan time</a>. There will be <a href="http://www.ustream.tv/channel/shibuya-perl-mongers">ustream.tv</a> streaming available for all talks.</p><p>I will give a short talk and demo about my new toy <a href="http://code.google.com/p/remedie">remedie</a>. This is my answer to the recent video-sharing and TV-show buzz on the internet, built on my other powerful tool, Plagger. I already can't live without this app, and I hope to talk about it at the YAPCs (Asia, NA, EU) and OSCON.</p> miyagawa 2008-11-25T19:45:57+00:00 journal Hawaii, where YAPC::Asia, YAPC::NA and OSDC.au can meet http://use.perl.org/~miyagawa/journal/36848?from=rss Sorry, I just <a href="http://use.perl.org/~brian_d_foy/journal/36845">couldn't resist</a> :) miyagawa 2008-07-03T20:01:10+00:00 journal Shibuya.pm: tech talks XS night http://use.perl.org/~miyagawa/journal/36718?from=rss <p>Shibuya.pm will have <a href="http://shibuya.pm.org/blosxom/techtalks/200806.html">its 9th technical meeting</a>, and the topic of the meeting is XS. No, I'm not joking: all the talks are somehow about XSUB stuff. Let me quote some talks:</p><p>1. My First XS (hirose31)<br>2. Welcome to Perl5 Internals (Daisuke Maki)<br>3. Inside <a href="http://search.cpan.org/~gfuji/Ruby-0.01/">Ruby.pm</a> (Goro Fuji)<br>4. <a href="http://www.slideshare.net/wakapon/ya2008-lt-asa/">PerlMachine</a> (wakapon)</p><p><a href="http://coderepos.org/share/browser/lang/perl/PerlMachine">PerlMachine</a> is a crazy project: a minimal Linux kernel designed solely to run perl.
He reimplemented most perl built-in functions as XSUBs and linked them directly to the kernel functions he wrote. So I assume it's like a Lisp machine for Perl.</p><p>It is quite exciting that the PerlMachine and Ruby.pm talks are given by students from the University of Tsukuba. Perl has a bright future! :)</p><p>I can't make it to the meeting because I'm back here in San Francisco (so this is the first Shibuya.pm meeting I can't attend), but I'm sure there will be a ustream. Looking forward to it!</p> miyagawa 2008-06-19T02:21:30+00:00 journal Big thanks to the Pauleys as well http://use.perl.org/~miyagawa/journal/36460?from=rss <p>Everyone at YAPC::Asia recognizes Dan Kogai as a big contributor to the organizers because he offers his great condo (a.k.a. Hotel DAN) to hackers from overseas and also opens the place to hackathoners. This year Larry and Gloria Wall, jrockway, Yuval, clkao, ingydotnet and Jesse stayed at his place.</p><p>I should now mention that I equally thank Marty and Karen Pauley for hosting Jose (cog), Casey West and Michael Schwern. Karen also demonstrated her awesome typing skills by transcribing Larry's and Schwern's keynotes. I believe that really helped the mostly Japanese audience grok what they were talking about.</p><p>Thank you.</p> miyagawa 2008-05-18T13:26:41+00:00 journal My 20 modules talk at YAPC::Asia http://use.perl.org/~miyagawa/journal/36444?from=rss <p>I only started writing my slides at midnight before the conference (which is very usual for me) and trimmed the number of modules I talk about down to 10.
I think the talk went really well, especially my evil script to authenticate against the wireless access point using WWW::Mechanize :)</p><p><a href="http://www.slideshare.net/miyagawa/20-modules-i-havent-yet-talked-about/">http://www.slideshare.net/miyagawa/20-modules-i-havent-yet-talked-about/</a></p> miyagawa 2008-05-16T15:10:49+00:00 journal YAPC::Asia 2008 live stream http://use.perl.org/~miyagawa/journal/36421?from=rss <p>YAPC::Asia 2008 has just started and is in full swing.</p><p>For those who cannot make it, we have a stream available at <a href="http://live.yapcasia.org/">http://live.yapcasia.org/</a> thanks to our streaming sponsors.</p><p>Enjoy!</p> miyagawa 2008-05-15T04:41:29+00:00 journal Looking for YAPC::Asia 2009 organizers http://use.perl.org/~miyagawa/journal/36338?from=rss <p><a href="http://yapcasia.org/">YAPC::Asia</a> has become huge. This year we've got 550 registrations, and I think this is one of the biggest YAPCs ever.</p><p>However, our organization team has been getting smaller year by year, maybe because we knew we could pull it off. I live in San Francisco, USA and have organized the conference remotely for the past two years, just like any other project manager would. That means the really tough work has been done by our staff in Tokyo (I think it's too early to name and thank them since the conference hasn't even started).</p><p>That said, I'd like to hand off this conference to someone else, since I have no plans to move back to Japan anytime soon.</p><p>We're looking for organizers for YAPC::Asia 2009. It could be other individuals or, preferably, a Perl Mongers group in Asia. We've been writing down what's needed to run this conference and have already started a post-mortem document so that the conference can be even better next year. I can still help with configuring the Act site, doing public relations with guests (if needed), and other things
that I can still do remotely.</p><p>Drop me an email (miyagawa at gmail.com) if you're interested. Hopefully we can make the announcement at the closing ceremony of this YAPC::Asia (May 15-16), and people will stop asking me "When and where is YAPC::Asia 2009?" :)</p> miyagawa 2008-05-06T15:02:54+00:00 journal YAPC::Asia 2008 schedule is out http://use.perl.org/~miyagawa/journal/36306?from=rss <p>The schedule of YAPC::Asia 2008 is <a href="http://conferences.yapcasia.org/ya2008/schedule">now out</a>. We have plenty of interesting talks across 3 tracks on 2 full days. Pretty exciting.</p><p>We also opened the call for Lightning Talks today. <a href="http://conferences.yapcasia.org/ya2008/newtalk">Propose yours</a> and speak for 5 minutes about whatever you want.</p> miyagawa 2008-05-01T19:08:12+00:00 journal YAPC::Asia 2008 talks announced http://use.perl.org/~miyagawa/journal/36208?from=rss <p>Based on voting by attendees, we have decided the 2nd round of accepted talks. Now we've got 53 talks, and they all look so interesting!
Go check the list on <a href="http://conferences.yapcasia.org/ya2008/schedule">the schedule page</a>.</p><p>We'll announce the program next week with a "Personalized Schedule" feature built on top of Act, hopefully during this weekend's hackathon!</p> miyagawa 2008-04-21T19:40:32+00:00 journal Act Hackathon planned next week http://use.perl.org/~miyagawa/journal/36162?from=rss <p>The <a href="http://yapcasia.org/">YAPC::Asia 2008</a> organizers would like to thank Eric Cholet, the author of <a href="http://act.mongueurs.net/">Act</a>, for the great conference-organizing software that powers most YAPCs and Perl workshops.</p><p>To show our appreciation in the hacker's way, I'm flying to <a href="http://www.dopplr.com/trip/miyagawa/167494">Paris, France</a> next weekend (April 25-28), funded by YAPC::Asia's potential profit, to work on Act feature enhancements.</p><p>We plan to work on these things because we want them for YAPC::Asia:</p><p>* OpenID provider support<br>* Better Japanese name display (i18n)<br>* Embedding videos and slides (YouTube, Google Video, Slideshare etc.) in talks<br>* Personal scheduling (who is attending which talks) like <a href="http://sched.org/">Sched.org</a> or <a href="http://github.com/rabble/icalico/tree/master">icalico</a><br>* Online check-in API (who actually showed up, and when)<br>* Promotional codes / coupons for discounted payments</p><p>We (or at least I) prioritize implementing these because the trip is funded by YAPC::Asia, but if there's anything else you think is missing from Act, I'd love to hear about it.
Remote participation (#act on irc.perl.org during the weekend) would be welcome too!</p> miyagawa 2008-04-15T23:00:34+00:00 journal YAPC::Asia 2008 talks announced http://use.perl.org/~miyagawa/journal/35949?from=rss <p>The <a href="http://yapcasia.org/">YAPC::Asia 2008</a> website got a redesign, along with the announcement of sponsors and the initial set of talks (currently 33 talks, and more to come!).</p><p>We have Larry Wall and Michael Schwern as keynote speakers this year. Tickets go on sale on Tuesday, March 25th local time. It's a YAPC::Asia tradition that the 300 tickets sell out within a week, so don't miss it.</p> miyagawa 2008-03-21T05:12:25+00:00 journal Three levels of Perl/Unicode understanding http://use.perl.org/~miyagawa/journal/35700?from=rss <p>(Editorial: Don't frontpage this post, editors. I'm writing it down here to summarize my thoughts, wanting feedback from my trusted readers and NOT flame wars or another giant thread of utf-8 flag woes.)</p><p>I can finally say I fully grok Unicode, the UTF-8 flag and all that stuff in Perl. Here is some analysis of how Perl programmers understand Unicode and the UTF-8 flag.</p><p>(This post might need more code to demonstrate and visualize what I'm talking about, but I'll leave that as homework for readers, or at least something for me to do before YAPC::Asia if there's demand for this talk :))</p><p><strong>Level 1. "Take that annoying flag off, dude!"</strong></p><p>They, typically web application developers, assume all data is encoded in utf-8. If they encounter some wacky garbled characters (a.k.a. mojibake in Japanese), which they think are a perl bug, they just make an ad-hoc call of:</p><blockquote><div><p> <tt>Encode::_utf8_off($stuff)</tt></p></div> </blockquote><p>to take the utf-8 flag off and keep all data in utf-8 by avoiding any possible latin-1 to utf-8 auto-upgrades.</p><p>This is Level 1.
Unfortunately, this works okay as long as their data is actually encoded only in utf-8 (the database is utf-8, web pages are displayed in utf-8, the data sent from browsers is utf-8, etc.). Their app is still broken when they call things like length(), substr() or regular expressions, because the strings are not UTF-8 flagged and those functions don't work with Unicode semantics.</p><p>They can optionally use "use encoding 'utf-8'" or the CPAN module <a href="http://search.cpan.org/~audreyt/encoding-warnings">encoding::warnings</a> to avoid auto-upgrades entirely, or to catch such mistakes, or use <a href="http://search.cpan.org/~taniguchi/Unicode-RecursiveDowngrade-0.03/">Unicode::RecursiveDowngrade</a> to turn off the UTF-8 flag on a complex data structure.</p><p><strong>Level 2. "Unicode strings have UTF-8 flags. That's easy"</strong></p><p>They make extensive use of the Encode module's encode() and decode() to make sure all data in their app is UTF-8 flagged. Their app works really nicely with Unicode semantics.</p><p>They sometimes need to deal with UTF-8 bytes in addition to UTF-8 flagged strings. In that case, they use hacky modules like <a href="http://search.cpan.org/search?query=forceutf8&amp;mode=all">ForceUTF8</a>, or do things like</p><blockquote><div><p> <tt>utf8::encode($_) if utf8::is_utf8($_)</tt></p></div> </blockquote><p>assuming that "Unicode strings should have the UTF-8 flag, and strings without the flag are UTF-8 bytes."</p><p>This is Level 2. It is a straight upgrade from Level 1 and fixes some of Level 1's issues (string functions not working with Unicode semantics, etc.), but it's still too UTF-8 centric. They ignore why perl 5 treats strings this way, and still hate the SV auto-upgrade.</p><p>To be honest, I thought this way until early 2007.
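To make the Level 2 pitfall concrete, here is a minimal self-contained sketch (core Perl only, not taken from any of the modules mentioned) showing how the utf8::encode($_) if utf8::is_utf8($_) idiom makes two strings with identical characters drift apart:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Two strings with identical CHARACTERS but different internal flags:
my $unflagged = "caf\xe9";                    # latin-1 internal representation, flag off
my $flagged   = decode('latin-1', "caf\xe9"); # same characters, UTF-8 flag on

print $unflagged eq $flagged ? "equal before\n" : "different before\n";  # equal

# The Level 2 idiom: "if it's flagged, serialize it to UTF-8 bytes"
for ($unflagged, $flagged) {
    utf8::encode($_) if utf8::is_utf8($_);
}

# Now the flagged copy became the 5 octets "caf\xc3\xa9" while the
# unflagged copy is untouched -- equal strings drifted apart.
print $unflagged eq $flagged ? "equal after\n" : "different after\n";    # different
print length($unflagged), " vs ", length($flagged), "\n";                # 4 vs 5
```

The flag was never part of the string's meaning, so branching on it turns one logical string into two different byte sequences.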
A couple of my modules on CPAN accept both UTF-8 flagged strings and UTF-8 bytes, because I thought it would be handy; but that actually breaks latin-1 strings if they're not utf-8 flagged, which is rare in UTF-8 centric web application development, but can still happen.</p><p>I gradually changed my mind when I talked with <a href="http://search.cpan.org/~mlehmann/">Marc Lehmann</a> about how JSON::Syck's Unicode support is broken, and when I read the tutorial by, and attended the YAPC::EU Perl Unicode talk of, <a href="http://search.cpan.org/~juerd/">Juerd Waalboer</a>.</p><p><b>Level 3. "Don't bother with the UTF-8 flag"</b></p><p>They stop guessing whether a variable is UTF-8 flagged. All they need to know is whether a string holds bytes or characters, which they can tell by checking how the scalar variable was generated.</p><p>If it's bytes, use decode() to get a Unicode string. If it's characters, don't worry about whether it's UTF-8 flagged: if it's not flagged, it will be auto-upgraded thanks to Perl, so you don't need to know the internal representation.</p><p>So it's like a step back from Level 2: "Get back to basics, and think about why Perl 5 does this latin-1 to utf-8 auto-upgrade."</p><p>If your function or module needs to accept strings that might be either characters or bytes, just provide 2 different functions, or a flag to set explicitly. Don't auto-decode bytes as utf-8, because that breaks latin-1 characters if they're not utf-8 flagged. Of course the caller of the module can call utf8::upgrade() to make sure, but that's just a pain and an anti-perl5 way.</p><p>There's still a remaining problem with CPAN modules, though. Some modules return UTF-8 flagged strings on some occasions and not on others. For instance, $c-&gt;req-&gt;param($foo) returns a UTF-8 flagged string if Catalyst::Plugin::Unicode is loaded, and bytes otherwise. And using utf8::is_utf8($_) here might cause bugs like those described before.</p><p>Well, in the C::P::Unicode example, actually not.
Using C::P::Unicode guarantees that the parameters are all utf-8 flagged, even if they contain only latin-1 range characters. Not using the plugin guarantees the parameters are not flagged at all. So it's a different story.</p><p>(To be continued...)</p> miyagawa 2008-02-20T10:28:27+00:00 journal Submit your talks to YAPC::Asia 2008 http://use.perl.org/~miyagawa/journal/35688?from=rss The YAPC::Asia 2008 proposal deadline is 2/25, one week away. <a href="http://conferences.yapcasia.org/ya2008/">Submit your talk</a> now. We welcome JavaScript-related talks as well as anything Perl. miyagawa 2008-02-18T22:14:55+00:00 journal OSCON talk http://use.perl.org/~miyagawa/journal/35556?from=rss <p>I'm wondering what talk I should submit to OSCON (and the other YAPCs this year too!).</p><p>The obvious choice is Web::Scraper, since I haven't given this talk outside Europe and Japan, and I can make lots of updates before summer, when I give the actual talk (we call it CDD -- Conference Driven Development).</p><p>Any suggestions?</p> miyagawa 2008-02-01T23:11:12+00:00 journal URI::Find::UTF8 -- Fun with Safari users http://use.perl.org/~miyagawa/journal/35428?from=rss <p><a href="http://search.cpan.org/dist/URI-Find/">URI-Find</a> is a great module for extracting URIs from arbitrary text, but unfortunately it doesn't work with the non-ASCII URLs we often encounter when chatting with Safari users, such as: http://ja.wikipedia.org/wiki/&#12513;&#12452;&#12531;&#12506;&#12540;&#12472;</p><p>The reason Safari users sometimes do this is that Safari shows the URI-decoded path in its location bar.</p><p>I hacked up and uploaded a URI::Find extension (subclass), <a href="http://search.cpan.org/dist/URI-Find-UTF8">URI::Find::UTF8</a>, which is a drop-in replacement for URI::Find that extracts URLs like this.</p><p>We have a <a href="http://svn.coderepos.org/share/lang/perl/URI-Find-UTF8/trunk/">subversion repository</a> too, if you want to take a look, find a bug and patch the code.</p> miyagawa
2008-01-19T00:44:50+00:00 journal ActiveSupport equivalent for Perl http://use.perl.org/~miyagawa/journal/35396?from=rss <p><strong>UPDATE:</strong> The module was originally written using constant overloading, but that is a dangerous and gross hack, so I changed it to use the autobox framework instead (I wonder why I didn't try that first!). I have updated the post accordingly.</p><p>Rails has ActiveSupport, which adds funky methods to Ruby core objects to do fancy things like <a href="http://api.rubyonrails.org/classes/ActiveSupport/CoreExtensions/Numeric/Time.html">2.months.ago to get a Time duration object</a> etc.</p><p>I found it pretty interesting and wondered if it's doable in Perl. Yes it is, using the <a href="http://search.cpan.org/~chocolate/autobox-1.22/">autobox framework</a>, which I hope will be in core in perl 5.12, or using constant overloading like bigint.pm does.</p><p>So here you are: <a href="http://search.cpan.org/dist/autobox-DateTime-Duration/">autobox::DateTime::Duration on CPAN</a>, and the <a href="http://svn.coderepos.org/share/lang/perl/autobox-DateTime-Duration/trunk/">SVN repository</a> if you can't wait for the CPAN mirrors to update. With this you can say:</p><p><code><br>use autobox;<br>use autobox::DateTime::Duration;</code></p><p><code>print 1-&gt;day-&gt;ago, "\n"; # 2008-01-14T23:25:53<br>print 2-&gt;minutes-&gt;from_now, "\n"; # 2008-01-15T23:28:20<br></code></p><p>and use all the methods implemented in ActiveSupport::CoreExt::Numeric::Time, including the crazy <i>fortnight</i> method.
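For the curious, the duration arithmetic behind 3-&gt;hours + 2-&gt;minutes can be sketched in plain Perl with operator overloading. Duration::Lite and the hours()/minutes() helpers below are hypothetical names invented for this sketch, not part of autobox::DateTime::Duration:

```perl
use strict;
use warnings;

package Duration::Lite;
use overload '+' => \&add, '""' => \&stringify;

sub new       { my ($class, $secs) = @_; bless { secs => $secs }, $class }
sub add       { Duration::Lite->new($_[0]{secs} + $_[1]{secs}) }  # duration + duration
sub seconds   { $_[0]{secs} }
sub stringify { $_[0]{secs} . "s" }

package main;

# Constructor helpers standing in for autobox's 3->hours style
sub hours   { Duration::Lite->new($_[0] * 3600) }
sub minutes { Duration::Lite->new($_[0] * 60) }

my $dur = hours(3) + minutes(2);
print $dur->seconds, "\n";   # 10920
```

autobox's trick is just moving the hours()/minutes() constructors onto the integers themselves; the overloaded '+' is what makes the durations compose.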
Since it's a standard DateTime::Duration object, you can also say this to save some typing:</p><p><code><br>my $now = DateTime-&gt;now;<br>my $dur = 3-&gt;hours + 2-&gt;minutes;<br>$now-&gt;add_duration($dur);<br></code></p><p>This might be a fun gift for <a href="http://use.perl.org/~autarch/journal/35394">DateTime's 5th birthday</a> :)</p> miyagawa 2008-01-15T23:27:23+00:00 journal Honolulu.pm http://use.perl.org/~miyagawa/journal/35358?from=rss <p>My friend Toru Hisai, who has joined us at many Shibuya.pm tech meetings in Tokyo, recently moved to Honolulu, Hawaii, and he's now trying to start a local Perl user group there: <a href="http://honolulu.pm.org/">Honolulu.pm</a>. Hawaii.pm appears to have been around for a really long time, but it turns out the website is way outdated and the contact on the site is bouncing, so I suggested he start his own.</p><p>This might be a significant step for us towards YAPC::Hawaii? Hint, hint.</p> miyagawa 2008-01-11T04:50:44+00:00 journal SF.pm lightning talk http://use.perl.org/~miyagawa/journal/34992?from=rss So I went down to the <a href="http://sf.pm.org/">SF.pm</a> meeting and gave two lightning talks, about <a href="http://www.slideshare.net/miyagawa/webscraper-for-sfpm-lt/">Web::Scraper</a> and <a href="http://www.slideshare.net/takesako/shibuyapm8">takesako-san's neat IMG tag hackery</a>. These talks went well, and the other talks were interesting too. Photos are uploaded to <a href="http://www.flickr.com/photos/bulknews/tags/sfpm/">Flickr tagged sf.pm</a>.
miyagawa 2007-11-28T08:15:30+00:00 journal Web::Scraper talk in SF.pm lightning talks 11/27 http://use.perl.org/~miyagawa/journal/34982?from=rss <p>I'm going to give a brief 5-minute talk about Web::Scraper at the <a href="http://sf.pm.org/weblog/">SF.pm</a> meeting tomorrow night (11/27 7pm, SOMA).</p><p>It appears that you need to be a member of the SF.pm mailing list to attend the meeting due to venue policy etc., but if you want to join, let me know so I can talk to the organizer!</p> miyagawa 2007-11-26T22:26:48+00:00 journal Web::Scraper (HTML::TreeBuilder::XPath) slowdown on Fedora http://use.perl.org/~miyagawa/journal/34970?from=rss <p>Today I had an interesting report from a Web::Scraper user, saying that he has a script that runs really quickly (less than 1 sec) on a Macbook but really slowly (50 secs) on an AMD dual-CPU machine. Here's the dprof report:</p><p><code><br>Total Elapsed Time = 47.32165 Seconds<br> &nbsp; &nbsp; User+System Time = 31.07165 Seconds<br>Exclusive Times<br>%Time ExclSec CumulS #Calls sec/call Csec/c Name<br> &nbsp; 51.6 16.03 16.033 6922 0.0023 0.0023 XML::XPathEngine::NodeSet::new<br> &nbsp; 13.5 4.208 4.208 1777 0.0024 0.0024 XML::XPathEngine::Boolean::True<br> &nbsp; 13.0 4.048 4.048 1723 0.0023 0.0023 XML::XPathEngine::Literal::new<br> &nbsp; 11.3 3.518 3.518 1666 0.0021 0.0021 XML::XPathEngine::Boolean::False<br></code></p><p>We initially thought it was due to some XS module library issue with dual CPUs, but it turned out he was using the perl that comes with Fedora, and the rpm version he uses is 5.8.8-10.</p><p>As <a href="https://bugzilla.redhat.com/show_bug.cgi?id=196836">addressed in the RH/Fedora bugzilla</a>, perl 5.8.8 rpms prior to 5.8.8-22 carry a nasty patch that makes every new() (or bless) call in classes with overloaded methods really slow. HTML::TreeBuilder::XPath (hence Web::Scraper) creates a lot of Nodes for HTML pages, and XML::XPathEngine::NodeSet definitely has an overloaded function.</p><p>So this is really due to the Fedora perl patch.
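The slow path is easy to probe in miniature with the core Benchmark module. This is a hedged sketch (the class names are invented here): on an affected rpm, constructing objects of a class with any overloaded operator should be dramatically slower than for a plain class, while on a healthy perl the two rates stay close:

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

package PlainNode;
sub new { bless { id => $_[1] }, $_[0] }

package OverloadedNode;
# Any overloaded op marks the class's stash; constructing instances is
# what hits the slow path on the broken Fedora perl 5.8.8 rpms (< -22).
use overload '""' => sub { "node:" . $_[0]{id} }, fallback => 1;
sub new { bless { id => $_[1] }, $_[0] }

package main;

# Compare construction rates of the two classes.
cmpthese(50_000, {
    plain      => sub { PlainNode->new(42) },
    overloaded => sub { OverloadedNode->new(42) },
});
```

This mirrors the profile above: XML::XPathEngine::NodeSet is exactly the "OverloadedNode" case, constructed thousands of times per page.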
If you run into the same issue on Fedora, check your rpm version and upgrade to the latest, or build your own perl, which is always a good thing.</p> miyagawa 2007-11-25T23:37:13+00:00 journal Web::Scraper recipe: download subtitles from wikisubtitles http://use.perl.org/~miyagawa/journal/34932?from=rss <p>This extracts subtitle links from <a href="http://wikisubtitles.net/">WikiSubtitles</a> Ajax episode pages.</p><blockquote><div><p> <tt>#!/usr/bin/perl<br>use strict;<br>use Web::Scraper;<br>use URI;<br> &nbsp; <br>my $uri = URI-&gt;new("http://wikisubtitles.net/ajax_loadShow.php?show=65&amp;season=3");<br>my $scraper = scraper {<br>&nbsp; &nbsp; process '//td[@class="idioma"][text()=~"English \(US\)"]/..//a', 'links[]' =&gt; '@href';<br>};<br>my $result = $scraper-&gt;scrape($uri);</tt></p></div> </blockquote><p>You can paste the URLs into Speed Download and you're all set!</p> miyagawa 2007-11-19T18:23:36+00:00 journal Web::Scraper now has nth-child(N) support in CSS selectors http://use.perl.org/~miyagawa/journal/34874?from=rss <p>Thanks to <a href="http://search.cpan.org/~tokuhirom">tokuhirom</a>, HTML::Selector::XPath now supports the nth-child(N) CSS selector, so Web::Scraper can make use of it as well.</p><p>The new release, 0.03, is going out to CPAN mirrors shortly.</p> miyagawa 2007-11-11T04:48:05+00:00 journal Tagging CPAN changes http://use.perl.org/~miyagawa/journal/34850?from=rss <p><strong>Question:</strong> Is it possible to annotate/tag each CPAN module update so that we can figure out whether the update contains a "security fix", a "minor bug fix", a "major API change" etc.?</p><p><strong>Context:</strong> At <a href="http://www.sixapart.com/">work</a> we have a repository of third-party CPAN modules that we use on Vox and TypePad. Once a module is added to the list, we manually follow the changes to each module to figure out whether we need to upgrade (e.g. fixes for major bugs, security issues, memory leaks)
or not to upgrade (e.g. backward-incompatible API changes).</p><p>It generally works well, but sometimes we upgrade a module without knowing that it might break our code. In that case we take a look at how hard it is to update our code to follow the module change, and if it's not that easy, we simply revert the upgrade.</p><p>So I think it would be nice if we could automatically, or even semi-automatically, know, given module XXX-YYY version M to N, what kind of changes the upgrade contains, without manually reading Changes and diffing the source code. Note that I'm not saying these audit processes are worthless, but knowing the amount of change an upgrade introduces makes the work a bit easier.</p><p>Here are two possible solutions:</p><p>1) Have a rough standard for indicating these "minor bug fix", "security fix" or "major API change" types of thing in the Changes file.</p><p>I know CPAN is not a place where we can force all module authors to follow one giant "standard", but we already have some standardization in CPAN module versioning: if a release is a developer release that "normal" users shouldn't upgrade to, we add "_" to the version number so the CPAN ecosystem ignores it. Could we introduce more things like this, to tag each module update?</p><p>I realize it's not easy, because most authors write their Changes file in a free-text format. Some authors use more structured formats like <a href="http://search.cpan.org/src/INGY/YAML-0.66/Changes">YAML</a>, <a href="http://search.cpan.org/~timb/DBI-1.601/Changes">POD</a> or <a href="http://search.cpan.org/src/ASCOPE/Net-Flickr-Backup-2.99/Changes">n3/RDF(!)</a>, but I myself don't like doing that. Hm, maybe YAML is acceptable.</p><p>Anyway, if that doesn't sound realistic, I have another solution in mind: 2) have a Wiki/del.icio.us-like website where anyone can tag any module release.
It might be a more Web 2.0 way to accomplish the original purpose :)</p><p>We'd probably want to integrate the user authentication with PAUSE/BitCard so that we can say "this release is tagged 'minor bug fix' by the author."</p><p>Thoughts?</p> miyagawa 2007-11-07T06:49:06+00:00 journal Web::Scraper hacks #3: Read your browser's cookies http://use.perl.org/~miyagawa/journal/34754?from=rss <p>Some websites require you to log in with your credentials to view their content. That's easily scriptable with WWW::Mechanize, but if you visit the site frequently with your browser, why not reuse the browser's cookies, so you don't need to script the login process?</p><p>Web::Scraper allows you to call methods on, or entirely swap out, its UserAgent object when it scrapes a website. Here's how:</p><blockquote><div><p> <tt>use Web::Scraper;<br>use HTTP::Cookies::Guess;<br> &nbsp; <br>my $cookie_jar = HTTP::Cookies::Guess-&gt;create(file =&gt; "/home/miyagawa/.mozilla/cookies.txt");<br>my $s = scraper { };<br>$s-&gt;user_agent-&gt;cookie_jar($cookie_jar);<br>$s-&gt;scrape($uri);</tt></p></div> </blockquote><p>This snippet uses <a href="http://search.cpan.org/~yappo/HTTP-Cookies-Guess-0.01/">HTTP::Cookies::Guess</a>, which provides a common API for reading browsers' cookie files (the module supports IE, Firefox, Safari and w3m), and sets the cookie jar on the UserAgent object.</p><p>If you'd like to change the behavior globally, you can also do:</p><blockquote><div><p> <tt>$Web::Scraper::UserAgent-&gt;cookie_jar($cookie_jar);</tt></p></div> </blockquote><p>Either way, you avoid hard-coding your username and password in the scraping script, which is a huge win.</p> miyagawa 2007-10-25T19:18:43+00:00 journal Better CPAN RSS feed http://use.perl.org/~miyagawa/journal/34709?from=rss <p><a href="http://search.cpan.org/">search.cpan.org</a> has an RSS feed for <a href="http://search.cpan.org/recent">recently uploaded modules</a>, but there's only
one minor problem: the feed doesn't have rich metadata.</p><p>Daisuke Murase (aka <a href="http://search.cpan.org/~typester/">typester</a> on CPAN and IRC) created a site called <a href="http://unknownplace.org/cpanrecent/">CPAN Recent Changes</a> a while ago, and it's been really useful for tracking activity on CPAN.</p><p>The feature the site provides is very simple: a better recent change log for CPAN. The site tracks recently uploaded modules from search.cpan.org, grabs the Changes file and diffs it against the previous version, so you can see what changed in the release (unless the author is too lazy to update the Changes file). The site of course publishes <a href="http://unknownplace.org/cpanrecent/rss">an RSS feed</a> of the recent uploads to CPAN, with the changes in the summary field, so you can keep an eye on it without clicking through to see what changed.</p><p>You can also follow changes in different views, like modules under specific namespaces (e.g. <a href="http://unknownplace.org/cpanrecent/Catalyst">Catalyst</a> or <a href="http://unknownplace.org/cpanrecent/DBIx-Class">DBIx-Class</a>) or modules uploaded by specific authors (e.g. <a href="http://unknownplace.org/cpanrecent/author/miyagawa">me</a> or <a href="http://unknownplace.org/cpanrecent/author/ingy">Ingy</a>), and they all come with RSS feeds too.</p> miyagawa 2007-10-18T01:38:44+00:00 journal YAPC::Hawaii http://use.perl.org/~miyagawa/journal/34672?from=rss <p>I've been dreaming (with a couple of folks like clkao) about having a YAPC in Hawaii. Hawaii is a great place for everyone to come to, from the west and mid coasts of the USA, East Asia (Japan, Taiwan) and Oceania (Australia, NZ). It's going to be a great place for attendees to bring their wives and GFs. The conference would begin early in the morning and finish around 3pm so we can get to the beach.</p><p>Since Hawaii is not part of the North American continent, it shouldn't be YAPC::NA.
YAPC::Pacific.</p><p>I don't know where to start. Are there any Perl mongers in Hawaii? <a href="http://hawaii.pm.org/">Hawaii.pm.org</a> exists, but the site seems way outdated.</p> miyagawa 2007-10-13T19:59:43+00:00 journal Regexp::Debug? http://use.perl.org/~miyagawa/journal/34663?from=rss <p>Lazyweb,</p><p>Is there a module to debug your regular expressions, comparing the target string against an input regular expression byte by byte? It'd be useful when you have existing code that does a pattern match against a big chunk of string and you don't know why it doesn't match.</p><blockquote><div><p> <tt>use Regexp::Debug;<br> &nbsp; <br>my $string = "abcdefg";<br>my $regexp = qr/abcefg/; # Notice 'd' is missing<br> &nbsp; <br>my $result = Regexp::Debug-&gt;compare($string, $regexp);<br> &nbsp; <br># $result would be an object or a string to<br># indicate that the regexp stopped matching at 'abc'</tt></p></div> </blockquote><p>It's something we regularly do when a regular-expression-based screen scraping tool (BAD! Use Web::Scraper instead!) stops working. I open up two terminal screens, one with the HTML output and one with the regular expression. In the worst case I split the regular expression in a binary-search fashion to find where it's broken.</p> miyagawa 2007-10-12T19:15:54+00:00 journal Web::Scraper with filters, and thoughts about text filters http://use.perl.org/~miyagawa/journal/34607?from=rss <p><a href="http://search.cpan.org/dist/Web-Scraper-0.21_01/">A developer release of Web::Scraper</a> has been pushed to CPAN, with "filters" support.
Let me explain briefly why this filters feature is useful.</p><p>Since an early version, Web::Scraper has had a pretty neat callback mechanism, so you can extract "data" out of HTML, not just strings.</p><p>For instance, if you have HTML like</p><blockquote><div><p> <tt>&lt;span class="entry-date"&gt;2007-10-04T01:09:44-0800&lt;/span&gt;</tt></p></div> </blockquote><p>you can get the DateTime object that the string represents, like:</p><blockquote><div><p> <tt>&nbsp; process ".entry-date", "date" =&gt; sub {<br>&nbsp; &nbsp; DateTime::Format::W3CDTF-&gt;parse_datetime(shift-&gt;as_text);<br>&nbsp; };</tt></p></div> </blockquote><p>and with 'filters' you can make this reusable and stackable, like:</p><blockquote><div><p> <tt>package Web::Scraper::Filter::W3CDTFDate;<br>use base qw( Web::Scraper::Filter );<br>use DateTime::Format::W3CDTF;<br> &nbsp; <br>sub filter {<br>&nbsp; &nbsp; DateTime::Format::W3CDTF-&gt;parse_datetime($_[1]);<br>}<br>1;</tt></p></div> </blockquote><p>and then:</p><blockquote><div><p> <tt>&nbsp; process ".entry-date", date =&gt; [ 'TEXT', 'W3CDTFDate' ];</tt></p></div> </blockquote><p>If the .entry-date text contains erroneous spaces, you can do:</p><blockquote><div><p> <tt>&nbsp; process ".entry-date", date =&gt; [ 'TEXT', sub { s/^ *| *$//g }, 'W3CDTFDate' ];</tt></p></div> </blockquote><p>This shows how powerful the Web::Scraper filter mechanism can be. It's stackable, extensible, reusable (by making a filter a module) and also scriptable with inline callbacks.</p><p>So the next step would be to add a bunch of Web::Scraper::Filter::* modules. I think I'll create a separate distribution, Web::Scraper::Filters, and give everyone commit access so you can add your own text filters and share them.</p><p>However, I have another, more ideal solution in mind.</p><p>The problem is: there are already lots of text filters on CPAN.
URI::Escape, HTML::Entities, MIME::Base64, Crypt::CBC, LOLCatz, Kwiki::Formatter... to name a few.</p><p>And there are also text processing frameworks that have a filter mechanism: Template-Toolkit, Web::Scraper, Plagger, Kwiki, Test::Base... to name a few. Obviously the number of combinations of text filters and text processing systems explodes.</p><p>For instance, TT has a gazillion Template::Filter plugins on CPAN that are only useful for TT. If you want to use such a text filter in another text processing system (e.g. Web::Scraper, Kwiki, Plagger etc.), you need to port it, or in other words, write an adapter interface for <i>each</i> individual text filter engine.</p><p>Doesn't this suck?</p><p>I want a common text filter API that takes its input as a string and returns its output as a string. Complex filters like a wiki-to-HTML engine might additionally need a configuration option.</p><blockquote><div><p> <tt>use Text::Filter::Common;<br>my $filter = Text::Filter::Common-&gt;new($name, $config);<br>my $output = $filter-&gt;filter($input, $option);</tt></p></div> </blockquote><p>So Text::Filter::Common is a factory module, and each text filter is a subclass of Text::Filter::Common::Base or something, implementing a <code>filter</code> method that probably uses <code>$self-&gt;config</code> to configure the filter object.</p><p>Then we can write adapter interfaces for existing text filter mechanisms like Web::Scraper or Template::Toolkit, and we can avoid the duplicated effort of re-porting each text filter to a bunch of different modules.</p><p>Looks like the Text::Filter namespace is already taken, and even though it seems close to what I want, it supports both read and write, which is more than I want.</p><p>Thoughts?</p> miyagawa 2007-10-04T08:20:33+00:00 journal
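The proposed Text::Filter::Common API from the last post can be sketched in a few lines of Perl. This is purely hypothetical: none of these packages exist on CPAN under these names, and the factory, base class, and the Uppercase example filter are all made up to illustrate the string-in/string-out design the post describes.

```perl
#!/usr/bin/perl
# Hypothetical sketch of the Text::Filter::Common idea: a factory module,
# a base class each filter subclasses, and one trivial example filter.
use strict;
use warnings;

package Text::Filter::Common::Base;
sub new {
    my ($class, $config) = @_;
    return bless { config => $config || {} }, $class;
}
sub config { $_[0]->{config} }           # filters read their options here
sub filter { die "subclass must implement filter()" }

package Text::Filter::Common;
sub new {
    my ($class, $name, $config) = @_;
    my $pkg = "Text::Filter::Common::$name";
    # A real implementation would require() the filter module on demand;
    # here the example filter is defined inline in the same file.
    return $pkg->new($config);
}

# An example filter: takes a string, returns a string, per the proposed API.
package Text::Filter::Common::Uppercase;
our @ISA = ('Text::Filter::Common::Base');
sub filter {
    my ($self, $input) = @_;
    return uc $input;
}

package main;
my $filter = Text::Filter::Common->new('Uppercase');
print $filter->filter("hello, cpan\n");   # prints "HELLO, CPAN"
```

An adapter for, say, Web::Scraper would then only need to wrap `$filter->filter($text)` once, instead of re-porting every individual filter.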