

Journal of schwern (1528)

Sunday December 13, 2009
02:36 AM

MSCHWERN has a PAUSEID

ANDRE has a posse.

MSCHWERN has a PAUSEID.

Do you?

Saturday December 12, 2009
05:02 PM

gitPAN and the PAUSE index

As you may or may not know, people on CPAN own modules (technically they own the namespace). Each Foo::Bar is owned by one or more CPAN accounts. Usually you gain ownership on a "first-come" basis, but it can also be transferred. Only the "official" tarball for a given namespace is indexed. So if the owner of Foo::Bar uploads Foo-Bar-1.23.tar.gz, Foo::Bar will point at Foo-Bar-1.23.tar.gz. If I (presumably unauthorized) upload Foo-Bar-1.24.tar.gz, the index will still point at Foo-Bar-1.23.tar.gz.

Here's the rub. Not owning a module doesn't stop you from uploading. It also says nothing about who owns the distribution. gitpan is by distribution. Now it gets a little more difficult to figure out who owns what. For example, look at MQSeries-1.30. All but two modules are unauthorized. BUT notice that MQSeries.pm is authorized. The CPAN index does point MQSeries at M/MQ/MQSERIES/MQSeries-1.30.tar.gz (everything else is at 1.29). Likely what we have here is a botched ownership transfer.

How do you mark that? search.cpan.org seems to take the strict approach: if anything's unauthorized, it's out. The CPAN uploads database I have available is the opposite: if anything is authorized, it's in. What to do?
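
The two policies boil down to a tiny predicate. Here is a minimal sketch, assuming each module in a release has already been labelled `authorized` or `unauthorized` by some lookup that isn't shown. The function name and the labels are made up for illustration; nothing here is an interface PAUSE, search.cpan.org or gitpan actually provides.

```shell
#!/bin/sh
# Hypothetical helper: should this release count as part of a distribution?
# First argument is the policy, the rest are per-module labels.
#   strict  - every module must be authorized
#   lenient - one authorized module is enough
release_allowed() {
    policy=$1
    shift
    authorized=0
    unauthorized=0
    for status in "$@"; do
        case $status in
            authorized)   authorized=$((authorized + 1)) ;;
            unauthorized) unauthorized=$((unauthorized + 1)) ;;
            *) echo "unknown status: $status" >&2; return 2 ;;
        esac
    done
    case $policy in
        strict)  [ "$authorized" -gt 0 ] && [ "$unauthorized" -eq 0 ] ;;
        lenient) [ "$authorized" -gt 0 ] ;;
        *) echo "unknown policy: $policy" >&2; return 2 ;;
    esac
}

# A mixed release: one module authorized, the others not.
for policy in strict lenient; do
    if release_allowed "$policy" authorized unauthorized unauthorized; then
        echo "$policy: in"
    else
        echo "$policy: out"
    fi
done
```

A mixed release is exactly where the two policies disagree, which is the MQSeries situation.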

Then there's stuff like lcwa. Looks like junk, but here's the thing. CPAN has a global module index to worry about, gitpan doesn't. Each distribution is its own distinct unit. So lcwa does no harm on gitpan, it can be recorded.

What does matter? The continuity of a distribution's releases, and this is precisely what CPAN does not track. It doesn't even have a concept of a distribution, just modules inside tarballs. CPAN authors playing nice with tarball naming conventions gives the illusion of a continuous distribution.

So... for a given release of a distribution (i.e. a tarball), how does gitpan determine if the release should be included in the distribution's history? If we go strict, like search.cpan.org, we're going to lose legit releases and even entire distributions (like lcwa). If we let anything in, gitpan is not showing an accurate history.

Add the complication that authorization changes. For example, the MQSeries module ownership will eventually be fixed. What then?

First pass through, gitpan is ignoring this problem. It's just chucking everything from BackPAN in. Second pass will rebuild individual repos with collected improvements. This is the first thing I'm not sure what to do about.

Suggestions?

Friday December 04, 2009
02:33 AM

gitPAN's first success story

I'm sitting next to David Wheeler at a bar. He co-maintains Pod::Simple. The repo is on github. Previously it was in SVN. Before that, Sean Burke's hard drive. The SVN repo was imported into git, but as Sean had no repo they're left with a history gap. He wants that history back.

I imported Pod-Simple into gitPAN for him, then went about pasting his repository on top of gitPAN's. This means a rebase. First, we fetched the gitpan repo into his repository.

        git remote add gitpan git://github.com/gitpan/Pod-Simple.git
        git fetch gitpan

Then we find the first commit to David's repo and note the date: Nov 18th, 2005. We find the commit just before that in gitpan/master, the 3.02 release, and it's tagged 3.02. Then rebase all of David's repo on top of that tag.

        git rebase --onto 3.02 --root master

That replays all of master on top of the tag 3.02 from gitpan. Ta da! Done. You can remove the gitpan remote.

        git remote rm gitpan

As a final bit of cleanup, we made sure all the release tags after 3.02 are pointing to David's commits and not gitPAN's. I'll leave retagging as an exercise for the reader.
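
For readers who would rather skip the exercise, here is one possible approach, as a sketch only. After the rebase, the gitPAN tags for releases after 3.02 point at commits that are not in master's history, so `git tag --no-merged` should list them (that option needs a reasonably recent git). Picking the matching commit for each tag is still done by hand. The function names are hypothetical wrappers, not part of git or gitPAN.

```shell
#!/bin/sh
# List tags whose commits are not reachable from the given branch.
# After the rebase these are the tags still pointing at gitPAN's commits.
stale_tags() {
    git tag --no-merged "${1:-master}"
}

# Move an existing tag to a commit you picked by hand.
# usage: retag TAG COMMIT
retag() {
    git tag -f "$1" "$2"
}
```

Run `stale_tags master`, find the matching commit for each tag in `git log`, and call `retag` once per tag.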

Push that (it has to be forced, since it's not a fast-forward) and it's done.

        git push origin master -f

gitPAN is currently using lightweight tags, so they have to be pushed too.

        git push --tags

Pod-Simple's history is complete.

Thursday December 03, 2009
06:55 PM

gitPAN

If you're like me, and I know I am, you've often wondered things about other people's CPAN modules like: what changed in this release; when did this bug/feature get introduced; where's that old version that got deleted off CPAN?

search.cpan.org provides some web tools, which is very cool, but pointy-clickies only go so far. What you really want is a repository of releases.

Sometimes you can find the project's repository, usually involving digging through the documentation. Now projects are starting to use the repository resource in their metadata and search.cpan.org links to it so things have gotten a little better. And maybe it's complete, or maybe the history cuts off where the last maintainer took over. And maybe they've tagged their releases in some sane way.

Wouldn't it be nice if every CPAN distribution had a repository of all their releases, all tagged the same way? The idea has been kicking around for a while. Eric Wilhelm took a stab at it with Subversion, but it's less than trivial to get a useful history out of a pile of tarballs with SVN. And then where do you host it?

Turns out git makes this process trivial. You delete all the files, unpack the new release, and commit it all. Git figures out what moved, what got deleted, what got added, etc. So that's one part solved.
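
The delete, unpack, commit cycle can be sketched in a few lines of shell. This is an illustration under assumptions, not what the gitPAN tooling actually runs: it assumes each tarball unpacks into a single top-level directory, that your tar supports `--strip-components` (GNU tar and bsdtar do), and that a committer identity is already configured. The function name `import_release` is made up.

```shell
#!/bin/sh
# Hypothetical helper: record one release tarball as one commit plus a tag.
# usage: import_release /path/to/repo /absolute/path/to/Foo-Bar-1.23.tar.gz 1.23
import_release() (
    repo=$1 tarball=$2 version=$3
    cd "$repo" || exit 1

    # Delete every file from the previous release, but keep .git.
    find . -mindepth 1 -maxdepth 1 ! -name .git -exec rm -rf {} + || exit 1

    # Unpack the new release in place, dropping its top-level directory.
    tar -xzf "$tarball" --strip-components=1 || exit 1

    # Stage additions, modifications and deletions, then commit and tag.
    git add -A || exit 1
    git commit -q --allow-empty -m "import $version" || exit 1
    git tag "$version"
)
```

Call it once per release, oldest first, and the repository ends up with one tagged commit per tarball.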

Then brian d foy has been working on indexing BackPAN. Leon made a module to access this index, Parse::BACKPAN::Packages. That's another part solved.

Yanick developed a pile of tools to make turning a CPAN distribution into a git repository easy including one to import all the releases from BackPAN. Put a loop around that and call it done.

Finally, hosting. I'm never one to DIY system administration, so plop it on github. Their APIs make creating repositories trivial and their web site provides far more functionality than I'd ever want to maintain. And, perversely, once I get tagging working you can download tarballs! Unfortunately BackPAN is about 20 gigs and while the size of the resulting git repos is looking to be far smaller (projects with a lot of releases come out much smaller, projects with few releases come out a little larger) it still bleeds well over their 300M free account limit. Hopefully they'll be receptive to a little begging.

I give you gitPAN, a (soon to be) complete set of repositories for all of BackPAN. The process is fully automated, but I'm still tweaking things and the available repositories are sporadic. There are a lot of optimizations and small corrections which need doing. My tweaked versions of Parse::BACKPAN::Packages and Git::CPAN::Import are available.

There are two open problems. First, I haven't even looked into how to keep the repositories up to date. There are some new indexes on BackPAN as part of the File::Rsync::Mirror::Recent mirroring optimization Andreas has been working on which will probably prove useful. If code suddenly appeared to handle that, that would be great.

Second, I know of no historical index of authorized releases. This means gitPAN will just pull in everything on BackPAN causing a slightly skewed history. If a solution to that appeared, that too would be great.

I don't have any clear idea of what this might be used for, nothing to justify its scale. But I figure make the data available and someone will do something awesome with it. "If you build it they will come." If nothing else it'll make patching easier: I've already started generating gitPAN repos for modules I'm about to patch and cloning them to work on. Hopefully this will be more than an extended yak shaving exercise.

Wednesday July 15, 2009
04:56 PM

I don't understand why people think Unix is hard.

$ pax --help
pax: illegal option -- -
usage: pax [-cdnvzO] [-E limit] [-f archive] [-s replstr] ... [-U user] ...
           [-G group] ... [-T [from_date][,to_date]] ... [pattern ...]
       pax -r [-cdiknuvzDOYZ] [-E limit] [-f archive] [-o options] ...
           [-p string] ... [-s replstr] ... [-U user] ... [-G group] ...
          [-T [from_date][,to_date]] ...  [pattern ...]
       pax -w [-dituvzHLOPX] [-b blocksize] [ [-a] [-f archive] ] [-x format]
           [-B bytes] [-s replstr] ... [-o options] ... [-U user] ...
           [-G group] ... [-T [from_date][,to_date][/[c][m]]] ... [file ...]
       pax -r -w [-diklntuvDHLOPXYZ] [-p string] ... [-s replstr] ...
           [-U user] ... [-G group] ... [-T [from_date][,to_date][/[c][m]]] ...
           [file ...] directory

(just for fun, one of those lines is hard tab indented)

04:45 PM

No, it's not about make.

In another edition of "I cut and paste from my email", I was responding to a user who doesn't understand what the problem is with using Makefiles to generate Perl modules.

Well, he asked...

-----------------------------


> I could never follow the arguments against Makefiles or Makefile generators.

Fortunately I wrote a talk alllll about it called
MakeMaker Is DOOMED!

The make dependency is only part of the problem. There's a small mountain of compatibility issues (which make dialect? which shell? what tools are available? are they GNU or BSD or some broken 3rd party thing? did you know that according to POSIX "cp -r" is deprecated?! or that it considers tar a legacy tool and we should all use something called pax?) in which you wind up rewriting a lot of things in Perl anyway... painfully squeezed through a shell one-liner like dough through a pasta maker.

A shell that often can only safely handle 1024 characters (some versions of Windows, old Unixen, VMS only gives us 255)... oh, and you don't really know how long that command might wind up being because it contains make variables which can change at runtime so you just sort of have to guess and hope nobody uses too many modules or something.

And god forbid you have a filename with a space or special character in it. Now you need to escape everything... but you don't know how a variable is going to be used. Maybe it'll be used by the shell...


        $(CMD) $(VARIABLE)

Maybe it'll already be quoted.


        $(PERL) -e 'print $(VARIABLE)'

I'm sorry, I got the wrong quoting. Single quotes don't work on most Windows makes so I have to double quote them.


        $(PERL) -e "print $(VARIABLE)"

But double quotes expand variables on Unix, and I don't know what's inside $(VARIABLE). What a conundrum! Now I need to write a portable method to write portable one-liners so I can put portable Perl code in my very unportable Makefile.
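
The Unix half of that conundrum is easy to reproduce. A small sketch, assuming a POSIX shell: the string assignments stand in for make pasting the value of `$(VARIABLE)` into the command line before the shell sees it. The value and the `DEMO` variable are invented for the demonstration.

```shell
#!/bin/sh
# Stand-in for $(VARIABLE): make pastes its value into the command line as
# plain text before the shell runs. The value and DEMO are invented.
DEMO=expanded
export DEMO
value='path is $DEMO'

# The two command lines make would hand to the shell.
cmd_single="printf '%s\n' '$value'"
cmd_double="printf '%s\n' \"$value\""

out_single=$(sh -c "$cmd_single")   # single quotes: text arrives untouched
out_double=$(sh -c "$cmd_double")   # double quotes: the shell expands $DEMO first

printf '%s  ->  %s\n' "$cmd_single" "$out_single"
printf '%s  ->  %s\n' "$cmd_double" "$out_double"
```

The single-quoted version keeps the literal `$DEMO`; the double-quoted one, the only form that works on most Windows makes, silently rewrites it.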

And what are the odds a module author is going to consider any of that when they extend MakeMaker?

Even if you wrote your own make-like build tool that solved all the compatibility problems and everyone magically had it installed, you'd still be up shit creek. You're taking a dynamic situation with lots of state and turning it into a bunch of static one-liners with no state. It's like writing down the instructions about how to build a custom tailored car for an idiot child to do. You wind up saying "fuck it, we're not even going to try doing that".

See this recent perfectly sensible ticket I had to reject for a concrete example of something which should be easy but is almost impossible in MakeMaker.

Finally, to customize how their modules are built you're asking Perl programmers to write make (through the funhouse mirror of MakeMaker), something they're totally unfamiliar with. The days of Perl programmers being old C programmers are long gone. Any alternative build system has the same problem except NOBODY will know it. Perl programmers know how to write Perl.

Aren't you glad you asked?

There is one thing MakeMaker has over Module::Build. You can skim the Makefile to figure out what it's going to do (except for all the logic that went into generating the Makefile in the first place). This makes it more visible to those who know make; for everyone else it's just more magical gibberish. Module::Build is more visible to those who know Perl and know to look in Module::Build::Base. It would be nice if MB had an equivalent to "make -d" to report what's going on, what actions are being called and what dependencies are being resolved.

Tuesday July 07, 2009
06:29 PM

perl/vendor/site explained

As part of answering a MakeMaker ticket I wrote out a fairly solid explanation of what the three different install locations mean for CPAN modules.

> (Personally I've always found the perl/site/vendor distinction and the
> triplicated set of directories to be fairly impenetrable :-), beyond
> that only 1 of the 3 at various times did something like I thought I
> wanted!)

Part of the problem is the whole philosophy is never fully explained. Part of it is that until 5.12 the default look up order is wrong.

It's pretty simple. Think of it like MacPorts or fink, which don't control the entire operating system. You effectively have three managers working at the same time. You have the user installing things by hand. You have the package manager. And then there's the operating system.

In a flat layout they'll stomp all over each other. You might install an upgraded program from MacPorts and then have the next OS upgrade downgrade it again. The user might install the latest version from a tarball and then MacPorts installs an older version on top of it next upgrade.

"site" is what gets installed by the user, like /usr/local.

"vendor" is what gets installed by the package manager. /opt/local for MacPorts, /sw for fink.

"core" is what originally shipped with the operating system (or in Perl's case with Perl itself), like /usr.

You look in /usr/local first, then /opt/local, then /usr. In fact, that's how my PATH is set up. Unfortunately Perl itself incorrectly puts "core" first. Fortunately Debian (and I think Redhat now) fixes that. And 5.12 will finally fix it for all.

Packagers should be setting INSTALLDIRS=vendor. The CPAN shell should be using INSTALLDIRS=site. Nothing should be using perl/core. The broken lookup order complicates this because upgrades to dual-life CPAN modules have to go into core else they're not found, but Debian users don't have to worry about that.
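
The lookup order argument is just first-match-wins over a list of directories. Here is a toy sketch of why the order matters, using throwaway directories as stand-ins for the three locations. The `first_found` helper and the directory names are made up; real Perl searches @INC, which this only imitates.

```shell
#!/bin/sh
# Print the first directory in the list that contains the named file,
# the same first-match rule a search path uses.
first_found() {
    name=$1
    shift
    for dir in "$@"; do
        if [ -e "$dir/$name" ]; then
            printf '%s\n' "$dir"
            return 0
        fi
    done
    return 1
}

# Throwaway stand-ins for the three install locations, each with a Foo.pm.
root=$(mktemp -d)
mkdir "$root/site" "$root/vendor" "$root/core"
touch "$root/site/Foo.pm" "$root/vendor/Foo.pm" "$root/core/Foo.pm"

# site, then vendor, then core: the copy the user installed by hand wins.
good=$(first_found Foo.pm "$root/site" "$root/vendor" "$root/core")
# core first: an upgrade installed into site is never seen.
bad=$(first_found Foo.pm "$root/core" "$root/site" "$root/vendor")

echo "site first finds: ${good##*/}"
echo "core first finds: ${bad##*/}"
```

With core first, the only way to make an upgrade visible is to install it into core, which is the dual-life module problem described above.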

Saturday June 20, 2009
03:55 AM

The best diagram about git evar

Leto and I got chromatic to actually use github today at OS Bridge. In the process of explaining it to him I drew up the most useful diagram of git you will ever see. It illustrates the five layers (stash, working copy, staging area, local repo, remote repo), how you move changes between them and what layers different invocations of diff act on.

I wish someone had shown me this months ago.

UPDATE: Of course I'm not the first person to think of this. Here's a much cleaner version of what I did from an article about the git workflow.
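
Since the diagram itself doesn't survive in text form, here is a rough sketch of the diff half of it. The function builds a throwaway repository, parks one change in each local layer, and prints which file each invocation reports. The mapping in the comments is my reading of standard git behaviour rather than a transcription of the drawing, and `layers_demo` is a made-up name.

```shell
#!/bin/sh
# Build a throwaway repo with one change parked in each local layer and
# report which file each diff invocation sees.
layers_demo() (
    repo=$(mktemp -d) || exit 1
    cd "$repo" || exit 1
    git init -q . || exit 1
    git config user.name demo
    git config user.email demo@example.com

    for f in stashed staged unstaged; do echo base > "$f"; done
    git add . && git commit -q -m base || exit 1

    echo change >> stashed && git stash >/dev/null   # working copy -> stash
    echo change >> staged && git add staged          # working copy -> staging area
    echo change >> unstaged                          # stays in the working copy

    # working copy vs staging area
    echo "git diff:          $(git diff --name-only)"
    # staging area vs local repo (HEAD)
    echo "git diff --cached: $(git diff --cached --name-only)"
    # working copy vs local repo (HEAD)
    echo "git diff HEAD:     $(git diff HEAD --name-only | tr '\n' ' ')"
    # stash vs the commit it was made on
    echo "git stash show:    $(git stash show --name-only)"
)
```

The fifth layer needs a remote, so it is left out of the demo; once one has been fetched, something like `git diff master origin/master` compares the local repo with the remote one, assuming those branch names.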

Sunday June 14, 2009
09:03 PM

Trapped In A Room With Schwern

I signed on at YAPC to do a talk simply entitled Trapped In A Room With Schwern. Robert Blackwell said this of what he wants:

"I think you have a nice bent/slant/angle etc on a lot of stuff. I want you to get people talking. And no I would not expect every talk to be about perl. It would make me sad if they were all perl. But if you do it I would hope you could get everyone in the room to get excited about something. And I hope you would bore the crap out of others. Why do I think that b/c your audience is everyone from Larry to noob. You are not going to shock both of them or bore both of them with the same stuff."

I have a goldfish's memory for what's interesting to me. As I'm writing up a list of things to talk about, I'm thinking that I'm missing something really obvious that I've simply forgotten about or that seems old and obvious to me.

So, suggestions? What would people like to hear about? Perl or otherwise. What do I tend to babble animatedly about between sessions?

Sunday June 07, 2009
07:20 PM

Unit test your Javascript in 8 seconds!

I posted about my WWW::Selenium + Test.Simple hack yesterday to enable automated Javascript unit testing. One of the problems was it was very slow. It had to start and kill a Firefox instance between each test which takes 8 seconds per test on my machine. Running 7 tests is a full minute.

Solution? Cache the selenium object! This will reuse the same Firefox session between tests so you only get slammed by the startup cost once. Now my 7 tests run in 8 seconds, the time to start up Firefox. That's awesome!

Will reusing the same Firefox process cause a problem? Unlikely. When I test web sites, with or without Selenium, I sure don't restart Firefox between checks. And neither will your users, so this is far more realistic. Web browsers are designed to isolate page requests from one another.

The prototype works. Future directions...

* Roll selenium-server into the distribution.
* Automate starting the selenium server.
* Add a config file...
    * Which browser(s) to use?
    * Which selenium server to use, or start its own?
    * What file extensions to test with selenium?
* Rerun tests across multiple browsers
* Turn the HTML wrapper into a configurable template
* Make it play nice with prove.
    * Turn it into something which can be used with --exec
    * Turn it into something which can be put into .proverc
* Modularize it
* Figure out how to keep the Firefox process from appearing
    * Or at least run backgrounded