
http://ali.as/

Journal of Alias (5735)

Thursday March 08, 2007
08:17 AM

Want to help with the development of CPAN?

[ #32616 ]

At the last Sydney.pm meeting, someone suggested that I post a "top 10 things I could use some help with" list on my website.

It's an interesting idea, but one that would be tricky.

You see, the most important things that need to be done often block on political and communication issues.

As evidence of this, just look at how much work gets done at the big hackfests, simply because all or most of the major players are in the same room.

Communications, politics and management also have the tendency to have long-term effects. As an example, look at the current less-than-optimal state of Template Toolkit. Because Andy has been tied up with other things and TT doesn't have a succession plan for release management, TT users have in some cases been left stranded (TT hasn't installed on Win32 for a year now) despite members of the community generating patches and being willing to deal with the problems.

This is the Continuity or Death theme again.

But I digress.

While I'm not sure I have a list of 10 things that are all politics free, I certainly have one big highlight. One project you could write or help write that, I feel, would be hugely important to the future improvement of the CPAN.

And it goes something like this... (shimmer out to dream/animation sequence)

------------------------------------------------------------------------

The CPAN Open Data API

Module interrelations on the CPAN are now too complex for them to be maintained and managed by wetware alone.

There are a number of important issues across the graph of module dependencies, such as bitrot and back-compatibility.

Most of these issues can or could be expressed programmatically. Metrics could easily be developed for many of them.

The data required to develop these metrics is spread out over many different CPAN services, and is currently unapproachable.

Over the course of the year I hope to see every CPAN-related service exporting database dumps, most likely in the form of SQLite databases. A number do this already. I'd like to see all of them doing so soon.

Access to the data is a different issue to being able to exploit the data.

With this in mind, I'd like to see the following.

1. A unified schema that ties together all CPAN-related datasets, implemented using (most likely) SQLite.

2. A pre-built CPAN module that would provide a convenient ORM layer over the top of this schema. This would most likely be done using something like DBIx::Class.

3. A second CPAN module that would pull data from the CPAN index and every other CPAN-related system that publishes datasets, and munge them together to create the SQLite database containing all the data.
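As a rough illustration of points 1 and 3, here is a minimal sketch using only Python's standard sqlite3 module (the real thing would be Perl and something like DBIx::Class; Python is used here purely for brevity). Every table name, column name and sample row below is hypothetical and is not taken from CPAN-Index or from any real CPAN service.

```python
import sqlite3

# Hypothetical unified schema: one table per source dataset,
# all keyed on the distribution name.
SCHEMA = """
CREATE TABLE distribution (
    name         TEXT PRIMARY KEY,
    last_release TEXT NOT NULL
);
CREATE TABLE dependency (
    dist       TEXT NOT NULL REFERENCES distribution(name),
    depends_on TEXT NOT NULL
);
CREATE TABLE bug_count (
    dist      TEXT PRIMARY KEY REFERENCES distribution(name),
    open_bugs INTEGER NOT NULL
);
"""

def build_unified_db(index_rows, dependency_rows, bug_rows):
    """Munge three separately published datasets into one SQLite database."""
    db = sqlite3.connect(":memory:")
    db.executescript(SCHEMA)
    db.executemany("INSERT INTO distribution VALUES (?, ?)", index_rows)
    db.executemany("INSERT INTO dependency VALUES (?, ?)", dependency_rows)
    db.executemany("INSERT INTO bug_count VALUES (?, ?)", bug_rows)
    db.commit()
    return db

# Made-up sample data standing in for the real dumps.
db = build_unified_db(
    index_rows=[("Foo-Bar", "2005-01-10"), ("Baz-Qux", "2007-02-01")],
    dependency_rows=[("Baz-Qux", "Foo-Bar")],
    bug_rows=[("Foo-Bar", 12), ("Baz-Qux", 1)],
)

# A single query can now span what used to be three separate sources.
rows = db.execute("""
    SELECT d.name, d.last_release, b.open_bugs, COUNT(dep.dist) AS dependents
    FROM distribution d
    JOIN bug_count b ON b.dist = d.name
    LEFT JOIN dependency dep ON dep.depends_on = d.name
    GROUP BY d.name
    ORDER BY d.name
""").fetchall()
```

The point of the sketch is only that once the datasets share a key, joining them becomes a one-liner; the actual schema would need far more care (versions, authors, test results and so on).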

------------------------------------------------------------------------------

You can see the beginnings of my attempt at this at http://svn.phase-n.com/svn/cpan/trunk/CPAN-Index/.

Unfortunately, I'm a DBIx::Class newbie and I really haven't the time to work on this anywhere near as much as I would like.

Why is this so important?

Having these three elements in place enables not just access to the CPAN data, but CONVENIENT and PROGRAMMATIC access to it.

It means that anybody that would like to develop a metric for, say, bitrot (using age of the last release and the number of bugs reported and such) can do so fairly easily.
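To make that concrete, here is one possible shape for such a metric, sketched in Python for brevity. The formula (years since the last release, scaled by the number of open bugs) is an assumption invented for illustration, not an agreed definition of bitrot, and the function name and sample dates are hypothetical.

```python
from datetime import date

def bitrot_score(last_release, open_bugs, today):
    """Hypothetical metric: years since last release, scaled by open bugs."""
    age_years = (today - last_release).days / 365.25
    # The "1 +" keeps age relevant even for modules with no reported bugs.
    return age_years * (1 + open_bugs)

# Made-up examples: an old, buggy module versus a fresh, clean one.
score_old = bitrot_score(date(2004, 3, 8), open_bugs=9, today=date(2007, 3, 8))
score_new = bitrot_score(date(2007, 1, 8), open_bugs=0, today=date(2007, 3, 8))
```

With the data in one place, swapping in a different weighting is a small change rather than a new scraping project.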

And from there it would be fairly trivial to throw a Catalyst or ttree website on top of those metrics to create interesting new services.

Imagine a sort of "CPAN's 10 Most Rotten Modules" website which lists the modules that are both the most rotten AND have the most other modules depending on them.
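A ranking like that could be little more than one query. The sketch below (Python with the standard sqlite3 module, for brevity) assumes a hypothetical module_rot table holding a precomputed rot score per module and a hypothetical dependency table; the names, the scores and the choice to multiply rot by the number of dependents are all illustrative assumptions.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE module_rot (name TEXT PRIMARY KEY, rot REAL NOT NULL);
CREATE TABLE dependency (dist TEXT NOT NULL, depends_on TEXT NOT NULL);
""")

# Made-up data: Old-Core is rotten and widely used, Old-Leaf is rotten
# but nothing depends on it, Fresh-Core is healthy and moderately used.
db.executemany("INSERT INTO module_rot VALUES (?, ?)",
               [("Old-Core", 8.0), ("Old-Leaf", 9.5), ("Fresh-Core", 1.0)])
db.executemany("INSERT INTO dependency VALUES (?, ?)",
               [("App-X", "Old-Core"), ("App-Y", "Old-Core"),
                ("App-Z", "Old-Core"), ("App-X", "Fresh-Core"),
                ("App-Y", "Fresh-Core")])

# Rank by rot multiplied by the number of direct dependents, top 10 only.
top = db.execute("""
    SELECT m.name, m.rot * COUNT(d.dist) AS impact
    FROM module_rot m
    LEFT JOIN dependency d ON d.depends_on = m.name
    GROUP BY m.name
    ORDER BY impact DESC, m.name
    LIMIT 10
""").fetchall()
```

Here Old-Leaf has the highest raw rot score but lands at the bottom, because nothing depends on it; that is the "rotten AND depended upon" combination in miniature.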

These sorts of tools would allow us to focus maintenance efforts where they are most needed, which is something that will be increasingly important as we head for 20,000 modules.

  • This is something I'd like to work on, as soon as I have a lot of free time. My interest is getting at the data and populating the databases.
  • This is exactly the kind of project I was anticipating with my very first journal entry [perl.org].

    I would love to help out with this. Unfortunately, I cannot promise much in the way of tuits in the near future. However, at the very least, expect me to publish some thoughts on the design aspects of this after I dust them off and edit them.

  • At GPW, I discovered the quite useful module DBD::PgLite::MirrorPgToSQLite. Using this, it's totally easy to generate a SQLite dump of the CPANTS DB.

    As soon as I find some time to finish the setup on the new (yet again) server, you can expect a daily dump of all CPANTS data to sqlite.

    WRT DBIx::Class: I'm already using it to access the CPANTS DB, and could probably help out with creating some of the tools you're talking about. BUT: I have a big project to finish by 2nd April, plus my day job, plus YAPC
  • I'm interested in helping with this as well. Last night I ran into a use case:

    I was thinking about packaging up a script with the whole dependency chain so it could be easily installed. CPAN-Index could let me query to see exactly what the dependency chain is.

          Mark