
All the Perl that's Practical to Extract and Report

Help Regenerate the Phalanx 100

posted by pudge on 2004.12.21 0:46
Andy Lester writes "The Phalanx 100 is a list of the 'top 100' modules on CPAN, and by extension, those that should have the most attention paid to them by the Phalanx project.

The first time I generated the P100 was over a year ago, and things are old and stale. Distributions have changed names (CGI::Kwiki is now Kwiki, for example). Some distros have come and some have gone. It's time for an update.

This time, YOU can help determine the P100."

The source data, generated from the logs of the main CPAN mirror at pair.com, is available for download at http://petdance.com/random/cpan-gets.gz. Write code that analyzes the data and generates the top 100 modules.

What should your code do? It's up to you! Publish the code somewhere (use.perl.org, perlmonks, whatever) and let me see it. I'm not sure if I'll take someone's decisions directly, or use ideas, or how I'll do it, but the more working code I have to pick from, the better.
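As one minimal starting point, here's a sketch of the naive approach: recover the distro name from each tarball path and count raw downloads. The sample data is taken from the output shown below, but the version-stripping regex and variable names are my own guesses, not part of any official tooling.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample lines in the "id path" format of the cpan-gets data; in real
# use these would be read from the uncompressed file instead.
my @lines = (
    '85576 /authors/id/J/JR/JROGERS/Net-Telnet-3.01.tar.gz',
    '85576 /authors/id/J/JR/JROGERS/Net-Telnet-3.02.tar.gz',
    '10001 /authors/id/K/KW/KWOOLERY/Buzznet-API-0.01.tar.gz',
);

my %count;
for (@lines) {
    my ($id, $path) = split ' ', $_, 2;

    # Strip version and extension to get the distro name, e.g.
    # "Net-Telnet-3.01.tar.gz" => "Net-Telnet".
    next unless $path =~ m!/([^/]+?)-v?[\d._]+\.(?:tar\.gz|tgz|zip)$!;
    $count{$1}++;
}

# Rank distros by raw download count, most popular first.
for my $distro (sort { $count{$b} <=> $count{$a} || $a cmp $b } keys %count) {
    print "$count{$distro}\t$distro\n";
}
```

Raw counting like this is exactly what the version-slurping problem described below will skew, so treat it as a baseline, not an answer.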

Also, the last time I created a P100, I omitted any modules that were in the core distribution. This time, I do want to include core modules, although I do want to have them noted somehow. Richard Clamp's C will be a great help with this.
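If the mangled module name above is Module::CoreList (Richard Clamp's module for querying which modules ship with which perl release — my guess, given the stripped POD markup), then noting core modules could look like this sketch:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Module::CoreList;

# For each module, note whether it has ever shipped with the perl core.
# first_release() returns the perl version that first bundled the
# module, or undef if it has never been in core.
for my $mod (qw(File::Spec WWW::Yahoo::DrivingDirections)) {
    my $first = Module::CoreList::first_release($mod);
    print "$mod\t",
        defined $first ? "core since perl $first" : "CPAN only",
        "\n";
}
```

That keeps core modules in the list while tagging them, rather than omitting them as last time.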

Whatever you do, however you do it, I need to know about your code no later than January 10th, 2005. Email me at C. There's going to be an article about the Phalanx project going up on perl.com soon after that, and I need to have an updated version of the P100 up (replacing http://qa.perl.org/phalanx/distros.html) by then.

About the data
I used the following code to analyze the Apache logs for the main CPAN mirror at Pair.com, covering November 1 through December 15, 2004.

    #!/usr/bin/perl

    use strict;
    use warnings;

    my %id;
    my $next_id = 10000;

    while (<>) {
        next unless m!^\S+ (\S+) .+ "GET ([^"]+) HTTP/\d\.\d" 200!;
        my ($ip,$path) = ($1,$2);

        study $path;

        # Skip directories
        next if $path =~ /\/$/;             # Directory
        next if $path =~ /\/\?/;            # Directory with sort parms

        # Skip certain directories
        next if $path =~ /^\/(icons|misc|ports|src)\//;

        # Skip certain file extensions
        next if $path =~ /\.(rss|html|meta|readme)$/;

        # Skip CPAN & distro maintenance stuff
        next if $path =~ /CHECKSUMS$/;
        next if $path =~ /MIRRORING/;

        # Module list stuff
        next if $path =~ /\Q00whois./;
        next if $path =~ /\Q01mailrc./;
        next if $path =~ /\Q02packages.details/;
        next if $path =~ /\Q03modlist./;

        my $id = ($id{$ip} ||= ++$next_id);

        print "$id $path\n";
    }
This gives lines like this:
    16395 /authors/id/K/KE/KESTER/WWW-Yahoo-DrivingDirections-0.07.tar.gz
    10001 /authors/id/K/KW/KWOOLERY/Buzznet-API-0.01.tar.gz
    85576 /authors/id/J/JR/JROGERS/Net-Telnet-3.01.tar.gz
    85576 /authors/id/J/JR/JROGERS/Net-Telnet-3.02.tar.gz
    85576 /authors/id/J/JR/JROGERS/Net-Telnet-3.03.tar.gz
The 5-digit number is an ID number for a given IP address. I found that some IPs were routinely slurping down the entire version history of modules, which will probably skew the statistics toward distributions with many revisions.

How should these be accounted for in the analysis? I don't know. That's one of the reasons that I put this out for all to work on.
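One possible way to blunt that skew, sketched here under my own assumptions (the version-stripping regex and the sample data are illustrative, not part of the official tooling): count each distribution at most once per downloader ID, so an IP that slurps every Net-Telnet release still casts a single vote.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Illustrative sample: ID 85576 slurps three Net-Telnet releases and
# ID 16395 fetches one, so raw counting would score Net-Telnet 4,
# while per-downloader counting scores it 2.
my @lines = (
    '85576 /authors/id/J/JR/JROGERS/Net-Telnet-3.01.tar.gz',
    '85576 /authors/id/J/JR/JROGERS/Net-Telnet-3.02.tar.gz',
    '85576 /authors/id/J/JR/JROGERS/Net-Telnet-3.03.tar.gz',
    '16395 /authors/id/J/JR/JROGERS/Net-Telnet-3.03.tar.gz',
    '10001 /authors/id/K/KW/KWOOLERY/Buzznet-API-0.01.tar.gz',
);

my (%seen, %voters);
for (@lines) {
    my ($id, $path) = split ' ', $_, 2;
    next unless $path =~ m!/([^/]+?)-v?[\d._]+\.(?:tar\.gz|tgz|zip)$!;
    my $distro = $1;

    # One vote per (downloader, distro) pair, no matter how many
    # versions that downloader fetched.
    $voters{$distro}++ unless $seen{"$id $distro"}++;
}

print "$voters{$_}\t$_\n"
    for sort { $voters{$b} <=> $voters{$a} || $a cmp $b } keys %voters;
```

This still over-weights distros with many distinct downloaders of old versions, but it at least stops a single archive-crawler from dominating the ranking.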

I welcome your comments, suggestions and help on this.

The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • I've added a timestamp and the user agent to the log format. I hope that's not too confusing.

  • Richard Clamp's C will be a great help with this. [...] Email me at C.

    Presumably a result of POD markup. At first I was wondering why Perl couldn't do the job perfectly well without help from C, whether Richard was writing it or not.

  • Shouldn't you for clarity also list the recursive prereqs of the P100? I mean, you *are* implicitly testing the prereqs, too.

  • Personally, I'd like to see rankings based on cpanratings and/or kwalitee, if the components of kwalitee could be assigned some objective values.
    • I think you are confused about the intent of the P100 project. (Then again, *I* might be, wouldn't be the first time...)

      The intent of P100 is *not* to select the top 100 "best quality" or "coolest" modules. It is about selecting the most used (in other words, most important) modules and guaranteeing their quality by testing. So what is the quality of the modules as of now is in a twisted way kind of irrelevant - what matters is that people use those and THEREFORE it would be good that the modules are O
    • The point of Phalanx is to have the most-used modules have a solid test suite so that those 100 modules can be used as a test bed for Ponie. If DBI is the most-used module, then it's crucial that Ponie work with it. The more tests that are in DBI, the more testing of Ponie there is as well.

  • Careful: core modules won't even appear in the download stats of CPAN sites. You want the most often used modules, not the most downloaded ones.

    And what about modules that one particular person uses 5 times more often in scripts than another one does? Is it more important? I would think so.

    • Core modules "will be handled in a different phase of the project." But they do show up on dependency scans. Quite a bit, actually. Preliminary scans show the three highest direct dependencies are Test::More (500+), Carp (100+) and File::Spec (100+).
  • There's a thread with code and output that may be of interest. http://www.perlmonks.org/?node_id=416363