The Zen of Comprehensive Archive Networks

posted by hfb on 2002.11.12 11:13
jhi writes "It seems that there is a lot of interest in having archives for other languages similar to what CPAN [1] is for Perl. I should know; over the years people from at least the Python, Ruby, and Java communities have approached me or other core CPAN people to ask, basically, 'How did you do it?'. Very recently I've seen even more interest from some people in the Perl community wanting to actively reach out a helping hand to other communities. This 'missive' tries to describe my thinking and help people wanting to build their own CANs. Since I hope this message will somehow end up reaching the other language communities, I will explicitly include URLs that are (hopefully) obvious to Perl people."
[1] http://www.cpan.org/

I'll start negatively and end with hopefully more constructive notes; however, these will build on the denials.

In the following, Mumble and mumble stand for any language other than Perl, or any combination of languages other than Perl.

First, the negative statements.

  1. CPAN shall not 'piggyback' other languages. (There shall not be a mumble/ top level directory.)
    • Rationale: CPAN is CPAN is CPAN. CPAN carries Perl. This implies all kinds of different contracts, explicit and implicit.
    • Some people in the Mumble community will take offense at CPAN carrying Mumble.
    • Some people in the Perl community will take offense at CPAN carrying Mumble.
    • Some CPAN mirrors will take offense at suddenly having to carry Mumble as well.
    • Some CPAN mirrors will become resource (bandwidth, disk) constrained after suddenly having to carry Mumble as well.
  2. CPAN cannot 'piggyback' other languages.
    • The building blocks or 'plumbing' of CPAN (the basic directory structure, the PAUSE) are a reasonably good match for Perl. I'm not so certain that they are for all the other languages.

Now, on to the hopefully more constructive suggestions.

First and foremost-- I'm not against other language communities having a CPAN. I would love to have such archives. I'm willing to help the other language communities. I'm only against overly simple "let's just slap it on to the CPAN" solutions to the problem. Other languages are not like Perl; they are different, to a smaller or larger degree. Let's allow them their own degree of dignity and careful thought.

Then on to the technical questions, a.k.a. "How did you do it?" Well, people always ask me that and I go speechless... "Errrr, ummm, I kind of pulled all this stuff together and organized it a bit, and put it on an ftp server". After this a brooding silence always falls... "And...?" ... "And what?" ... "That's it?" "That's it."

Well, that's not really it, of course. The above is how CPAN started. How it grew is another story. First, Larry designed Perl to grow by letting it have modules (in other words, namespaces). Then we had a couple of wise men (like Tim Bunce) with the vision to set good module naming guidelines. Finally, we had Andreas König, who single-handedly wrote PAUSE [2], the module submission machinery, where Perl module authors can register, submit, and manage their submissions. This allowed for rapid but still controlled growth of modules. Because of the growth, it finally became too arduous to know what was out there, and luckily Graham Barr's scratch to this itch became large enough to be published as search.cpan.org [3]. Later backPAN [4] was added by Andreas to hold all the old versions of submissions deleted by their authors; this ties back into simple basic things that the master server(s) must have, like good backups. Last but not least, feedback for the modules can be given through the RT ticketing system [5] set up by Jesse Vincent.

[2] http://pause.cpan.org/ (or https://pause.cpan.org/)
[3] http://search.cpan.org/
[4] http://history.perl.org/backpan/
[5] http://rt.cpan.org/

CPAN mirrors [6], then? How did they come about? The original ones, a dozen or so, were easy: I just asked the maintainers of the original ftp sites where I had found the seeds of CPAN whether they might be interested in carrying this slightly bigger amalgamated Perl archive. Well, they foolishly agreed... I have to remind people once again that CPAN was conceived as an FTP archive. Not a website. And it still is that way. search.cpan.org just gives a nice interface. I'm sorry but I'm a dry CS engineer, not a graphic designer. Information, not animation.

[6] http://mirrors.cpan.org/

Oh, back to the CPAN mirrors. After the original ones, we grew slowly for a while, by word of mouth in the Perl community. However, since this was the time before billions of dollars' worth of fiber was dug into the ground, Internet connections were still a bit dodgy and spotty. Therefore I started doing two things: scanning ftp logs for sites that obviously were mirroring CPAN but were not registered mirrors, and looking for sites that were good representatives of their particular top-level domain, especially outside the big seven TLDs. This way I could track down where Perl was used and, by asking those sites to participate, push load away from the master site. Later I also filled in missing countries by going for sites like the sunsites, and other vendor- or publicly funded sites that had a good chance of having good connectivity. Usually I could find a sympathetic soul, oftentimes a system administrator.

Summary of the mirror tirade: I went for sites that liked and/or used Perl. I have no way of knowing off-hand whether they would like Mumble. The mirrors are donating their network and storage capacity and some amount of their administrative time to the Perl community. If we would like to extend that in any way, we would have to ask each of them individually.

You can learn more about CPAN's history from the Perl timeline [7]. Things didn't happen overnight.

[7] http://history.perl.org/PerlTimeline.html

A quite important thing for both the authors and the users is that the language must get the naming scheme of its modules right, or at least reasonably close. Perl's/CPAN's is far from perfect, but at least it was once designed, and it has been enhanced over the years as new needs have appeared. A good naming scheme allows hierarchical browsing, gives good hints for search engines (a good name is effectively a string of uniquely identifying keywords), and coordinates community efforts. Some sort of conflict resolution mechanism in case of competing and identically named implementations is important. Keeping all those guidelines well documented and all these processes public is important. One naming issue I think Perl 5 got wrong is that module namespaces are first-come-first-served: two or more different authors cannot have an identically named module. This may lead to unintentional or intentional squatting, which is not good for the community.
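
To make the two naming points above concrete, here is a minimal sketch in Python, under stated assumptions. It is not how PAUSE is implemented: the function names `build_namespace_tree` and `register`, the in-memory dictionary used as a registry, and the author ids ALICE and BOB are all hypothetical. It only illustrates how '::'-separated names give hierarchical browsing for free, and what first-come-first-served ownership of a name means in practice.

```python
def build_namespace_tree(module_names):
    """Group '::'-separated module names into nested dicts for browsing."""
    tree = {}
    for name in module_names:
        node = tree
        for part in name.split("::"):
            # Descend one level per name component, creating it if needed.
            node = node.setdefault(part, {})
    return tree


def register(registry, module_name, author):
    """First-come-first-served: the first author to claim a name keeps it.

    Returns True if `author` owns the name after the call, False if
    somebody else registered it earlier.
    """
    owner = registry.setdefault(module_name, author)
    return owner == author


modules = ["Net::FTP", "Net::SMTP", "HTML::Mason", "HTML::Parser"]
tree = build_namespace_tree(modules)

registry = {}
first_claim = register(registry, "Net::FTP", "ALICE")   # hypothetical author id
second_claim = register(registry, "Net::FTP", "BOB")    # refused: name is taken
```

The second call is exactly the situation the paragraph warns about: once a name is taken, a later and possibly better implementation has to find another name, so some public conflict resolution process is needed on top of a rule like this.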

When designing your author/module/whatever hierarchy, think scalability. We originally got it wrong by having all authors as subdirectories in one single directory, which quickly became a bottleneck. (The solution to this was simply to 'hash' based on the leading two characters of the user ids.) Think also of several different views of your data: by author, by module, by category, by date, by keywords, and so forth. Don't think that hierarchical views alone will be enough: you will need searching capabilities.
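
The directory 'hashing' mentioned above can be sketched in a few lines of Python. The two-level nesting (first character, then first two characters) follows the layout visible under CPAN's authors/id tree; the function name `author_dir`, the default root, and the decision to reject ids shorter than two characters are assumptions made for this illustration.

```python
def author_dir(user_id, root="authors/id"):
    """Map an author id to a directory bucketed by its leading characters."""
    uid = user_id.strip().upper()
    if len(uid) < 2:
        # Assumption: ids need at least two characters to be bucketed.
        raise ValueError("author id must have at least two characters")
    # e.g. JHI -> authors/id/J/JH/JHI
    return f"{root}/{uid[0]}/{uid[:2]}/{uid}"


path = author_dir("jhi")
```

The point of the design is that no single directory ever has to list every author: each level fans out by at most a few dozen entries, which keeps both ftp listings and mirroring tools comfortable as the number of authors grows.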

Get your license policy clear from day one. No, day minus one. In this day and age it is very important that every piece of software gets clearly marked as to what license it carries. Build your module packaging tools so that they suggest, maybe even demand, that the author picks a license. This way both the users of modules and distributors of software wanting to include the module don't have to keep guessing.
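
As an illustration of a packaging tool that demands a license, here is a small hedged sketch in Python. Everything in it is hypothetical: the metadata is assumed to be a plain dictionary with a "license" key, and `KNOWN_LICENSES`, `MissingLicenseError`, and `check_metadata` are names invented for this example, not part of any existing packaging tool.

```python
# Hypothetical short names; a real archive would publish its own list.
KNOWN_LICENSES = {"perl", "artistic", "gpl", "lgpl", "bsd", "mit"}


class MissingLicenseError(ValueError):
    """Raised when a distribution does not declare a recognized license."""


def check_metadata(meta):
    """Refuse to package a distribution that has no recognized license."""
    license_name = (meta.get("license") or "").strip().lower()
    if not license_name:
        raise MissingLicenseError(
            f"{meta.get('name', '<unnamed>')}: no license declared"
        )
    if license_name not in KNOWN_LICENSES:
        raise MissingLicenseError(
            f"{meta.get('name', '<unnamed>')}: unknown license {license_name!r}"
        )
    return license_name


declared = check_metadata({"name": "Foo-Bar", "license": "Artistic"})
```

Whether the tool should merely warn or actually refuse is a policy choice; the sketch takes the stricter route so that nothing reaches the archive unmarked.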

Very much related to the licensing is of course commercial use: CPAN took the easy and clear policy of no commercial software of any kind; not even share/guilt/donateware would be allowed. We felt that any other policy would be open to nitpicking, or maybe even legal challenges, and as a volunteer ragtag group we had no time or other resources for any of that.

Security? Should you have PGP keys and triply-written-in-blood signatures? Maybe. Currently CPAN has only MD5 checksums-- but so far they have been enough. There are some ongoing projects that enable using PGP keys for verifying the origin of the software; but as always with PKI systems, bootstrapping the web of trust is hard, some say not even worth the trouble.
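
For readers who have not done it by hand, checking a download against a published MD5 checksum looks roughly like the Python sketch below. It uses only the standard `hashlib` module; the function names `md5_of_file` and `verify` are invented for this example, and how the expected checksum is obtained (for instance from a checksum file published next to the upload) is left out as an assumption.

```python
import hashlib


def md5_of_file(path, chunk_size=65536):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_md5):
    """True if the file's digest matches the published checksum."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

A checksum like this tells you the file arrived intact and matches what the archive lists; it says nothing about who uploaded it, which is the gap the PGP projects mentioned above try to close.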

Code quality? Ratings/reviews? Moderation/metamoderation? "Approved" SDKs? These are all hotly debated subjects and will not be addressed here since the CPAN is and will stay an open and free forum, where the authors decide what they upload. Any further selection belongs to different fora.

The scripts that maintain the CPAN are dreadfully simple. They are just simple shell scripts that copy sites A, B, ..., Z to the CPAN master site at ftp.funet.fi, launched from cron. Many of them use Ye Olde Original mirror.pl, some of them are just rsyncs. No magic. I really don't have anything to give away, no magic bags full of powerful CPAN spells. The most complex script I think I have is the script that probes the mirror sites for uptodateness-- and even that is not rocket science, just multiplexing ftp and http downloads and comparing timestamps. If someone wants that code, they can have it.
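
The timestamp comparison at the heart of such a probe can be sketched as follows. This is not the actual CPAN probe script: it is a Python illustration that leaves out the ftp/http fetching entirely and assumes each mirror's timestamp has already been retrieved (or is `None` if the site could not be reached). The function name `classify_mirrors`, the two-day tolerance, and the example.org hostnames are all hypothetical.

```python
from datetime import datetime, timedelta


def classify_mirrors(master_stamp, mirror_stamps, max_lag=timedelta(days=2)):
    """Label each mirror 'fresh', 'stale', or 'unreachable'.

    master_stamp  -- datetime of the newest update on the master site
    mirror_stamps -- dict of site name -> datetime, or None if the probe failed
    max_lag       -- how far behind the master a mirror may be (assumption)
    """
    report = {}
    for site, stamp in mirror_stamps.items():
        if stamp is None:
            report[site] = "unreachable"
        elif master_stamp - stamp <= max_lag:
            report[site] = "fresh"
        else:
            report[site] = "stale"
    return report


master = datetime(2002, 11, 12, 11, 0)
report = classify_mirrors(
    master,
    {
        "ftp.mirror-a.example.org": datetime(2002, 11, 12, 3, 0),
        "ftp.mirror-b.example.org": datetime(2002, 11, 1, 0, 0),
        "ftp.mirror-c.example.org": None,
    },
)
```

Keeping the comparison separate from the downloading means the fetching part can be swapped (ftp, http, rsync) without touching the logic that decides which mirrors need a nudge.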

Andreas has the webserver code for PAUSE available online. That code is slightly more complex than the core CPAN scripts, or the scripts supporting the PAUSE; but even here, the code is there. No tricks up our sleeves.

There is no magic. All it takes is a few people who sit down and first get something running, a rough cut, and then iteratively enhance it. Perhaps the most demanding thing is commitment: someone must keep things running. A slowly decaying and dusty archive is almost worse (and certainly more sad) than no archive at all.

Oook and out.

--

Jarkko Hietaniemi, the CPAN Master Librarian

The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • Don't forget CPAN.pm. That's the other end of the stick that lets people access the stuff inside CPAN in the most incredibly useful and lazy fashion.

    I honestly feel that without CPAN.pm, CPAN as a whole would not be as popular as it is today. Look at something like HTML::Mason, which has a half dozen dependencies. CPAN.pm Just Takes Care Of It[tm].

    -Dom

    P.S. I know that CPAN.pm has many flaws, but it makes up for them by 1) being useful and 2) being installed by default with perl.

    • Oops, a very good point, thanks. But since Andreas himself didn't notice that omission in his proofreading of the article I don't feel that bad about forgetting it... :-) But yes, an automatic network-aware installation mechanism, with the dependency resolving and checksum checking, is a very important step.

      I guess I'll start maintaining the master copy of this article somewhere at CPAN, once the feedback settles down.

    • While CPAN.pm is important (as is search.cpan.org, and the DNS magic Ask has set up) they're really less important than many people think.

      It's not been that long since I managed systems where CPAN.pm was completely unusable, and I had to do it all by hand--FTP, make, and the like. No automatic dependency checking, no fetching, no module lists, nothing. (Plus it was five miles uphill in the snow to the nearest mirror!) The only thing available was the base mirror functionality. And with that... CPAN was phe
      • CPAN.pm is what makes it possible for people who aren't admins, don't know how to become admins, and don't want to be admins to install perl modules.

        There may be some modules that don't install cleanly, or have strange external dependencies that they don't make clear ... but those are special circumstances of the modules. The bottom line is: if I write a pure perl module, and someone wants to use it, they can install it without needing to know anything other than perl. As far as I'm concerned that's the
        • I'm not knocking CPAN clients like CPAN.pm or CPANPLUS. I like 'em both, and I'm glad they're around, but... they're not what's made CPAN a success. They built on the success that CPAN had. Yes, they moved CPAN to a new level, and that was good, but the base mirror net and its infrastructure is what's made this all possible. The rest are (nearly inevitable) conveniences, bells, and whistles--often massively useful, but ultimately optional.
          • Actually, I'd say the one thing that really 'made' CPAN was search.cpan....once it caught on it made CPAN accessible to a much wider audience which is why it is so often confused for the archive itself. CPAN was a success just by existing at a time when you had to ftp to 15 different sites just to get the kit you wanted for your systems. CPAN.pm made it convenient and search.cpan made it navigable and less intimidating for those a lot less familiar with CPAN. WAIT and UWinnipeg had been around for at least

    • Besides CPAN.pm, you should not forget MakeMaker, which has simplified the process of building, installing, testing and packaging a module, portably and consistently. People started to install modules from CPAN because they were easy to install; they started to upload modules to CPAN because it was easy to produce an installable tarball.
      • Agreed. MakeMaker is an important part of this process. In fact, I hold high hopes for python modules now that they have had distutils.py for a year or more. It'll make building an archive for them much easier. I think ruby has something similar as well, but I'm not sure...

        -Dom

  • When I was working for ActiveState [activestate.com], I got to observe other language communities try (and try, and try) to duplicate CPAN. They failed with depressing regularity by making it overcomplicated, or centralizing the work too much. Decentralize! If you want a community-based system, make the community do as much of the work as possible. No bottlenecks. The one centralized thing in PAUSE->CPAN is a mailing list which approves some changes in the naming hierarchy. This usually works ok but even now some peop
    • I actually think that Java got it right when they used domain names as part of the namespace. And really, I agree with Tim B-L when he says that we should be using URIs to identify such things.

      Mind you, I don't want to go down the route of SGML catalog files. That's too much like hard work.

      -Dom

      • I think the problem with using domain names is using domain names... that is, you make an implicit assumption that everybody

        • (a) is in possession of a domain name that they want to use and can use for tagging their software with
        • (b) tagging stuff with domain names may mean tagging stuff with trademarks and other legal stuff which may turn out to be a burden later if the software needs to be renamed for any reason.

        Another way of putting it is that using domainnames works okay-ish for stabl-ish organizati

  • Just thought I would add how much I like the RT system for CPAN. It's of great use to people like me on minority platforms (Solaris ;-).

    Pity the search.cpan.org web interface changed about a week after RT was released. Still the blue and white is nostalgic.

    One thing though - looks like the SSL certificate for rt.cpan.org expired on November 7th.

    Gavin
  • At work we use CPAN.pm to install our own non-CPAN modules. Essentially this involves examining the modules and building compatible index files with a simple script...

    I'd love to have a look at the source code for PAUSE and the other systems. Is it on the CPAN somewhere, or somewhere else online?

  • The Comprehensive TeX Archive Network [ctan.org]. For some reason I've always thought that CTAN preceded CPAN, but I'm not really sure which one was there first. Like CPAN, CTAN was conceived as an FTP-based service and then the web came and people moved on and you know the rest. Since I use both CTAN and CPAN on a regular basis, sometimes I find myself wishing CTAN to be more CPAN-like. The CTAN Catalog is superb, but I think the killer CPAN feature is the ability to browse the documentation in a nice easy to rea

    • CTAN was there first, and we freely acknowledge our debt and inspiration in naming :-)
      CPAN is "only" seven years old, while CTAN is, gee, older than that. I can't off-hand find out how old CTAN is.
      • We (George Greenwade in the US, Sebastian Rahtz in UK, Rainer Schöpf and myself in Germany) built CTAN in 1992. It was "officially" announced at the EuroTeX conference in Aston, 1993.

        CTAN was an effort to bring together the separate ftp servers with TeX material. I'm proud to say that it was triggered by a podium discussion I organized at the EuroTeX conference 1991, in Paris. George came up with the name CTAN, I think I have his email still somewhere in my archives. I got involved since I ran one o

        • Thanks!

          (This history somewhere in the CTAN website would be neat.)

          > CTAN was an effort to bring together the separating ftp servers with TeX material.

          Sounds so very familiar...

          > and had heavily modified mirror.pl from Lee for this purpose. :-)

          If you CTAN guys would have any comments and/or suggestions to give for the "ZCAN" article I would be more than happy to incorporate them.

    • Effort in CTAN package documentation actually goes into TeXlive. The new distribution has most of the documentation in PDF. HTML is not practical, since most of the documentation will demonstrate some layout example.

      Maybe we'll find sometime the volunteers to transport this effort back to CTAN.

      Actually, there's a lot in CPAN we'd like to have in CTAN as well, and never got around to it. Most important, something similar to PAUSE, and commonly agreed upon package structures.

      Sigh, so much to do, so few tim