
All the Perl that's Practical to Extract and Report


acme (189)

acme
  (email not shown publicly)
http://www.astray.com/

Leon Brocard (aka acme) is an orange-loving Perl eurohacker with many varied contributions to the Perl community, including the GraphViz module on the CPAN. YAPC::Europe was all his fault. He is still looking for a Perl Monger group he can start which begins with the letter 'D'.

Journal of acme (189)

Friday March 07, 2003
05:52 AM

CPAN Size

[ #10933 ]
CPAN is 1.5G. That's how much disk space you'll need if you host a CPAN mirror. This morning I uncompressed every .tar.gz and .zip on CPAN for a laugh. The resulting uncompressed mass of directories takes up 1.7G. Looks like compression isn't helping a great deal, and tar and zip are mostly only good for packaging purposes.
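
As a quick check on those two figures, here is a minimal arithmetic sketch in Python. The variable names are illustrative only; the sizes are the rounded ones quoted above, so the percentages are approximate.

```python
# Sizes quoted in the post, in gigabytes (rounded figures, so results are approximate).
compressed_gb = 1.5
uncompressed_gb = 1.7

# How much larger the unpacked tree is than the mirror itself.
growth = (uncompressed_gb - compressed_gb) / compressed_gb
# How much space compression saves relative to the unpacked tree.
saving = (uncompressed_gb - compressed_gb) / uncompressed_gb

print(f"unpacked is {growth:.0%} larger; compression saves {saving:.0%}")
```

Measured against the mirror, the unpacked tree is about 13% larger; measured against the unpacked tree, compression saves about 12%.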

Mmm, more stats...

  • I'm amazed by your finding that an uncompressed CPAN is only 13% larger than the compressed version. I would have thought that anything text-based like a Perl module should compress very well, even with ZIP or tar.gz.

    I wonder what is taking up all the space and won't compress?

    I know in the cygwin [cygwin.com] world bzip2 [redhat.com] is very popular, and I've wondered if going forward it would be useful for CPAN or future CPAN to support it as well, to squeeze a little more compression in.

    --
    -- "It's not magic, it's work..."
    • thousands of tiny little files...

    • Elaine's right, and bz2 won't help here. Maybe you'd squeeze things down by another 1%. Maybe. There's a lot of overhead to small files: modern compression programs work better the larger their input, and perl modules just aren't that big. There's also a lot of incompressible overhead in the tar file structure information.

      If you wanted to compress perl modules better, you'd want a denser file packing scheme than tar, and build a compression scheme that was prepopulated with a lot of the common perl substri
      • While I agree that bz2 or some other compressor isn't going to fix the problem, I do find that on a tar of text files, it's quite a bit more than 1% more efficient than gzip.

        I can't comment on replacing the tar structure, but I've seen comments on its weaknesses in other places too.

        I'm still amazed at how little compression there is in CPAN. The latest module I've uploaded, for example, shrank from 90kb to 24kb with gzip (22kb with bz2). What is in there that doesn't compress?

        --
        -- "It's not magic, it's work..."
        • Look at the size of many of the files on CPAN. I don't have the space to slurp the whole thing down for analysis, but a quick scan through shows that a huge number of the archives are tiny: less than 15K, and lots of them less than 10K. At that size compressors just don't have enough to work with, so it doesn't matter which compressor you use. There isn't enough there to compress at all usefully.

          It's not that the data on CPAN is oddly uncompressible. It's that t
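
The two effects described in this thread (small inputs compress poorly, and tar adds fixed per-file overhead) can be illustrated with a rough Python sketch. Everything here is an assumption made for illustration: the synthetic Perl-like text, the 1 KB piece size, and the file names are invented, and nothing is measured from CPAN itself. The tar format stores a 512-byte header per member and pads each member's data to a multiple of 512 bytes.

```python
import gzip
import io
import tarfile

# Hypothetical stand-in for module source: many similar but distinct lines.
lines = [
    f"sub get_field_{i} {{ my ($self) = @_; return $self->{{field_{i}}}; }}\n"
    for i in range(2000)
]
source = "".join(lines).encode("ascii")

# (1) Compress everything as one stream versus as separate 1 KB pieces.
chunk_size = 1024  # assumed piece size, chosen only for illustration
chunks = [source[i:i + chunk_size] for i in range(0, len(source), chunk_size)]
whole_gz = len(gzip.compress(source))
pieces_gz = sum(len(gzip.compress(c)) for c in chunks)

# (2) Pack the same pieces as separate files in an uncompressed tar archive.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    for n, data in enumerate(chunks):
        info = tarfile.TarInfo(name=f"lib/Piece{n}.pm")  # invented file names
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
tar_size = len(buf.getvalue())

print(f"raw source:           {len(source)} bytes")
print(f"gzip, one stream:     {whole_gz} bytes")
print(f"gzip, piece by piece: {pieces_gz} bytes")
print(f"tar of the pieces:    {tar_size} bytes")
```

Compressing piece by piece costs more in total than compressing one stream, because each piece pays its own gzip header and trailer and cannot reuse patterns seen in earlier pieces. The uncompressed tar is larger than the raw data by at least one 512-byte header per file.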
  • To be fair... (Score:3, Informative)

    by belg4mit (967) on 2003.03.07 12:19 (#17797) Homepage Journal
    A module's source is not the size of a module.
    In particular, the size of a moderately complicated binary (XS module) is significantly larger than the source.

    Also, what if you only take the latest (or latest two) versions of any given module? A lot of authors haven't heard that BackPAN exists, and that the Master Librarian would like to see things under 700 MB.
    --
    Were that I say, pancakes?
  • Top 10:
    124516  G/GR/GRAHAMC
    82364   J/JH/JHI
    69932   G/GS/GSAR
    63344   C/CN/CNANDOR
    31616   N/NI/NI-S
    31588   I/IL/ILYAZ
    28928   K/KR/KRISHPL
    25244   T/TI/TIMB
    24788   L/LD/LDS
    20228   B/BI/BIRNEY
    All of the above, though, have perl distributions (or documentation distributions).