Journal of Alias (5735)

Monday February 15, 2010
11:26 PM

Finding an extra 10% tarball compression

[ #40184 ]

One of the downsides of Strawberry Perl's move from the InnoSetup .exe installer to the Microsoft native .msi installer was that we had to switch from LZMA compression to the rather less spectacular MSI native compression (which appears to just be deflate or something similar).

Our headline installer went from 17meg to 32meg overnight.

If you were paying (stupidly) close attention to the latest release, you might have noticed that Curtis managed to shrink the installer by 3meg, without changing the compression mechanism at all and while adding slightly more content to the package.

He did it via the curious method of just changing the order in which he added the files to the archive, sorting by file extension instead of sorting by file name.

The grouping (even at a naive level) of similar types of content into the same area of the resulting file improved dictionary efficiency so much that it resulted in nearly a 10% improvement over plain deflate (which is almost as good as switching to bz2).

What would be even more awesome would be combining this change with LZMA as well (which builds dictionaries across much bigger areas of the file).
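To get a feel for the effect, here is a minimal sketch (in Python rather than Perl, purely for illustration) that packs the same set of files into a tar stream in two different orders and compares the compressed sizes. The helper names `build_tar`, `by_name`, `by_extension` and `compressed_size` are made up for this example, and it works on an in-memory dict of file contents rather than on the real installer build.

```python
import io
import os
import tarfile
import zlib


def by_name(name):
    # Plain alphabetical order by path.
    return name


def by_extension(name):
    # Group by extension first, then by path so the order is stable.
    return (os.path.splitext(name)[1].lower(), name)


def build_tar(files, key):
    """Return an uncompressed tar stream with members added in sorted order.

    `files` maps member name -> bytes content.
    """
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name in sorted(files, key=key):
            data = files[name]
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def compressed_size(files, key, compress=lambda data: zlib.compress(data, 9)):
    """Size in bytes of the tar stream after compression (deflate by default)."""
    return len(compress(build_tar(files, key)))
```

Comparing `compressed_size(files, by_name)` with `compressed_size(files, by_extension)` on a real file tree shows how much the ordering matters for that tree. Passing `lzma.compress` from the standard library as the `compress` argument tries the same experiment with LZMA.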

And if you could do it in something less than O(n^2) time, it might also be interesting to test pairs of files directly, to brute-force discover which file order was most efficient for feeding into the compression routine.
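A rough sketch of the pair-testing idea, again in Python for illustration only: measure how many bytes are saved by compressing two files together rather than separately, then build an order greedily by always appending the file that compresses best after the previous one. The names `pair_saving` and `greedy_order` are hypothetical, and this is the naive version that still does on the order of n^2 compressions, which is exactly the cost a real tool would want to avoid.

```python
import zlib


def pair_saving(a, b):
    """Bytes saved by deflating a+b as one stream instead of separately."""
    separate = len(zlib.compress(a, 9)) + len(zlib.compress(b, 9))
    together = len(zlib.compress(a + b, 9))
    return separate - together


def greedy_order(files):
    """Order file names so each file follows the one it pairs best with.

    `files` maps name -> bytes content. Starts from the alphabetically
    first name; ties go to the alphabetically earlier candidate.
    """
    remaining = sorted(files)
    if not remaining:
        return []
    order = [remaining.pop(0)]
    while remaining:
        last = files[order[-1]]
        best = max(remaining, key=lambda n: pair_saving(last, files[n]))
        remaining.remove(best)
        order.append(best)
    return order
```

A greedy nearest-neighbour pass like this is only a heuristic. It does not guarantee the best overall order, but it is a cheap way to see whether content-aware ordering beats sorting by extension.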

Archive::Tar::Optimize anyone?

The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • Windows Installer XML was sorting by the ID I gave the file. Previously, that was a GUID - with results you can imagine. Now, I put the extension, and then a CRC32, into that ID, so it sorts by the extension now.

    The new Strawberry Perl for Windows has been released! Check for it.
  • This reminds me of the Bad Old Days when you'd fret over whether LZH was better than ARJ or ZIP.