All the Perl that's Practical to Extract and Report

use Perl

Saturday January 18, 2003
06:45 PM

The Zen of File Names

[ #10061 ]

A lot of people send files to me, and I send a lot of files to other people. These files, no matter which way they go, probably end up sitting on a hard drive until the person who has them forgets who sent them, what they contain, and why they should keep them around.

I have several files named "article.pod" lying around since potential authors for The Perl Review send me their articles. A couple of other filenames are popular too ("tpr.pod", and so on).

People tend to choose file names that are meaningful to them and that distinguish the file from all of the other files they have in their field of vision at any particular time (for instance, in a single directory listing). These files often keep the same names once they leave the creator and end up in a new context where their special name most likely loses all of its meaning, or takes on a new and unintended meaning.

Consider a document that has been sitting on my desktop for several months---"randal_tagmemics.pdf". Every time I look at it I think of Randal Schwartz first, then remember it is actually Allison Randal. Not only does Perl have two Randals, but they both live in Portland.

The documents that I create have similar problems. When I send an article in to a magazine, I name it with my initials, the magazine's name, and the date of publication. That seems to work fine---no one has complained---but I forget what I wrote when, especially when I write the article a couple of months before publication. On my side, I end up naming it after something in the proposed title.

Some people religiously stick to a particular directory structure (and the number of standard directory structures is large enough to accommodate everyone's preferences, apparently), and stick files in the right directory so that the directory path provides the context for the file. Other people assume that the directory is meaningless, or might lose its meaning, so they put as much information in the filename as possible. The same file would appear as either client/netscape/index.html or client-name-top-level-index-for-netscape.html in those approaches, respectively.

One operating system makes you go the other way because you can only choose 8 letters for the name. How would I name "randal_tagmemics.pdf" on DOS? I could choose "tagmemic.pdf", but what if I want to download related documents? Those might end up with the same name because the namespace is so small, or with completely unintelligible names like "ARTGMC10.pdf". I get to work with a lot of Microsoft Word documents created by a big government agency, and almost all of them suffer from this. A big Perl training company I know does this, so that something like AP491SE.pdf means "Advanced Perl version 4.9.1 Short Course with exercises" just so it downloads nicely onto Windows, which is what most of the instructors get to use with the provided projectors. Another client has a lot of files that look like "SALESABC.PPT" where "ABC" is some incantation of a company name, usually related to its stock symbol somehow, but in a special way unknown to those outside the organization. They send them to other people without changing the name, regardless.
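
To see how quickly an eight-character namespace fills up, here is a rough Python sketch. The helper to_8_3 is hypothetical and deliberately naive: it just uppercases, drops anything outside ASCII letters, digits, and underscore, and truncates. It is not the algorithm any real operating system uses to shorten names.

```python
import os


def to_8_3(filename):
    """Naively squeeze a file name into an 8.3 shape (hypothetical helper)."""
    stem, ext = os.path.splitext(filename)

    def keep(text):
        # Uppercase, then keep only ASCII letters, digits, and underscore
        return "".join(
            c for c in text.upper()
            if c.isascii() and (c.isalnum() or c == "_")
        )

    short_stem = keep(stem)[:8]   # at most 8 characters for the name
    short_ext = keep(ext)[:3]     # at most 3 characters for the extension
    return f"{short_stem}.{short_ext}" if short_ext else short_stem


# Two different documents that share a long prefix
names = ["randal_tagmemics.pdf", "randal_tagmemics_notes.pdf"]
short_names = [to_8_3(name) for name in names]
print(short_names)
```

Both names come out identical after truncation, which is the collision described above. Anything that avoids it has to spend some of those eight characters on a code that only its creator understands.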

Some applications get around these file name limitations with "Get Document Info" features that let the creator tag the file with a lot of meta-information, but to see it you not only have to know which program to use, you actually have to have that program. Nobody has a standard way to do this, and it is probably on purpose (as what economists call "lock-in").

One of the CGI frequently asked questions deals with naming file downloads correctly. Most browsers pick the file name portion of the URI, say, "download.cgi", rather than anything the HTTP response might say. People end up with a bunch of files starting with "download.cgi", or they end up with one file that has the contents of the last download.
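
The usual answer to that FAQ is to send a Content-Disposition header that suggests a file name. Here is a minimal Python sketch of the headers such a script might print. The function name download_headers and the character scrubbing are my own assumptions for illustration, and whether a given browser honors the suggestion is up to the browser.

```python
def download_headers(filename, content_type="application/octet-stream"):
    """Build CGI response headers that suggest a file name (hypothetical helper)."""
    # Scrub characters that would break the quoted string or the header line
    safe = (
        filename.replace("\\", "_")
        .replace('"', "_")
        .replace("\r", "")
        .replace("\n", "")
    )
    return (
        f"Content-Type: {content_type}\r\n"
        f'Content-Disposition: attachment; filename="{safe}"\r\n'
        "\r\n"  # a blank line ends the header block
    )


headers = download_headers("randal_tagmemics.pdf", "application/pdf")
print(headers, end="")
```

A script at "download.cgi" would print these headers and then the file contents, so the name travels with the response instead of coming from the URI.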

Many things try to restrict files based on their extension. In a graphical user interface most people want to click a file and have the right thing happen. Most of these extensions have 3 letters (17,576 combinations, 89 of which appear in my web server's mime.types file), some add digits (46,656 combinations), and unixy things might have one or two characters (foo.Z, foo.gz, ls.1, libpng.a, perl.so). Saner things do not have extension limits (.html, .plist). Some combine two or more extensions (.tgz). Some things have more than one extension (.mov, .moov, .htm, .html) that mean the same thing. One must attain guru status to know which things go with which programs beyond the doc, ppt, xls triumvirate. And, as if that is not bad enough, Windows and Mac OS X will hide the extensions if you tell them to.
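
The combination counts above are just 26 letters, or 26 letters plus 10 digits, raised to the third power. A short Python check follows, along with a look at how one standard library tool splits stacked extensions. The extensions helper is a name I made up for this sketch.

```python
import string
from pathlib import PurePosixPath

letters = len(string.ascii_lowercase)               # 26
letters_and_digits = letters + len(string.digits)   # 36

three_letter = letters ** 3            # three-letter extensions
three_alnum = letters_and_digits ** 3  # three characters, letters or digits

print(f"{three_letter:,} {three_alnum:,}")


def extensions(filename):
    """Return every dot-suffix of a file name, in order (hypothetical helper)."""
    return PurePosixPath(filename).suffixes


for name in ("foo.tar.gz", "ls.1", "index.html"):
    print(name, extensions(name))
```

Nothing in the name itself says whether "foo.tar.gz" has one extension or two, so every program ends up with its own opinion.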

Java, and maybe things that came before it, had the dim idea of prefixing everything with internet domains. Java class names end up being really long, like 'com.example.local.string', or some such thing. Most of the Mac OS X preferences files from Apple start with 'com.apple.', like 'com.apple.iTunes.plist', for instance. This is even worse than it looks for those who control a domain like perl.org. Who controls who gets to use that domain as part of a class name (like in the PerlJava thingys)? How does the class name relate to a website with the host name that a file uses? What do you do if you do not have a stable domain name, or you lose your domain name? How does backward compatibility suffer? Are there legal implications?
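
The convention itself is mechanical: reverse the labels of a domain name and append whatever you like. Here is a small Python sketch, where reverse_domain_name is a hypothetical helper and "Example" is a made-up component, not a real class.

```python
def reverse_domain_name(domain, *parts):
    """Build a reverse-domain style name (hypothetical helper).

    Only reverses the labels of the domain and appends the extra parts.
    It does not check who owns the domain, which is the whole problem.
    """
    labels = [label for label in domain.lower().split(".") if label]
    return ".".join(list(reversed(labels)) + list(parts))


print(reverse_domain_name("apple.com", "iTunes", "plist"))
print(reverse_domain_name("perl.org", "Example"))
```

Anyone can call this with any domain, so the name carries no proof that the domain's owner had anything to do with it.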

In unixish file systems, a chunk of data can have more than one file name---that is why we unlink files rather than remove them. We just remove the name-to-data association, and the data disappears when there are no links to it. And then, some names do not even belong to a chunk of data---they are just nicknames for other names. Most operating systems seem to support some idea of a symlink or alias, which sticks around even when whatever we meant it to reference has disappeared.
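
Here is a Python sketch of that behavior, assuming a unixish system and a file system that allows hard links and symlinks. The file names are just borrowed from earlier in this entry.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as workdir:
    first = os.path.join(workdir, "article.pod")
    second = os.path.join(workdir, "tpr.pod")
    nickname = os.path.join(workdir, "latest.pod")

    with open(first, "w") as handle:
        handle.write("=head1 NAME\n")

    os.link(first, second)        # a second name for the same chunk of data
    links_with_two_names = os.stat(first).st_nlink

    os.symlink(first, nickname)   # a nickname for a name, not for the data

    os.unlink(first)              # removes one name; the data stays
    with open(second) as handle:
        surviving_text = handle.read()
    links_after_unlink = os.stat(second).st_nlink

    # The symlink is still there, but the name it points to is gone
    symlink_still_listed = os.path.islink(nickname)
    symlink_target_exists = os.path.exists(nickname)

print(links_with_two_names, links_after_unlink)
print(symlink_still_listed, symlink_target_exists)
```

The data is reachable through "tpr.pod" after "article.pod" is unlinked, while the symlink dangles because it only ever knew the other name.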

Some things end up nameless after the disk, which they thought was their home, is formatted. The contents of the files do not disappear, but a record of their existence does. Even mere mortals can recover these Files Without a Directory with off-the-shelf software.

Some files remain hidden on purpose. In unix, filenames that begin with a dot (.) do not show up in listings or globs unless we specifically tell the command to pay attention to them. On Mac OS X, a unix/Mac hybrid, not only are the hidden files hidden in the Aqua user interface, but all sorts of other files are hidden too---the ones that unix weenies like to play with.
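
Python's glob module follows the same dot-file convention as the shell, so it makes a handy way to watch it happen. This is a sketch in a throwaway directory, with file names of my own choosing.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as workdir:
    for name in (".profile", "article.pod"):
        open(os.path.join(workdir, name), "w").close()

    # A plain "*" skips names that begin with a dot
    default_glob = sorted(
        os.path.basename(path) for path in glob.glob(os.path.join(workdir, "*"))
    )
    # The dot has to be asked for explicitly
    dot_glob = sorted(
        os.path.basename(path) for path in glob.glob(os.path.join(workdir, ".*"))
    )
    # A raw directory listing does not hide anything
    everything = sorted(os.listdir(workdir))

print(default_glob)
print(dot_glob)
print(everything)
```

The hiding is purely a convention of the tools doing the listing. The directory itself stores ".profile" like any other name.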