
All the Perl that's Practical to Extract and Report

  • How big do you expect the final corpus to be? If it will fit in memory, it is honestly fastest to just build an internal hash and dump it. If it won't fit in memory once Perl is using it, then an in-memory database (SQLite can do this) is your best bet.
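The in-memory hash approach is only a few lines. A minimal sketch (the record list and tab-separated output format are illustrative assumptions):

```perl
use strict;
use warnings;

# Build an in-memory frequency hash from a list of records
# (one n-gram per record) and return a reference to it.
sub count_records {
    my (@records) = @_;
    my %count;
    $count{$_}++ for @records;
    return \%count;
}

my $counts = count_records(qw(foo bar foo baz foo bar));

# Dump key/count pairs, tab-separated, in sorted order.
print "$_\t$counts->{$_}\n" for sort keys %$counts;
```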

    If it is much bigger than will fit in memory, though, then you should go for a radically different approach: a mergesort on disk. Odds are you won't even need to write this yourself. Create a dataset where each line is a tuple, run the Unix sort utility on it, then postprocess the sorted dataset in memory, and see how fast it is. Odds are that with your dataset this approach will be fast enough on a single PC for the work you are doing.
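The sort-then-postprocess step can be driven from Perl by piping the file through sort(1): identical tuples come out adjacent, so one sequential pass collapses them into counts. A sketch, assuming one tuple per line (for truly huge data you would print each pair as it completes rather than keep a hash, but the hash keeps the demonstration short):

```perl
use strict;
use warnings;

# Stream a file through the Unix sort utility and collapse runs of
# identical lines into (record, count) pairs. Returns a hashref.
sub count_via_sort {
    my ($file) = @_;
    open my $sorted, '-|', 'sort', $file
        or die "cannot run sort: $!";
    my %count;
    my ($prev, $n);
    while (my $line = <$sorted>) {
        chomp $line;
        if (defined $prev && $line eq $prev) {
            $n++;                      # same tuple as last line: bump the run
        } else {
            $count{$prev} = $n if defined $prev;   # run ended: record it
            ($prev, $n) = ($line, 1);
        }
    }
    $count{$prev} = $n if defined $prev;
    close $sorted;
    return \%count;
}
```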

    If you need to write it yourself, you can beat sort, but only by cheating: make heavy use of the fact that whenever two entries match, you can merge them and add their counts. This reduces your dataset size considerably as the sort proceeds. I've done it, and even been smart about when I had to write to disk and when I didn't. I wouldn't recommend going that way unless you have to.
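The merging trick looks like this inside a single merge step: whenever the two input runs yield the same key, emit one record with the summed count. A sketch over in-memory arrays of [key, count] pairs (on disk each run would be a sorted file, but the logic is identical):

```perl
use strict;
use warnings;

# Merge two sorted runs of [key, count] pairs, summing the counts
# whenever the same key appears in both. This is the step that
# shrinks the dataset as the merge sort proceeds.
sub merge_runs {
    my ($a, $b) = @_;
    my @out;
    my ($i, $j) = (0, 0);
    while ($i < @$a && $j < @$b) {
        my $cmp = $a->[$i][0] cmp $b->[$j][0];
        if    ($cmp < 0) { push @out, $a->[$i++]; }
        elsif ($cmp > 0) { push @out, $b->[$j++]; }
        else {
            # Matching keys: merge into one record with the summed count.
            push @out, [ $a->[$i][0], $a->[$i][1] + $b->[$j][1] ];
            $i++; $j++;
        }
    }
    # One run is exhausted; append the tail of the other.
    push @out, @{$a}[$i .. $#$a], @{$b}[$j .. $#$b];
    return \@out;
}
```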

    If you need it to wind up in a database that supports quick lookups, you can create a B-tree with BerkeleyDB and load it after you have the sorted dataset. The rule here is that putting sorted data into a B-tree is fast; putting unsorted data in is slow, and so is a hash backed by disk.
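Loading the sorted output into an on-disk B-tree can be sketched with the DB_File module that ships with many Perls (BerkeleyDB.pm works the same way); the filename and the pair list standing in for the sorted file are assumptions:

```perl
use strict;
use warnings;
use DB_File;
use Fcntl qw(O_CREAT O_RDWR);

# Tie a hash to an on-disk B-tree. Because the keys arrive already
# sorted, each insert lands on the rightmost leaf, which is the
# fast path described above.
tie my %db, 'DB_File', 'counts.db', O_CREAT | O_RDWR, 0666, $DB_BTREE
    or die "cannot open counts.db: $!";

# These pairs stand in for the sorted (key, count) output file.
my @sorted_pairs = ( ['a cat', 3], ['the cat', 7], ['the dog', 2] );
for my $pair (@sorted_pairs) {
    $db{ $pair->[0] } = $pair->[1];
}
untie %db;
```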

    If you want to dig deeper and understand why, consider a hypothetical 5000 RPM drive with a sustained throughput of, say, 20 MB per second. For a random seek, the drive on average has to spin halfway around. It spins 5000 times per minute, so it can do about 10,000 seeks per minute. With a naive algorithm you fetch each record, update it, and insert it back: two seeks per record, or about 5000 records per minute.

    Suppose your records (which in your case are 2-4 words long) average, say, 40 bytes. The disk can stream 20 MB/s * 60 s = 1200 MB of data in a minute, which is 30 million records. Admittedly a merge sort needs multiple passes to finish sorting; to make the numbers work out, let's say that multiple is 30 here (15 reading passes and 15 writing passes, which with a dataset of this size probably means a poorly implemented merge sort). Even so, the merge sort handles a million records a minute versus 5000.
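The back-of-envelope arithmetic above is easy to check. A sketch that reproduces those numbers (5000 RPM, 20 MB/s, 40-byte records, 30 passes):

```perl
use strict;
use warnings;

my $rpm          = 5000;   # drive rotational speed
my $mb_per_sec   = 20;     # sustained sequential throughput
my $record_bytes = 40;     # average record size
my $passes       = 30;     # 15 reading passes + 15 writing passes

# Random access: each seek waits half a revolution on average, and
# the naive algorithm needs two seeks (fetch + insert) per record.
my $seeks_per_min   = $rpm * 2;             # 10,000 seeks/minute
my $records_per_min = $seeks_per_min / 2;   # 5,000 records/minute

# Sequential access: stream the whole dataset repeatedly.
my $bytes_per_min       = $mb_per_sec * 60 * 1_000_000;    # 1200 MB
my $streamed_recs       = $bytes_per_min / $record_bytes;  # 30 million
my $sorted_recs_per_min = $streamed_recs / $passes;        # 1 million

printf "random: %d records/min   merge sort: %d records/min\n",
    $records_per_min, $sorted_recs_per_min;
```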

    The moral? Once data lives on disk, you do not want to randomly seek to it. You really, really, really do not want to do that.