NOTE: use Perl; is on undef hiatus. You can read content, but you can't post it. More info will be forthcoming forthcomingly.

All the Perl that's Practical to Extract and Report

  • How big do you expect the final corpus to be? If it fits in memory, it is honestly fastest to just build an internal hash, fill it, and dump it. If it won't fit in memory with Perl using it, then an in-memory database (SQLite can do this) is your best bet.

    If it is much bigger than will fit in memory, though, then you should go for a radically different approach: a mergesort on disk. Odds are you won't even need to write this yourself - create a datas
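    A minimal sketch of the in-memory hash approach described above, reading from STDIN (the word-splitting regex and output format are my own choices, not the commenter's):

    ```perl
    #!/usr/bin/perl
    use strict;
    use warnings;

    # Count word frequencies with a hash; assumes the corpus fits in RAM.
    my %count;
    while (my $line = <STDIN>) {
        $count{lc $1}++ while $line =~ /(\w+)/g;
    }

    # Dump the counts, most frequent first, ties broken alphabetically.
    for my $word (sort { $count{$b} <=> $count{$a} or $a cmp $b } keys %count) {
        print "$count{$word}\t$word\n";
    }
    ```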
  • 1) create a dictionary for the words in the file assigning an integer to every different word

    2) map the text file into an array (@word) where every word is replaced by its index.

    3) create another array (@offset) containing 0..$#word

    4) sort @offset as follows:
    @offset = sort {
        my $cmp = 0;
        for (my $i = 0; ; $i++) {
            if    ($a + $i >= @word) { $cmp = -1; last }
            elsif ($b + $i >= @word) { $cmp =  1; last }
            $cmp = $word[$a + $i] <=> $word[$b + $i];
            last if $cmp;
        }
        $cmp;
    } @offset;

    5) now the offsets into
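    Put together, the five steps above might look like the sketch below. The sample text is a stand-in, and printing the sorted offsets is just one way to use the result; adjacent entries of the sorted @offset share their longest common prefixes, which is what makes repeated phrases easy to find:

    ```perl
    #!/usr/bin/perl
    use strict;
    use warnings;

    # 1) dictionary: assign an integer id to every distinct word
    # 2) map the text into @word, replacing each word by its id
    my (%id, @word);
    for my $w (split ' ', "to be or not to be") {
        if (!exists $id{$w}) {
            my $next = keys %id;    # next free id
            $id{$w} = $next;
        }
        push @word, $id{$w};
    }

    # 3) one offset per position in the text
    my @offset = 0 .. $#word;

    # 4) sort offsets by the suffix of @word starting at each one;
    #    a suffix that runs out first sorts earlier
    @offset = sort {
        my $cmp = 0;
        for (my $i = 0; ; $i++) {
            if    ($a + $i >= @word) { $cmp = -1; last }
            elsif ($b + $i >= @word) { $cmp =  1; last }
            $cmp = $word[$a + $i] <=> $word[$b + $i];
            last if $cmp;
        }
        $cmp;
    } @offset;

    # 5) adjacent offsets now group repeated phrases together
    print join(' ', @offset), "\n";    # prints "4 0 5 1 2 3"
    ```

    For "to be or not to be" the two suffixes beginning with "to be" (offsets 4 and 0) land next to each other, as do the two beginning with "be" (offsets 5 and 1).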