
All the Perl that's Practical to Extract and Report


Journal of ambs (3914)

Thursday June 12, 2008
11:57 AM

Storing word-grams

[ #36674 ]

I am in the process of storing word-grams for big texts (where big means text files larger than 3 GB). I want 2-word, 3-word, and 4-word tuples, together with their occurrence counts.

When processing these texts (on a cluster) I do not have access to any RDBMS. Well, I have SQLite, Berkeley DB, GDBM, and probably other similar tools that I am forgetting about.

As you might guess, the main problem with this is populating the database. For each word in the corpus I need to check whether it (together with its neighbourhood) already exists in the database. If it does, I increment the counter; if not, I add a new entry.
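A minimal sketch of that populate loop with DBD::SQLite, assuming a bigram table keyed on the word pair (the table and column names are made up here); an "INSERT OR IGNORE" followed by an UPDATE replaces the explicit "check, then insert or increment" round trip:

```perl
use strict;
use warnings;
use DBI;    # requires DBD::SQLite

my $dbh = DBI->connect("dbi:SQLite:dbname=:memory:", "", "",
                       { RaiseError => 1, AutoCommit => 0 });
$dbh->do("CREATE TABLE bigram (w1 TEXT, w2 TEXT, n INTEGER NOT NULL,
                               PRIMARY KEY (w1, w2))");

# Create the row with count 0 if it is new, then increment unconditionally.
my $ins = $dbh->prepare("INSERT OR IGNORE INTO bigram VALUES (?, ?, 0)");
my $upd = $dbh->prepare("UPDATE bigram SET n = n + 1 WHERE w1 = ? AND w2 = ?");

my @words = qw(a rose is a rose);
for my $i (0 .. $#words - 1) {
    $ins->execute($words[$i], $words[$i + 1]);
    $upd->execute($words[$i], $words[$i + 1]);
}
$dbh->commit;    # one transaction per chunk, not per word -- this is the big win
```

Batching many updates into a single transaction is usually what makes or breaks SQLite's speed for this kind of load.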

Given that I am working on a cluster, I can easily split the job into chunks, so that each node processes a different part of the text. At the end I just need to merge the resulting databases.
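The final merge could use SQLite's ATTACH to add each node's counts into a master file. The per-node file names, the `bigram(w1, w2, n)` schema, and the helper below are all assumptions for illustration (the sketch fakes two node files so it runs end to end):

```perl
use strict;
use warnings;
use DBI;    # requires DBD::SQLite
use File::Temp qw(tempdir);

my $dir = tempdir(CLEANUP => 1);

# Fake the per-node output: one SQLite file per node, same schema everywhere.
sub make_part {
    my ($file, %counts) = @_;
    my $dbh = DBI->connect("dbi:SQLite:dbname=$file", "", "", { RaiseError => 1 });
    $dbh->do("CREATE TABLE bigram (w1 TEXT, w2 TEXT, n INTEGER,
                                   PRIMARY KEY (w1, w2))");
    my $sth = $dbh->prepare("INSERT INTO bigram VALUES (?, ?, ?)");
    $sth->execute(split(/ /, $_), $counts{$_}) for keys %counts;
    $dbh->disconnect;
}
make_part("$dir/node-1.db", "a rose" => 2, "rose is" => 1);
make_part("$dir/node-2.db", "a rose" => 1, "is a"    => 1);

my $master = DBI->connect("dbi:SQLite:dbname=$dir/master.db", "", "",
                          { RaiseError => 1 });
$master->do("CREATE TABLE bigram (w1 TEXT, w2 TEXT, n INTEGER,
                                  PRIMARY KEY (w1, w2))");

for my $part (glob "$dir/node-*.db") {
    $master->do("ATTACH DATABASE '$part' AS part");
    # Seed unseen tuples with count 0, then add the partial counts in.
    $master->do("INSERT OR IGNORE INTO bigram SELECT w1, w2, 0 FROM part.bigram");
    $master->do(q{UPDATE bigram
                  SET n = n + (SELECT p.n FROM part.bigram p
                               WHERE p.w1 = bigram.w1 AND p.w2 = bigram.w2)
                  WHERE EXISTS (SELECT 1 FROM part.bigram p
                                WHERE p.w1 = bigram.w1 AND p.w2 = bigram.w2)});
    $master->do("DETACH DATABASE part");
}
```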

In my experiments, SQLite seems to be the fastest tool for this task. But I may be wrong.

So, what would you use for that?

(I know that PerlMonks might be better for questions, but I just think that site is completely unusable :( )

The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • How big do you expect the final corpus to be? If it will fit in memory, it is honestly fastest to just build a plain Perl hash and dump it at the end. If it won't fit in memory with Perl using it, then an in-memory database (SQLite can do this) is your best bet.

    If it is much bigger than will fit in memory, though, then you should go for a radically different approach. What you should try is a mergesort on disk. Odds are you won't even need to write this yourself - create a datas
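The in-memory hash approach from this comment might be sketched like so (the helper name and the space-joined key are my own choices):

```perl
use strict;
use warnings;

# Slide a window of N words over the text and bump one hash entry per
# position. Works as long as the final table fits in RAM.
sub count_ngrams {
    my ($n, @words) = @_;
    my %count;
    for my $i (0 .. $#words - $n + 1) {
        my $key = join " ", @words[$i .. $i + $n - 1];
        $count{$key}++;
    }
    return \%count;
}

my $bigrams = count_ngrams(2, qw(a rose is a rose));
# Dumping is then a single pass over the hash:
# print "$_\t$bigrams->{$_}\n" for sort keys %$bigrams;
```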
  • 1) create a dictionary for the words in the file assigning an integer to every different word

    2) map the text file into an array (@word) where every word is replaced by its index.

    3) create another array (@offset) containing 0..$#word

    4) sort @offset as follows:
    @offset = sort {
        for (my $i = 0; ; $i++) {
            return -1 if $a + $i >= @word;
            return  1 if $b + $i >= @word;
            return ($word[$a + $i] <=> $word[$b + $i] or next);
        }
    } @offset;

    5) now the offsets into
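Steps 1-4 can be pulled together into something runnable. The comparison in step 4 appears to have lost its `<=>` to the HTML; since step 5 is cut off above, the closing comment about reading counts from runs of equal prefixes in the sorted offsets is my guess at the intent:

```perl
use strict;
use warnings;

my @text = qw(a rose is a rose);

# 1) dictionary: word -> integer, and
# 2) the text as an array of integer codes
my (%dict, @word);
for my $w (@text) {
    $dict{$w} = keys %dict unless exists $dict{$w};
    push @word, $dict{$w};
}

# 3) offsets 0 .. $#word
my @offset = 0 .. $#word;

# 4) sort offsets by the word sequence (suffix) starting at each one;
#    a suffix that runs out first sorts earlier
@offset = sort {
    my $r = 0;
    for (my $i = 0; ; $i++) {
        $r = -1, last if $a + $i >= @word;
        $r =  1, last if $b + $i >= @word;
        $r = $word[$a + $i] <=> $word[$b + $i];
        last if $r;
    }
    $r;
} @offset;

# 5) (presumably) n-gram counts now fall out of @offset for free: offsets
#    sharing the same first n codes sit in one contiguous run, so the run
#    length is that n-gram's occurrence count -- no hash of tuples needed.
```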