I am in the process of storing word n-grams for big texts (big meaning text files larger than 3GB). I want 2-word, 3-word and 4-word tuples and their respective occurrence counts.
When processing these texts (on a cluster) I do not have access to any RDBMS. Well, I have SQLite, Berkeley DB, GDBM and probably other similar tools that I am forgetting about.
As you might guess, the main problem with this is populating the database. For each word in the corpus I need to check whether it (together with its neighbourhood) already exists in the database. If it does, I increment the counter. If not, I add a new entry.
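The check-and-increment step described above can be sketched in Perl as follows. This is only a minimal illustration, assuming the counts live in an ordinary hash; the name count_ngrams is hypothetical. In principle the same loop could run against a hash tied to a DBM file, but that is an assumption and is not tested here.

```perl
use strict;
use warnings;

# Count every n-gram of size $n in a list of words.
# $counts is a hash reference (hypothetical interface, not from the post).
sub count_ngrams {
    my ($words, $n, $counts) = @_;
    for my $i (0 .. @$words - $n) {
        # Join the word with its right-hand neighbours to form the key.
        my $key = join ' ', @{$words}[$i .. $i + $n - 1];
        # A missing key starts at 0, so "add" and "increment" are one step.
        $counts->{$key}++;
    }
    return $counts;
}

# Usage: count the 2-word tuples of a tiny text.
my @words   = split ' ', 'a b a b c';
my $bigrams = count_ngrams(\@words, 2, {});
print "$_\t$bigrams->{$_}\n" for sort keys %$bigrams;
```

Calling the function once each for sizes 2, 3 and 4 with separate hashes would give the three tables asked for.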
Given that I am working on a cluster I can easily split the job into chunks, so that each node processes a different part of the text. At the end I just need to merge the per-node databases.
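Merging the per-node results amounts to summing the counts key by key. A minimal sketch, assuming each node's result can be read back as a hash of n-gram to count (merge_counts is a hypothetical name):

```perl
use strict;
use warnings;

# Add one node's partial counts into the running total.
sub merge_counts {
    my ($total, $partial) = @_;
    $total->{$_} += $partial->{$_} for keys %$partial;
    return $total;
}

# Usage: combine the results of two nodes.
my %node1 = ('a b' => 2, 'b c' => 1);
my %node2 = ('a b' => 1, 'c d' => 4);
my $all   = merge_counts({}, \%node1);
merge_counts($all, \%node2);
print "$_\t$all->{$_}\n" for sort keys %$all;
```

One thing to watch with this scheme: an n-gram that straddles a chunk boundary is missed unless adjacent chunks overlap by n-1 words.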
In my experience SQLite seems to be the fastest tool for this task, but I may be wrong.
So, what would you use for that?
(I know that for questions PerlMonks might be better, but I just think that site is completely unusable
Here are some possibilities for you (Score:1)
If it is much bigger than will fit in memory, though, then you should go for a radically different approach. What you should try to do is a mergesort on disk. Odds are you won't even need to write this yourself - create a datas
another approach (Score:1)
1) create a dictionary for the words in the file assigning an integer to every different word
2) map the text file into an array (@word) where every word is replaced by its index.
3) create another array (@offset) containing 0..$#word
4) sort @offset as follows:
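@offset = sort {
    # compare the word sequences starting at offsets $a and $b
    for (my $i = 0;; $i++) {
        return -1 if $a + $i >= @word;   # sequence at $a ran out first
        return  1 if $b + $i >= @word;   # sequence at $b ran out first
        return ($word[$a+$i] <=> $word[$b+$i] or next);
    }
} @offset;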
5) now the offsets into
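Steps 1 to 3 of the list above can be sketched like this. It is a rough illustration only: index_words is a hypothetical helper, and the whole text is assumed to be already split into tokens in memory.

```perl
use strict;
use warnings;

# Steps 1 and 2: give each distinct word an integer id and
# rewrite the text as an array of those ids.
sub index_words {
    my (@tokens) = @_;
    my (%id, @word);
    for my $t (@tokens) {
        unless (exists $id{$t}) {
            my $next = keys %id;   # number of distinct words seen so far
            $id{$t} = $next;
        }
        push @word, $id{$t};
    }
    return (\%id, \@word);
}

# Usage on a tiny text.
my ($id, $word) = index_words(split ' ', 'the cat saw the cat');
my @offset = 0 .. $#$word;   # step 3: one offset per word position
print "@$word\n";
```

Working on integer ids rather than strings keeps the array compact and lets the comparison in step 4 use the numeric operator.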