Stories, comments, journals, and other submissions on use Perl; are Copyright 1998-2006, their respective owners.
another approach (Score:1)
1) build a dictionary of the words in the file, assigning a distinct integer to every distinct word
2) map the text file into an array (@word) in which every word is replaced by its integer from the dictionary
3) create another array (@offset) containing 0..$#word
4) sort @offset as follows:
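@offset = sort {
    # compare the word sequences that start at offsets $a and $b
    for (my $i = 0; ; $i++) {
        return -1 if $a + $i >= @word;   # sequence at $a ran out first
        return  1 if $b + $i >= @word;   # sequence at $b ran out first
        my $cmp = $word[$a + $i] <=> $word[$b + $i];
        return $cmp if $cmp;             # first differing word decides
    }
} @offset;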
5) the offsets in @offset are now sorted so that positions in @word where the same word sequence begins sit next to each other. A single pass over @offset, comparing each entry with the one before it, is enough to count how often each sequence repeats
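Steps 1 to 3 could look something like the minimal sketch below. The sub name index_words, numbering the words in order of first appearance, and splitting on whitespace are all assumptions made for illustration; any consistent numbering would do, it only changes the order in which the groups come out of the sort.

```perl
use strict;
use warnings;

# Steps 1 and 2: give every distinct word an integer id (numbered in
# order of first appearance, an arbitrary choice) and return the text
# as a list of ids. index_words is a made-up name for this sketch.
sub index_words {
    my @tokens = @_;
    my (%id, @word);
    my $next = 0;
    for my $t (@tokens) {
        $id{$t} = $next++ unless exists $id{$t};
        push @word, $id{$t};
    }
    return (\%id, \@word);
}

# Splitting on whitespace is an assumption; real text may need
# lowercasing and punctuation handling first.
my $text = "a b b c d a b c";
my ($id, $word_ref) = index_words(split ' ', $text);
my @word = @$word_ref;

# Step 3: one offset per position in @word.
my @offset = (0 .. $#word);

print "@word\n";      # prints: 0 1 1 2 3 0 1 2
print "@offset\n";    # prints: 0 1 2 3 4 5 6 7
```

Working on integers rather than strings keeps the comparison in step 4 to a cheap numeric <=> per word.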
For instance:
1) ...
2) @word = (a b b c d a b c)
3) @offset = (0 1 2 3 4 5 6 7)
4) @offset = (
    0 # => a b b c d a b c
    5 # => a b c
    1 # => b b c d a b c
    6 # => b c
    2 # => b c d a b c
    7 # => c
    3 # => c d a b c
    4 # => d a b c
)
5) scan @offset, comparing the first two words of neighbouring entries:
    a b b c d a b c
    a b c
        => a b = 2
    b b c d a b c
        => b b = 1
    b c
    b c d a b c
        => b c = 2
    c
        => c = 1
    c d a b c
        => c d = 1
    d a b c
        => d a = 1
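The example can be replayed with a short script. This is only a sketch: compare_sequences and count_runs are hypothetical names, the ids a=0, b=1, c=2, d=3 are assumed, and the comparison is moved into an ordinary sub so that it does not depend on return inside a sort block. A sequence cut short by the end of the text (the lone c at offset 7) is kept as its own entry, as in the listing above.

```perl
use strict;
use warnings;

# The example text as integer ids (assumed numbering: a=0, b=1, c=2, d=3).
my @name = qw(a b c d);
my @word = (0, 1, 1, 2, 3, 0, 1, 2);

# Step 4: compare the word sequences starting at offsets $x and $y.
# A sequence that runs off the end of @word sorts first.
sub compare_sequences {
    my ($word, $x, $y) = @_;
    for (my $i = 0; ; $i++) {
        return -1 if $x + $i >= @$word;
        return  1 if $y + $i >= @$word;
        my $cmp = $word->[$x + $i] <=> $word->[$y + $i];
        return $cmp if $cmp;
    }
}

my @offset = sort { compare_sequences(\@word, $a, $b) } 0 .. $#word;

# Step 5: offsets whose sequences start with the same $n words are now
# adjacent, so one pass over @offset counts each run.
sub count_runs {
    my ($word, $offset, $n) = @_;
    my @runs;                                  # list of [sequence, count]
    for my $o (@$offset) {
        my $last = $o + $n - 1;
        $last = $#$word if $last > $#$word;    # cut short by end of text
        my $key = join ' ', @{$word}[$o .. $last];
        if (@runs && $runs[-1][0] eq $key) {
            $runs[-1][1]++;
        } else {
            push @runs, [$key, 1];
        }
    }
    return @runs;
}

for my $run (count_runs(\@word, \@offset, 2)) {
    my ($key, $count) = @$run;
    my $seq = join ' ', map { $name[$_] } split ' ', $key;
    print "$seq = $count\n";
}
```

With $n set to 2 this prints the same six lines as step 5 of the example, from a b = 2 down to d a = 1.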