  • by salva (841) on 2008.06.13 4:38 (#63354) Journal

    1) create a dictionary of the words in the file, assigning an integer to every distinct word

    2) map the text file into an array (@word) where every word is replaced by its index.

    3) create another array (@offset) containing 0..$#word

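    4) sort @offset as follows:
    @offset = sort {
        for (my $i = 0;; $i++) {
            # a sequence that runs off the end of @word sorts first
            return -1 if $a + $i >= @word;
            return  1 if $b + $i >= @word;
            # the first differing word decides; if equal, compare the next word
            return ($word[$a+$i] <=> $word[$b+$i] or next)
        }
    } @offset;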

    5) now the offsets in @offset are ordered so that identical word sequences end up next to each other. You only have to scan @offset once and count the consecutive repetitions
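
Steps 1 to 3 are described above without code, so here is a minimal sketch of one way to do them. The helper name index_words and the sample text are made up for illustration, and ids are handed out in order of first appearance, which is an assumption (any consistent numbering works for the sort in step 4).

```perl
use strict;
use warnings;

# Steps 1 and 2: give every distinct word an integer id and return the
# text as a list of those ids. index_words is a hypothetical helper name.
sub index_words {
    my ($text) = @_;
    my (%dict, @word);
    my $next_id = 0;
    for my $w (split ' ', $text) {
        # First time we see a word, assign it the next free id.
        $dict{$w} = $next_id++ unless exists $dict{$w};
        push @word, $dict{$w};
    }
    return (\%dict, \@word);
}

my ($dict, $word) = index_words('a b b c d a b c');

# Step 3: one offset per word position.
my @offset = (0 .. $#$word);

print "@$word\n";     # 0 1 1 2 3 0 1 2
print "@offset\n";    # 0 1 2 3 4 5 6 7
```

Comparing integers with <=> in the sort is cheaper than comparing the word strings themselves, which is presumably the reason for the dictionary.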

    For instance:

    1) ...

    2) @word = (a b b c d a b c)

    3) @offset = (0 1 2 3 4 5 6 7)

    4) @offset = (
    0 # => a b b c d a b c
    5 # => a b c
    1 # => b b c d a b c
    6 # => b c
    2 # => b c d a b c
    7 # => c
    3 # => c d a b c
    4 # => d a b c
    )

    5)
    a b b c d a b c
    a b c
    => a b = 2
    b b c d a b c
    => b b = 1
    b c
    b c d a b c
    => b c = 2
    c
    => c = 1
    c d a b c
    => c d = 1
    d a b c
    => d a = 1
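
The scan in step 5 could look roughly like the sketch below. It is only one possible reading of the description: the names compare_suffixes, count_sequences and $n (the sequence length) are hypothetical, the input is the already indexed text, and unlike the listing above it skips positions with fewer than $n words left, so the lone trailing "c" is not reported.

```perl
use strict;
use warnings;

# Compare the word sequences starting at offsets $x and $y (step 4).
sub compare_suffixes {
    my ($word, $x, $y) = @_;
    return 0 if $x == $y;
    for (my $i = 0; ; $i++) {
        return -1 if $x + $i >= @$word;    # shorter sequence sorts first
        return  1 if $y + $i >= @$word;
        my $cmp = $word->[$x + $i] <=> $word->[$y + $i];
        return $cmp if $cmp;               # first differing word decides
    }
}

# Step 5: walk the sorted offsets and count runs of identical
# $n-word sequences. Returns a list of [ \@ids, $count ] pairs.
sub count_sequences {
    my ($word, $n) = @_;
    my @offset = sort { compare_suffixes($word, $a, $b) } 0 .. $#$word;

    my @result;
    for my $o (@offset) {
        next if $o + $n > @$word;          # fewer than $n words left
        my @seq = @{$word}[$o .. $o + $n - 1];
        if (@result && "@seq" eq "@{ $result[-1][0] }") {
            $result[-1][1]++;              # same sequence as the previous one
        } else {
            push @result, [ \@seq, 1 ];
        }
    }
    return @result;
}

my @name = qw(a b c d);                    # id => word, for printing only
for my $r (count_sequences([0, 1, 1, 2, 3, 0, 1, 2], 2)) {
    my ($seq, $count) = @$r;
    print join(' ', @name[@$seq]), " = $count\n";
}
# a b = 2
# b b = 1
# b c = 2
# c d = 1
# d a = 1
```

Because equal sequences are adjacent after the sort, a single pass that remembers only the previous sequence is enough; no hash of all sequences is needed.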