I need to find the longest common substring between one string and a list of others. This would be very easy to do naively - find all the substrings of the first string, then check via index() on the others. However this is O(mnk) which is bad, m'kay. So I managed to find an algorithm that uses a matrix to find the longest substring in O(nk). [ m = number of substrings in first string, n = number of strings to look in, k = avg length of strings ].
Unfortunately the algorithm requires looking at each individual character in the strings one by one. Sadly this is very slow in perl, despite perl being perfect for most string matching needs. I hope this is fixed in perl 6. Anyway, I decided the best way to go would be to prototype in Inline::C and then migrate to XS. It turns out I really hate programming in C. All the array indexes being dangerous and lack of being able to print datastructures easily, and the malloc/free nightmare, and ugly for() loops.
Despite all that, I came up with this, which appears to work pretty well:
#!/usr/bin/perl -w
use Inline C;
use Benchmark;
use strict;
my $string = "hello";
my @to_match = qw(aloha helen noddy);
timethese(1_000_000, {
match => sub {
my $max_substr = do_match($string, \@to_match);
if ($max_substr ne 'hel') { die "Bad match: $max_substr"; }
#else { print "good match\n" }
},
});
#print "Found: $max_substr\n";
__END__
__C__
#define mtrx(x,y) ((x) + ((y) * match_len))
SV * do_match( char * fixed, AV * list )
{
I32 i;
I32 list_len = -1;
SV** to_match = NULL;
I32 longest = 0;
I32 answer_end = 0;
const I32 fixed_len = strlen(fixed);
SV * result = newSV(0);
if (result == NULL)
return NULL;
/* length of the array */
list_len = av_len(list);
if (list_len < 0)
return NULL;
for (i = 0; i <= list_len; i++) {
I32 match_len;
char *ptr;
char *matrix;
I32 x, y;
int longer = 0;
/* get the i'th entry out of the array */
to_match = av_fetch(list, i, 0);
if (!to_match)
break;
/* get a char* out of the SV (sets match_len as side effect) */
ptr = SvPV(*to_match, match_len);
/* allocate a long string, and use arithmetic to treat it like a matrix */
matrix = calloc(match_len * fixed_len, sizeof(char));
if (matrix == NULL)
return NULL;
answer_end = 0;
/* NB: using 1,1 as root of matrix so we have a zero'd border to prevent
segfaults when fetching matrix[x-1,y-1] */
for (x = 1; x <= fixed_len; x++) {
for (y = 1; y <= match_len; y++) {
if (fixed[x-1] == ptr[y-1]) {
matrix[mtrx(x,y)] = 1 + matrix[mtrx(x-1,y-1)];
if (matrix[mtrx(x,y)] > longest) {
longest = matrix[mtrx(x,y)];
answer_end = y;
longer = 1;
}
}
}
}
/* if this one was longer than previous ones, we store it in the SV */
if (longer) {
/* null terminate it first! */
ptr[answer_end] = 0;
sv_setpvn(result, &ptr[answer_end - longest], longest);
}
free(matrix);
}
return result;
}
And results in:
$ perl common_substring.pl
Benchmark: timing 1000000 iterations of match...
match: 13 wallclock secs (10.04 usr + 0.03 sys = 10.07 CPU) @ 99304.87/s (n=1000000)
Now I need to run it on some real data (tens of thousands of strings) to see how well it scales!
The explanation of the algorithm can be found here if you're interested.
And that works really well... (Score:1)
Re:And that works really well... (Score:2)
Re:And that works really well... (Score:2)
Re:And that works really well... (Score:2)
Re:And that works really well... (Score:1)
This technique only works on fixed-width characters in sets with no combining characters (In which case it works really well) or in cases where you can ensure that you're only using the
Re:And that works really well... (Score:2)
Re:And that works really well... (Score:1)
Re:And that works really well... (Score:1)
Dynamic programming? hmm! (Score:1)
doesn't seem to be up on the web yet though.
Re:Dynamic programming? hmm! (Score:2)
However there's a number of papers out there on longest common sequence detection, that use similar techniques and algorithms (and all use GTCA in their examples). I guess you'd have to keep a corpus of spam around to determine similarity using this technique though.
Re:Dynamic programming? hmm! (Score:1)
http://cbrg.inf.ethz.ch/bio-recipes/NigerianPrince/code.html
'Phylogenetic tree of the "Nigerian Prince" email scam'
protocol analysis URL (Score:1)
a couple of speedups (Score:2)
You can also stop scanning the tail end of a word if the length of longest match at the current point, plus the length of the rest of the word is smaller than the current longest known match. (I think you can make that check every time you get a mismatch, since when the character matches you are extending a possible match that there was room enough for.
Both of these will help with the scaling - as soon as you've found
Super Search at The Monastery.... (Score:1)
http://www.perlmonks.org/index.pl?node_id=249239
(Longest Common Substring)
It references
String::LCSS
Which when I searched CPAN also turned up Algorithm::LCSS
There is some wheel inventing there too
not surprised (Score:2)
If your string matching needs happen to match Perl's string matching capabilities. Regular expressions are not an end-all solution to string matching in general; there is much more to "string matching", and in many cases Perl's "strings are first class" attitude gets in your way. LCS being one of those cases.
All common substrings (Score:1)
I'm doing some preliminary work with Beowulf in Old English to see if there might be anything interesting to do with such information. With that, I need to look at all possible common substrings between pairs of lines, not just the longest.
Since I don't know enough XS to handle creating arrays in C, I did it all in Perl
Re:All common substrings (Score:2)