
All the Perl that's Practical to Extract and Report


I work for MessageLabs [messagelabs.com] in Toronto, ON, Canada. I write spam filters, MTA software, high performance network software, string matching algorithms, and other cool stuff mostly in Perl and C.

Journal of Matts (1087)

Wednesday September 29, 2004
01:35 PM

Longest Common Substring

[ #21113 ]

I need to find the longest common substring between one string and a list of others. This would be very easy to do naively - find all the substrings of the first string, then check via index() on the others. However this is O(mnk), which is bad, m'kay. So I managed to find an algorithm that uses a matrix to find the longest common substring in O(nk). [ m = number of substrings of the first string, n = number of strings to look in, k = avg length of strings ].

Unfortunately the algorithm requires looking at each individual character in the strings one by one. Sadly this is very slow in perl, despite perl being perfect for most string matching needs. I hope this is fixed in perl 6. Anyway, I decided the best way to go would be to prototype in Inline::C and then migrate to XS. It turns out I really hate programming in C: array indexing is dangerous, there's no easy way to print data structures, malloc/free is a nightmare, and the for() loops are ugly.

Despite all that, I came up with this, which appears to work pretty well:

#!/usr/bin/perl -w
 
use Inline C;
use Benchmark;
use strict;
 
my $string = "hello";
my @to_match = qw(aloha helen noddy);
 
timethese(1_000_000, {
    match => sub {
        my $max_substr = do_match($string, \@to_match);
        if ($max_substr ne 'hel') { die "Bad match: $max_substr"; }
        #else { print "good match\n" }
    },
});
 
#print "Found: $max_substr\n";
 
__END__
__C__
 
/* stride is fixed_len + 1 so that x = 0 .. fixed_len fits on each row */
#define mtrx(x,y) ((x) + ((y) * (fixed_len + 1)))
 
SV * do_match( char * fixed, AV * list )
{
    I32 i;
    I32 list_len = -1;
    SV** to_match = NULL;
    I32 longest = 0;
    I32 answer_end = 0;
    const I32 fixed_len = strlen(fixed);
    SV * result = newSV(0);
 
    if (result == NULL)
        return NULL;
 
    /* last index of the array (-1 if empty) */
    list_len = av_len(list);
 
    if (list_len < 0)
        return NULL;
 
    for (i = 0; i <= list_len; i++) {
        STRLEN match_len;
        char *ptr;
        char *matrix;
        I32 x, y;
        int longer = 0;
 
        /* get the i'th entry out of the array */
        to_match = av_fetch(list, i, 0);
        if (!to_match)
            break;
 
        /* get a char* out of the SV (sets match_len as a side effect) */
        ptr = SvPV(*to_match, match_len);
 
        /* allocate a zeroed (fixed_len+1) x (match_len+1) buffer and use
           arithmetic to treat it like a matrix */
        matrix = calloc((fixed_len + 1) * (match_len + 1), sizeof(char));
        if (matrix == NULL)
            return NULL;
 
        answer_end = 0;
 
        /* NB: using 1,1 as root of matrix so we have a zero'd border to prevent
           out-of-bounds reads when fetching matrix[x-1,y-1] */
        for (x = 1; x <= fixed_len; x++) {
            for (y = 1; y <= (I32)match_len; y++) {
                if (fixed[x-1] == ptr[y-1]) {
                    matrix[mtrx(x,y)] = 1 + matrix[mtrx(x-1,y-1)];
                    if (matrix[mtrx(x,y)] > longest) {
                        longest = matrix[mtrx(x,y)];
                        answer_end = y;
                        longer = 1;
                    }
                }
            }
        }
 
        /* if this one was longer than previous ones, we store it in the SV.
           sv_setpvn() takes an explicit length, so there's no need to write a
           NUL into ptr (which would clobber the caller's string) */
        if (longer)
            sv_setpvn(result, &ptr[answer_end - longest], longest);
 
        free(matrix);
    }
 
    return result;
}

And results in:

$ perl common_substring.pl
Benchmark: timing 1000000 iterations of match...
     match: 13 wallclock secs (10.04 usr +  0.03 sys = 10.07 CPU) @ 99304.87/s (n=1000000)

Now I need to run it on some real data (tens of thousands of strings) to see how well it scales!

The explanation of the algorithm can be found here if you're interested.

The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • as long as you're using an 8-bit character set. Unicode kinda makes this sub-optimal :)
    • s/char/wchar/ :-) (I know this doesn't really work)
    • It should do just as well on UTF8, no? Except for its probable confusion on string length vs. number of bytes..
      • In a multibyte string you get potential clashes between characters with different surrogate bytes but similar character bytes, meaning invalid results.
      • Nope. UTF-8 is particularly bad. In addition to potentially having combining characters in the match, which is a Unicode issue, you can run into lots of cases where the second (and third, and fourth, and fifth...) byte of two multibyte characters are identical but the characters aren't the same because the first byte isn't.

        This technique only works on fixed-width characters in sets with no combining characters (In which case it works really well) or in cases where you can ensure that you're only using the
    • I had an idea that seems to work on big strings. Kind of like Edward Scissorhands cutting out the "bad stuff" in a bunch and leaving only the "good stuff". If you have two strings a and b, chop up a into pieces that have *characters* in common with b, and vice-versa. Then chop up all the pieces. Like salad or a block of ice. Then it is quick to find the real matches. Should work with other things besides 8-bit text. I did it in Python, here -> http://www.livejournal.com/users/bramcohen/22069.html [livejournal.com]
  • I was just hearing about this technique in a great paper on bioinformatics (specifically seq analysis) for blackbox protocol reverse-engineering at Toorcon; I talked to the guy and he's very interested in applying his techniques to spam. worth a read.

    doesn't seem to be up on the web yet though. :(
    • I'm mostly just parroting the page I read - I have no idea what's "dynamic" about this technique. Maybe that's a consequence of programming perl all the time - everything is dynamic anyway!

      However there's a number of papers out there on longest common sequence detection, that use similar techniques and algorithms (and all use GTCA in their examples). I guess you'd have to keep a corpus of spam around to determine similarity using this technique though.
    • btw, here's the URL for that presentation: http://www.baselineresearch.net/PI/PI-Toorcon.pdf
  • You can reject a word immediately if it is shorter than the current longest known match.

    You can also stop scanning the tail end of a word if the length of the longest match at the current point, plus the length of the rest of the word, is smaller than the current longest known match. (I think you can make that check every time you get a mismatch, since when the character matches you are extending a possible match that there was room enough for.)

    Both of these will help with the scaling - as soon as you've found

  • turns up

    http://www.perlmonks.org/index.pl?node_id=249239
    (Longest Common Substring)

    It references
    String::LCSS

    Which when I searched CPAN also turned up Algorithm::LCSS

    There is some wheel inventing there too
  • > Sadly this is very slow in perl, despite perl being perfect for most string matching needs.

    If your string matching needs happen to match Perl's string matching capabilities. Regular expressions are not an end-all solution to string matching in general; there is much more to "string matching", and in many cases Perl's "strings are first class" attitude gets in your way. LCS being one of those cases.

  • Reading the algorithm made me go back and modify it to find all common substrings. Takes a *lot* longer to run, but for me is more useful.

    I'm doing some preliminary work with Beowulf in Old English to see if there might be anything interesting to do with such information. With that, I need to look at all possible common substrings between pairs of lines, not just the longest.

    Since I don't know enough XS to handle creating arrays in C, I did it all in Perl :/ Only takes 40-50 times longer, but I have al
    • Creating arrays in C is easy:

          AV * foo = newAV();
          av_push(foo, newSVpv(thestring, 0));

      You can just return the AV * and the builtin perl typemap will turn it into a list for you.