Longest Common Subsequence

Suppose you and I each had an ordered list of items and we were interested in comparing how similar those lists are. One calculation we can perform on these two strings is the Longest Common Subsequence. A sequence X is an ordered list of elements <x₁, …, x_n>. A subsequence Z is another sequence where (1) Each element of Z is also an element of X and (2) The elements of Z occur in the same order (in Z) as they do in X.

Note that we do not say that the elements of Z need to be a continuous block of elements. If this were true we would be defining a substring. So as an example, suppose we have as an initial string,
X = C, O, M, P, U, T, E, R.
Then the following are all subsequences:
Z1 = C, M, U, T, R
Z2 = C, O, M, P
Z3 = U, T, E, R
Z4 = O, P, T, E

I will note that Z2 and Z3 are also substrings since they contain continuous sets of characters.

The length of a substring is simply the number of characters it contains. So X has length 8, Z1 has length 5, Z2, Z3 and Z4 have length 4.

Suppose now that we had a second string, Y = P, R, O, G, R, A, M, M, E, R and are interested in the longest common subsequence between the two. We can do that by observing that there is a bit of recursion going on with this question. What I mean by that is that asking the question of “What is the longest common subsequence between X and Y” is the same as asking “What is the longest common subsequence between X and Y once we have seen 8 characters of X and 10 characters of Y”

There are three possible ways to answer this question.

If X8 equals Y10, then we ask the same question about X7 and Y9 and add 1 to the answer.
If X8 is not equal to Y10, then the answer to this will be the same as the maximum of the pair X7, Y10 and the pair X8, Y9.
If we reach a situation where we reach the beginning of either string, we are forced to answer 0 to that question.

Then the function has the following look:

LCS(X_i, Y_j) =

0, if i is 0 or j is 0

1 + LCS(X_i-1, Y_j-1) if X_i equals Y_j

max(LCS(X_i-1, Y_j), LCS(X_i, Y_j-1))

Below is a table showing how we would solve the problem mentioned.

The strategy used to devise this concept is called dynamic programming. It is useful we can solve larger problems by solving overlapping subproblems, as was the case here. In this situation we generally can store the data in a table form and avoid re-solving subproblems for which many larger problems will be dependent.

You can see this algorithm in action at LEARNINGlover.com: Longest Common Subsequece. Let me know what you think.

Longest Common Subsequence

Comments

One response to “Longest Common Subsequence”

Leave a Reply Cancel reply

More posts

Latin Squares are so Beautiful

Puzzle Generation Idea (Its Working)

The Beauty of Euler’s phi function

Pascal’s Triangle