15-5 Edit distance
In order to transform one source string of text to a target string , we can perform various transformation operations. Our goal is, given and , to produce a series of transformations that change to . We use an array —assumed to be large enough to hold all the characters it will need—to hold the intermediate results. Initially, is empty, and at termination, we should have for . We maintain current indices into and into , and the operations are allowed to alter and these indices. Initially, . We are required to examine every character in during the transformation, which means that at the end of the sequence of transformation operations, we must have .
We may choose from among six transformation operations:
Copy a character from to by setting and then incrementing both and . This operation examines .
Replace a character from by another character , by setting , and then incrementing both and . This operation examines .
Delete a character from by incrementing but leaving alone. This operation examines .
Insert the character into by setting and then incrementing , but leaving alone. This operation examines no characters of .
Twiddle (i.e., exchange) the next two characters by copying them from to but in the opposite order; we do so by setting and and then setting and . This operation examines and .
Kill the remainder of by setting . This operation examines all characters in that have not yet been examined. This operation, if performed, must be the final operation.
As an example, one way to transform the source string to the target string is to use the following sequence of operations, where the underlined characters are and after the operation:
Note that there are several other sequences of transformation operations that transform to .
Each of the transformation operations has an associated cost. The cost of an operation depends on the specific application, but we assume that each operation's cost is a constant that is known to us. We also assume that the individual costs of the copy and replace operations are less than the combined costs of the delete and insert operations; otherwise, the copy and replace operations would not be used. The cost of a given sequence of transformation operations is the sum of the costs of the individual operations in the sequence. For the sequence above, the cost of transforming to is
a. Given two sequences and and set of transformation-operation costs, the edit distance from to is the cost of the least expensive operatoin sequence that transforms to . Describe a dynamic-programming algorithm that finds the edit distance from to and prints an optimal opeartion sequence. Analyze the running time and space requirements of your algorithm.
The edit-distance problem generalizes the problem of aligning two DNA sequences (see, for example, Setubal and Meidanis [310, Section 3.2]). There are several methods for measuring the similarity of two DNA sequences by aligning them. One such method to align two sequences and consists of inserting spaces at arbitrary locations in the two sequences (including at either end) so that the resulting sequences and have the same length but do not have a space in the same position (i.e., for no position are both and a space). Then we assign a "score" to each position. Position receives a score as follows:
- if and neither is a space,
- if and neither is a space,
- if either or is a space.
The score for the alignment is the sum of the scores of the individual positions. For example, given the sequences and , one alignment is
A under a position indicates a score of for that position, a indicates a score of , and a indicates a score of , so that this alignment has a total score of .
b. Explain how to cast the problem of finding an optimal alignment as an edit distance problem using a subset of the transformation operations copy, replace, delete, insert, twiddle, and kill.
a. We will index our subproblems by two integers, and . We will let indicate the rightmost element of we have not processed and indicate the rightmost element of we have not yet found matches for. For a solution, we call .
b. We will set
and
Then a minimum cost translation of the first string into the second corresponds to an alignment.
We view
- a or a as incrementing a pointer for both strings,
- a as putting a space at the current position of the pointer in the first string, and
- a operation means putting a space in the current position in the second string.
Since s and s have infinite costs, we will have neither of them in a minimal cost solution. The final value for the alignment will be the negative of the minimum cost sequence of edits.
EDIT(x, y, i, j)
let m = x.length
let n = y.length
if i == m
return (n - j)cost(insert)
if j == n
return min{(m - i)cost(delete), cost(kill)}
initialize o1, ..., o5 to ∞
if x[i] == y[j]
o1 = cost(copy) + EDIT(x, y, i + 1, j + 1)
o2 = cost(replace) + EDIT(x, y, i + 1, j + 1)
o3 = cost(delete) + EDIT(x, y, i + 1, j)
o4 = cost(insert) + EDIT(x, y, i, j + 1)
if i < m - 1 and j < n - 1
if x[i] == y[j + 1] and x[i + 1] == y[j]
o5 = cost(twiddle) + EDIT(x, y, i + 2, j + 2)
return min_{i ∈ [5]}{o_i}