15-5 Edit distance

In order to transform one source string of text x[1..m]x[1..m] to a target string y[1..n]y[1..n], we can perform various transformation operations. Our goal is, given xx and yy, to produce a series of transformations that change xx to yy. We use an array zz—assumed to be large enough to hold all the characters it will need—to hold the intermediate results. Initially, zz is empty, and at termination, we should have z[j]=y[j]z[j] = y[j] for j=1,2,,nj = 1, 2, \ldots, n. We maintain current indices ii into xx and jj into zz, and the operations are allowed to alter zz and these indices. Initially, i=j=1i = j = 1. We are required to examine every character in xx during the transformation, which means that at the end of the sequence of transformation operations, we must have i=m+1i = m + 1.

We may choose from among six transformation operations:

Copy a character from xx to zz by setting z[j]=x[i]z[j] = x[i] and then incrementing both ii and jj. This operation examines x[i]x[i].

Replace a character from xx by another character cc, by setting z[j]=cz[j] = c, and then incrementing both ii and jj. This operation examines x[i]x[i].

Delete a character from xx by incrementing ii but leaving jj alone. This operation examines x[i]x[i].

Insert the character cc into zz by setting z[j]=cz[j] = c and then incrementing jj, but leaving ii alone. This operation examines no characters of xx.

Twiddle (i.e., exchange) the next two characters by copying them from xx to zz but in the opposite order; we do so by setting z[j]=x[i+1]z[j] = x[i + 1] and z[j+1]=x[i]z[j + 1] = x[i] and then setting i=i+2i = i + 2 and j=j+2j = j + 2. This operation examines x[i]x[i] and x[i+1]x[i + 1].

Kill the remainder of xx by setting i=m+1i = m + 1. This operation examines all characters in xx that have not yet been examined. This operation, if performed, must be the final operation.

As an example, one way to transform the source string algorithm\text{algorithm} to the target string altruistic\text{altruistic} is to use the following sequence of operations, where the underlined characters are x[i]x[i] and z[j]z[j] after the operation:

Operationxzinitial stringsalgorithm_copyalgorithma_copyalgorithmal_replace by talgorithmalt_deletealgorithmalt_copyalgorithmaltr_insert ualgorithmaltru_insert ialgorithmaltrui_insert salgorithmaltruis_twiddlealgorithmaltruisti_insert calgorithmaltruistic_killalgorithm_altruistic_ \begin{array}{lll} \text{Operation} & x & z \\ \hline \textit{initial strings} & \underline algorithm & \text{\textunderscore} \\ \text{copy} & a\underline lgorithm & a\text{\textunderscore} \\ \text{copy} & al\underline gorithm & al\text{\textunderscore} \\ \text{replace by $t$} & alg\underline orithm & alt\text{\textunderscore} \\ \text{delete} & algo\underline rithm & alt\text{\textunderscore} \\ \text{copy} & algor\underline ithm & altr\text{\textunderscore} \\ \text{insert $u$} & algor\underline ithm & altru\text{\textunderscore} \\ \text{insert $i$} & algor\underline ithm & altrui\text{\textunderscore} \\ \text{insert $s$} & algor\underline ithm & altruis\text{\textunderscore} \\ \text{twiddle} & algorit\underline hm & altruisti\text{\textunderscore} \\ \text{insert $c$} & algorit\underline hm & altruistic\text{\textunderscore} \\ \text{kill} & algorithm\text{\textunderscore} & altruistic\text{\textunderscore} \end{array}

Note that there are several other sequences of transformation operations that transform algorithm\text{algorithm} to altruistic\text{altruistic}.

Each of the transformation operations has an associated cost. The cost of an operation depends on the specific application, but we assume that each operation's cost is a constant that is known to us. We also assume that the individual costs of the copy and replace operations are less than the combined costs of the delete and insert operations; otherwise, the copy and replace operations would not be used. The cost of a given sequence of transformation operations is the sum of the costs of the individual operations in the sequence. For the sequence above, the cost of transforming algorithm\text{algorithm} to altruistic\text{altruistic} is

(3 cost(copy)) + cost(replace) + cost(delete) + (4 cost(insert)) + cost(twiddle) + cost(kill).\text{($3 \cdot$ cost(copy)) + cost(replace) + cost(delete) + ($4 \cdot$ cost(insert)) + cost(twiddle) + cost(kill)}.

a. Given two sequences x[1..m]x[1..m] and y[1..n]y[1..n] and set of transformation-operation costs, the edit distance from xx to yy is the cost of the least expensive operatoin sequence that transforms xx to yy. Describe a dynamic-programming algorithm that finds the edit distance from x[1..m]x[1..m] to y[1..n]y[1..n] and prints an optimal opeartion sequence. Analyze the running time and space requirements of your algorithm.

The edit-distance problem generalizes the problem of aligning two DNA sequences (see, for example, Setubal and Meidanis [310, Section 3.2]). There are several methods for measuring the similarity of two DNA sequences by aligning them. One such method to align two sequences xx and yy consists of inserting spaces at arbitrary locations in the two sequences (including at either end) so that the resulting sequences xx' and yy' have the same length but do not have a space in the same position (i.e., for no position jj are both x[j]x'[j] and y[j]y'[j] a space). Then we assign a "score" to each position. Position jj receives a score as follows:

  • +1+1 if x[j]=y[j]x'[j] = y'[j] and neither is a space,
  • 1-1 if x[j]y[j]x'[j] \ne y'[j] and neither is a space,
  • 2-2 if either x[j]x'[j] or y[j]y'[j] is a space.

The score for the alignment is the sum of the scores of the individual positions. For example, given the sequences x=GATCGGCATx = \text{GATCGGCAT} and y=CAATGTGAATCy = \text{CAATGTGAATC}, one alignment is

GATCGGCATCAATGTGAATC++++++ \begin{array}{cccccccccccc} \text G & & \text A & \text T & \text C & \text G & & \text G & \text C & \text A & \text T & \\ \text C & \text A & \text A & \text T & & \text G & \text T & \text G & \text A & \text A & \text T & \text C \\ - & * & + & + & * & + & * & + & - & + & + & * \end{array}

A ++ under a position indicates a score of +1+1 for that position, a - indicates a score of 1-1, and a * indicates a score of 2-2, so that this alignment has a total score of 62142=46 \cdot -2 \cdot 1 - 4 \cdot 2 = -4.

b. Explain how to cast the problem of finding an optimal alignment as an edit distance problem using a subset of the transformation operations copy, replace, delete, insert, twiddle, and kill.

a. We will index our subproblems by two integers, 1im1 \le i \le m and 1jn1 \le j \le n. We will let ii indicate the rightmost element of xx we have not processed and jj indicate the rightmost element of yy we have not yet found matches for. For a solution, we call EDIT(x,y,i,j)\text{EDIT}(x, y, i, j).

b. We will set

cost(delete)=cost(insert)=2,\text{cost(delete)} = \text{cost(insert)} = 2,

cost(copy)=1,\text{cost(copy)} = −1,

cost(replace)=1,\text{cost(replace)} = 1,

and

cost(twiddle)=cost(kill)=.\text{cost(twiddle)} = \text{cost(kill)} = \infty.

Then a minimum cost translation of the first string into the second corresponds to an alignment.

We view

  • a copy\text{copy} or a replace\text{replace} as incrementing a pointer for both strings,
  • a insert\text{insert} as putting a space at the current position of the pointer in the first string, and
  • a delete\text{delete} operation means putting a space in the current position in the second string.

Since twiddle\text{twiddle}s and kill\text{kill}s have infinite costs, we will have neither of them in a minimal cost solution. The final value for the alignment will be the negative of the minimum cost sequence of edits.

EDIT(x, y, i, j)
    let m = x.length
    let n = y.length
    if i == m
        return (n - j)cost(insert)
    if j == n
        return min{(m - i)cost(delete), cost(kill)}
    initialize o1, ..., o5 to ∞
    if x[i] == y[j]
        o1 = cost(copy) + EDIT(x, y, i + 1, j + 1)
    o2 = cost(replace) + EDIT(x, y, i + 1, j + 1)
    o3 = cost(delete) + EDIT(x, y, i + 1, j)
    o4 = cost(insert) + EDIT(x, y, i, j + 1)
    if i < m - 1 and j < n - 1
        if x[i] == y[j + 1] and x[i + 1] == y[j]
            o5 = cost(twiddle) + EDIT(x, y, i + 2, j + 2)
    return min_{i ∈ [5]}{o_i}