Tags: , , ,

# Minimum edit distance between two strings

Minimum edit distance between two strings is minimum number of operations one need to perform on one string so that it transforms into another. Operation allowed are insertion of character, deletion of character and substitution of a character. For example, String S1  = EXPONENTIAL String S2 = POLYNOMIAL

From above example we can see we have to find the best possible alignment of two strings. However, there are so many alignments possible with two string, it will be very costly for consider each and every alignment and look for the best.

Can we break the problem in smaller and easy to solve subproblems? Problem at hand is to find minimum edit distance between X[1…n] and Y[1…m] strings, where n and m are lengths of two strings. Consider prefix of each string X[1…i] and Y[1…j], let’s find edit distance for these prefixes and lets call it Edit(i,j). Finally we need to calculate Edit(n,m).   At each character we have three choices, Let’s consider each case one by one:

If character `X[i]  != Y[j]`, we make the choice delete character X[i], which costs us 1. Now we have i-1 characters in X and j characters in Y to consider which is nothing but `Edit(i-1,j)`.

Second choice we have is to add a character from X, which costs 1.  In this, we have an extra character but have not processed any of the original characters in X, however, character `Y[j]` now matches with new character inserted, so no need to include that. Problem reduces to `Edit(i, j-1)`.

Third choice is to replace the character X[i] with Y[j]. Cost of replacing character is 1 if `X[i] != X[j]`, however, if `X[i] == Y[j]`, cost is 0. In any case, problem reduces to `Edit(i-1, j-1)`.

We do not know which one to pick to start with, so we will try all of them on each character and pick the one which gives us the minimum value. Original  problem can be defined in terms of subproblems as follows:

`Edit(i,j) = min ( 1 + Edit(i,j-1),                   1 + Edit(i-1,j),                  Edit(i-1, j-1) if X[i] == Y[j]                  1 + Edit(i-1, j-1) if X[i] != Y[j] )`

What will be the base case? If both strings are of length zero, cost will be 0.
If one string is of length 0, then cost will be length of other string.

Recursive implementation of edit distance problem

```#include&amp;lt;stdio.h&amp;lt;
#include&amp;lt;string.h&amp;lt;

int min(int a, int b) {
int min = a &amp;gt; b ? b : a;
return min;
}

int editDistance(char *s1, char *s2, int length1, int length2){

printf("\nlength1 = %d, length2 = %d" , length1, length2);

if(length1 == 0 &amp;&amp; length2 == 0) return 0;

if(length1 == 0) return length2;

if(length2 == 0) return length1;

int isCharacterEqual = s1[length1] == s2[length2] ? 0 : 1;
return min( min(( 1 + editDistance(s1,s2, length1-1, length2)),
( 1 + editDistance(s1,s2,length1, length2-1))
),
(isCharacterEqual + editDistance(s1,s2, length1-1,
length2-1)
);
}
int main(){
char *s = "EXPONENTIAL";
char *d = "POLYNOMIAL";
printf("Minimum distance between two strings is : %d",
editDistance(s,d, strlen(s), strlen(d)));

return 0;
}
```

If we look at the execution trail, it is evident that we are solving same subproblems again and again.

Now, we know two things. First optimal solution to original problem depends on optimal solution of subproblems (see recursive relation above). Second, there are overlapping subproblems, which are recalculated again and again. How can we avoid solving same problem again? Well, store it for later use. That concept is called Memoization and used in dynamic programming.

To implement above formula in dynamic programming, a two dimensional table is required where Table(i,j) stores Edit(i,j) and every cell can be calculated with bottom up approach. At the end Table(n,m) gives the final minimum edit distance. Does not matter, if we fill table row wise or column wise, when we reach at cell (i,j), we will have all the required cells already filled in. To start with Table[0,i]  = i and Table[j,0] = j.Why? Look at the base case for recursive relation.

### Edit distance between two strings : Dynamic programming implementation

```int editDistance(char *s1, char *s2){
int n = strlen(s1);
int m = strlen(s2);

int minimumDistance = 0;
int currentMinimum  = 0;
int Table[n+1][m+1] ;

memset(Table,0, sizeof(Table));

//Intitialization
for(int i=0; i&amp;le;n; i++)
Table[i] =i;

for(int i=1; i&amp;le;m; i++)
Table[i] = i;

for(int i=1; i&amp;le;n; i++){
for(int j=1; j&amp;le;m; j++){
//Case 3 : Possibility 1 :If X[i] == Y[j]
if(s1[i-1] == s2[j-1]){
currentMinimum = Table[i-1][j-1];
}
//Case 3 : Possibility 2 :If X[i] != Y[j]
else{
currentMinimum =  Table[i-1][j-1] + 1;
}
//Case 1 : Deletion of character from S1
if(Table[i][j-1] &amp;gt; Table[i-1][j]){
minimumDistance = Table[i-1][j] + 1;
}
//Case 2 : Addition of character on S1
else {
minimumDistance = Table[i][j-1] + 1;
}
if(currentMinimum &amp;lt; minimumDistance){
minimumDistance = currentMinimum;
}
Table[i][j] = minimumDistance;
}
}
return Table[n-1][m-1];
}
```

Complexity of algorithm to find minimum edit distance between two strings is O(n2) with extra space complexity of O(n2).

Please share if there is something wrong or missing. If you are interested in contributing to website, please reach out to us on communications@algorithmsandme.com