The levenshtein distance between 2 strings represent the minimal differences that must be applied to the first string to get it to be equivalent to the seconds string.
The 3 operations that be applied are (1) Insertion, (2) Deletion, (3) Swap. An insertion and deletion can be at any point in the string and a swap is just changing an existing letter. There is a recursive equation that is used in a dynamic programming fashion to determines the levenshtein distance of 2 strings.
Let's breakdown this equation and how it works.
and
are the inclusive terminating indices of a string. You must consider these to be be 1-indexed where the first character has an index of 1. If it is not clear why we are doing this, I think it will make more sense when we do an example as the reason is highly motivated by the recursive nature of levenshtein distance.
can be considered the base case. The condition necessary for this is that the minimal index of either
or
is 0; or in otherwords, comparing 2 strings where at least 1 string is the empty string. In the case that one string is an empty string, the levenshtein distance will just be the length of the other string. This also holds true when both strings are empty string.
. In this case, the levenshtein distance does not increase.
is insertion/deletion
is insertion/deletion w.r.t the other string.
is a swapThe classical example for levenshtein distance is finding the distance between "kitten" and "sitting". There are several good exampled layed out online. I will go over the example using "france" and "strange".
Note that #
represents a empty string
i 0 1 2 3 4 5 6 7
j # s t r a n g e
0 #
1 f
2 r
3 a
4 n
5 c
6 e
. Consider
and
. The minimum length is
and the max is
. The intuition behind this is because inorder to transform and empty string ""
to "str"
is 3 insertions of s
,t
, and r
i 0 1 2 3 4 5 6 7
j # s t r a n g e
0 # 0 1 2 3 4 5 6 7
1 f 1
2 r 2
3 a 3
4 n 4
5 c 5
6 e 6
Consider ,
. Since neither index is
and the characters at those indices don't match, we are left with the following.
represents a table lookup at
where we store the number of moves necessary to get the substring with terminating indices of
and
to match. We then add 1 to represent an insertion.
represents a table lookup at
where again, the necessary moves are store. This time the insertion can be thought of with respects to the other substring.
, in the case where the two characters are different, represent swapping out a letter to be another. This can be with respects to either word.We will then repeat this process for all the missing entries.
i 0 1 2 3 4 5 6 7
j # s t r a n g e
0 # 0 1 2 3 4 5 6 7
1 f 1 1 2 3 4 5 6 7
2 r 2 1 2 2 3 4 5 6
3 a 3 2 2 3 2 3 4 5
4 n 4 3 3 3 3 2 3 4
5 c 5 4 4 4 4 3 3 4
6 e 6 5 5 5 5 5 5 3
The levenshtein distance will be the number in the bootom right on the table. In this case its 3. We can confirm this because be inspection: (1) swap f
and s
, (2) insert t
after the first swap then (3) swap c
for g
.
def lev(s1, s2):
m = len(s1)
n = len(s2)
opt = [[-1 for _ in range(m + 1)] for _ in range(n + 1)]
#initialize base cases
for i in range(m + 1):
opt[0][i] = i
for j in range(n + 1):
opt[j][0] = j
# fill in table with recursive case
for i in range(1,n+1):
for j in range(1,m+1):
# offset the index for 1-indexness of the table iteration
if s1[j-1] == s2[i-1]:
opt[i][j] = opt[i-1][j-1]
else:
minimum = min(opt[i-1][j], opt[i][j-1], opt[i-1][j-1])
opt[i][j] = minimum + 1
return opt[n][m]
This happens to be a leetcode problem.