The words `computer' and `commuter' are very similar, and a change of just one letter, p>m will change the first word into the second. The word `sport' can be changed into `sort' by the deletion of the `p', or equivalently, `sort' can be changed into `sport' by the insertion of `p'.
The edit distance of two strings, s1 and s2, is defined as the minimum number of point mutations required to change s1 into s2, where a point mutation is one of:
 change a letter,
 insert a letter or
 delete a letter
d('', '') = 0  '' = empty string d(s, '') = d('', s) = s  i.e. length of s d(s1+ch1, s2+ch2) = min( d(s1, s2) + if ch1=ch2 then 0 else 1 , d(s1+ch1, s2) + 1, d(s1, s2+ch2) + 1 )
USING DP
Examination of the relations reveals that d(s1,s2) depends only on d(s1',s2') where s1' is shorter than s1, or s2' is shorter than s2, or both. This allows the dynamic programming technique to be used.
A twodimensional matrix, m[0..s1,0..s2] is used to hold the edit distance values:
m[i,j] = d(s1[1..i], s2[1..j]) m[0,0] = 0 m[i,0] = i, i=1..s1 m[0,j] = j, j=1..s2 m[i,j] = min(m[i1,j1] + if s1[i]=s2[j] then 0 else 1 fi, m[i1, j] + 1, m[i, j1] + 1 ), i=1..s1, j=1..s2
Complexity
The timecomplexity of the algorithm is O(s1*s2), i.e. O(n^{2}) if the lengths of both strings is about `n'. The spacecomplexity is also O(n^{2}) if the whole of the matrix is kept for a traceback to find an optimal alignment. If only the value of the edit distance is needed, only two rows of the matrix need be allocated; they can be "recycled", and the space complexity is then O(s1), i.e. O(n).Variations
The costs of the point mutations can be varied to be numbers other than 0 or 1. Linear gapcosts are sometimes used where a run of insertions (or deletions) of length `x', has a cost of `ax+b', for constants `a' and `b'. If b>0, this penalises numerous short runs of insertions and deletions.Longest Common Subsequence
The longest common subsequence (LCS) of two sequences, s1 and s2, is a subsequence of both s1 and of s2 of maximum possible length. The more alike that s1 and s2 are, the longer is their LCS.Other Algorithms
There are faster algorithms for the edit distance problem, and for similar problems. Some of these algorithms are fast if certain conditions hold, e.g. the strings are similar, or dissimilar, or the alphabet is large, etc..Ukkonen (1983) gave an algorithm with worst case time complexity O(n*d), and the average complexity is O(n+d^{2}), where n is the length of the strings, and d is their edit distance. This is fast for similar strings where d is small, i.e. when d<
Applications
File Revision
The Unix commanddiff f1 f2
finds the difference between files f1 and f2, producing an edit script to convert f1 into f2. If two (or more) computers share copies of a large file F, and someone on machine1 edits F=F.bak
, making a few changes, to give F.new
, it might be very expensive and/or slow to transmit the whole revised file F.new
to machine2. However, diff F.bak F.new
F.new
. diff
treats a whole line as a "character" and uses a special editdistance algorithm that is fast when the "alphabet" is large and there are few chance matches between elements of the two strings (files). In contrast, there are many chance charactermatches in DNA where the alphabet size is just 4, {A,C,G,T}. Try `
man diff
' to see the manual entry for diff. Remote Screen Update Problem
If a computer program on machine1 is being used by someone from a screen on (distant) machine2, e.g. viarlogin
etc., then machine1 may need to update the screen on machine2 as the computation proceeds. One approach is for the program (on machine1) to keep a "picture" of what the screen currently is (on machine2) and another picture of what it should become. The differences can be found (by an algorithm related to editdistance) and the differences transmitted...
saving on transmission bandwidth. Spelling Correction
Algorithms related to the edit distance may be used in spelling correctors. If a text contains a word, w, that is not in the dictionary, a `close' word, i.e. one with a small edit distance to w, may be suggested as a correction.Transposition errors are common in written text. A transposition can be treated as a deletion plus an insertion, but a simple variation on the algorithm can treat a transposition as a single point mutation.
Plagiarism Detection
The edit distance provides an indication of similarity that might be too close in some situations ... think about it.The edit distance gives an indication of how `close' two strings are. Similar
measures are used to compute a distance between DNA sequences (strings over {A,C,G,T}, or protein sequences (over an alphabet of 20 amino acids), for various purposes, e.g.:
Molecular Biology
Speech Recognition
Algorithms similar to those for the editdistance problem are used in some speech recognition systems: find a close match between a new utterance and one in a library of classified utterances.Computing Levenshtein distance
A commonlyused bottomup dynamic programming algorithm for computing the Levenshtein distance involves the use of an (n + 1) × (m + 1) matrix, where n and m are the lengths of the two strings. This algorithm is based on the WagnerFisher algorithm[2] for edit distance. Here is pseudocode for a function LevenshteinDistance that takes two strings, s of length m, and t of length n, and computes the Levenshtein distance between them:
int LevenshteinDistance(char s[1..m], char t[1..n])
{
// d is a table with m+1 rows and n+1 columns
declare int d[0..m, 0..n]
for i from 0 to m
d[i, 0] := i // deletion
for j from 0 to n
d[0, j] := j // insertion
for j from 1 to n
{
for i from 1 to m
{
if s[i] = t[j] then
d[i, j] := d[i1, j1]
else
d[i, j] := minimum
(
d[i1, j] + 1, // deletion
d[i, j1] + 1, // insertion
d[i1, j1] + 1 // substitution
)
}
}
return d[m,n]
}
Examples
Two examples of the resulting matrix (hovering over a number reveals the operation performed to get that number):


0 comments:
Post a Comment