Comparing two strings and measuring how different or similar they are is a fundamental problem in computer science, with applications in evolutionary and structural biology, textual database retrieval, and spelling correction. Among the various ways of formalizing the notion of distance between two strings, the most common and simplest formalization is known as edit distance.
Edit distance measures the difference between two strings in terms of the minimum number of elementary operations required to transform one string into the other. Closely related to edit distance is the concept of string alignment, which provides an alternative, more visual way of representing the same relationship between two strings. Both concepts are widely used together and are foundational to the study of dynamic programming applied to string comparison.
Edit Operations
The transformation of one string into another is carried out using a set of permitted operations performed on individual characters. These operations are:
- Insertion (I) – inserting a character into the first string.
- Deletion (D) – deleting a character from the first string.
- Substitution or Replacement (R) – replacing a character in the first string with a character from the second string.
- Match (M) – a non-operation, indicating that a character is common to both strings and requires no change.
These four symbols—I, D, R, and M—are used to construct what is known as an edit transcript.
Edit Transcript
Definition: A string composed of the symbols I, D, R, and M that describes how one string can be transformed into another is called an edit transcript.
To understand how an edit transcript works, two pointers, next1 and next2, are maintained over the first string (S1) and the second string (S2) respectively. Both pointers initially point to the first character of their respective strings. The transcript is read from left to right, and operations are applied as follows:
- When the symbol I is encountered, the character pointed to by next2 is inserted into S1, and next2 is incremented.
- When the symbol D is encountered, the character pointed to by next1 is deleted from S1, and next1 is incremented.
- When the symbol R or M is encountered, the character at next1 is either replaced or matched with the character at next2, and both pointers are incremented.
Example: Consider transforming the string “vintner” into “writers.” The edit transcript for this transformation is:
R I M D M D M M I
v _ i n t n e r _
w r i _ t _ e r s
Here, v is replaced by w (R), r is inserted (I), i is matched (M), n is deleted (D), t is matched (M), n is deleted (D), e and r are matched (M, M), and finally s is inserted (I).
Edit Distance
Definition: The edit distance between two strings is defined as the minimum number of edit operations—insertions, deletions, and substitutions—required to transform the first string into the second string. It is important to note that matches are not counted while calculating edit distance, since a match does not represent a change.
Edit distance is also known as Levenshtein distance, named after V. Levenshtein, who is credited with first introducing this concept.
An edit transcript that achieves the transformation using the minimum possible number of operations is referred to as an optimal transcript. Since more than one transcript may achieve this same minimum number of operations, such transcripts are collectively called cooptimal transcripts.
The edit distance problem therefore involves two tasks:
- Computing the numerical edit distance between two given strings.
- Determining an optimal edit transcript that describes the actual transformation.
Alternative Interpretation of Edit Distance
Although edit distance is defined as the minimum number of operations needed to convert the first string into the second, it can equivalently be viewed as the minimum number of operations performed on either of the two strings so that both strings are transformed into a common third string. This interpretation is mathematically equivalent to the original definition, because an insertion performed on one string can always be viewed as a deletion performed on the other string, and vice versa.
String Alignment
Definition: A global alignment of two strings S1 and S2 is obtained by inserting chosen spaces (or dashes) into or at the ends of S1 and S2, and then placing the two resulting strings one above the other, such that every character or space in either string corresponds to a unique character or space in the other string.
The term “global” is used to emphasize that the entire length of both strings participates in the alignment. This is in contrast to local alignment, where only selected portions of the strings are aligned (a concept studied separately).
Example: Consider the alignment of the strings “qacdbd” and “qawxb”:
q a c _ d b d
q a w x _ b _
In this alignment, the character c is mismatched with w, and the remaining positions involve either matches or spaces opposite characters.
Relationship Between Alignment and Edit Transcript
From a mathematical point of view, an alignment and an edit transcript are equivalent ways of describing the relationship between two strings, and one can be converted into the other. The correspondence between the two is as follows:
- Two opposing characters that mismatch in an alignment correspond to a substitution (R) in the edit transcript.
- A space in the first string of the alignment corresponds to an insertion (I) of the opposing character into the first string.
- A space in the second string of the alignment corresponds to a deletion (D) of the opposing character from the first string.
Therefore, the edit distance between two strings can also be defined as the alignment that minimizes the total number of mismatched opposing characters plus the number of characters opposite spaces.
Conceptual Difference Between Alignment and Edit Transcript
Although alignment and edit transcript are mathematically equivalent, they differ conceptually:
- An edit transcript emphasizes the actual mutational events (such as point mutations) that transform one string into another. It represents the process of transformation.
- An alignment, on the other hand, only displays the final relationship or correspondence between the two strings. It represents the product of the transformation, without specifying the process.
This distinction is significant in evolutionary modeling, because different evolutionary models may permit different sets of string operations, and yet these different models can still result in the same final alignment. In other words, an alignment alone does not reveal which underlying mutational model produced it. For this reason, the language of “alignment” is often preferred over “edit transcript” in comparative and evolutionary studies, since it is more neutral and does not commit to any particular process. Alignment terminology is also more natural and convenient when extending the discussion to multiple sequence comparison, where more than two strings are compared simultaneously.
Significance of Edit Distance and Alignment
- They provide a precise, quantitative measure of similarity or dissimilarity between two strings.
- They form the mathematical basis for dynamic programming algorithms used in sequence comparison.
- They are extensively applied in bioinformatics for comparing DNA, RNA, and protein sequences.
- They are used in spelling correction systems to suggest the closest matching word.
- They are applied in textual database retrieval to identify approximate matches.
Conclusion
Edit distance and string alignment together form the conceptual foundation for comparing two strings. While edit distance provides a numerical measure of dissimilarity based on the minimum number of insertion, deletion, and substitution operations, alignment offers a visual and structural representation of the same relationship. Both concepts are interconvertible, yet alignment is generally preferred in comparative and evolutionary biology due to its neutrality regarding the underlying process of transformation. These ideas serve as the starting point for more advanced techniques in dynamic programming, including local alignment and multiple sequence alignment.











