Weighted Edit Distance Explained: Operation-Weight & Alphabet-Weight (Notes)

Edit distance measures how many insertions, deletions, and substitutions are required to transform one string into another. The basic edit distance model treats every insertion, deletion, and substitution as having equal cost, which is unrealistic in many practical situations, particularly in molecular biology, where different mutations occur with different frequencies and carry different biological significance. The concept of weighted edit distance generalises the basic edit distance problem by assigning different costs to different edit operations, or even to different characters involved in those operations. This generalisation makes the model far more powerful and applicable to real-world problems such as DNA and protein sequence comparison. Weighted edit distance is broadly classified into two types: operation-weight edit distance and alphabet-weight edit distance.

Operation-Weight Edit Distance

Definition

In operation-weight edit distance, a fixed weight or cost is assigned to each type of edit operation, regardless of which specific characters are involved. Three weights are defined:

d – the weight of an insertion or a deletion (also called a “space” or “gap”)
r – the weight of a substitution (a mismatch)
e – the weight of a match (usually small, and often taken as zero)

The operation-weight edit distance problem is defined as the problem of finding an edit transcript that transforms string S₁ into string S₂ with the minimum total operation weight.

The ordinary (unweighted) edit distance studied earlier is simply a special case of this generalisation, obtained by setting d = 1, r = 1, and e = 0.

Illustration

If a mismatch is given a weight of 2, a space (insertion/deletion) a weight of 4, and a match a weight of 1, then for the alignment of the strings “writers” and “vintner”, an alignment such as:

Choices for Gap Weights: Constant, Affine & Arbitrary Gap Weights in Sequence Alignment

Gaps in Sequence Alignment and Their Role in cDNA Matching

w r i t . e r s
v i n t n e r

can be shown to have a total weight of 17, and this happens to be the optimal (minimum) alignment for the given weights.

Condition on Substitution Weight

Since a substitution can always be simulated by performing a deletion followed by an insertion, it is necessary that the weight assigned to a substitution be strictly less than the sum of the weights of a deletion and an insertion. If this condition is not maintained, the algorithm would never actually choose a substitution operation, because deleting and then inserting would always be cheaper or equal, making the substitution operation redundant in the model.

Recurrence Relation

Let D(i, j) denote the minimum total weight required to transform the prefix S₁[1..i] into the prefix S₂[1..j]. Define t(i, j) as follows:

t(i, j) = e, if S₁(i) = S₂(j) (i.e., the characters match)
t(i, j) = r, if S₁(i) ≠ S₂(j) (i.e., the characters differ, a substitution is needed)

Base conditions:

D(i, 0) = i × d D(0, j) = j × d

General recurrence:

D(i, j) = min [ D(i, j−1) + d, D(i−1, j) + d, D(i−1, j−1) + t(i, j) ]

This recurrence is only a minor modification of the standard edit distance recurrence, where the constant costs of 1 (for indels) and the 0/1 cost (for match/mismatch) are simply replaced by the general weights d, r, and e.

Time Complexity

For two strings of length n and m, the operation-weight edit distance can be computed in O(nm) time using dynamic programming, since the table D(i, j) has (n+1)(m+1) entries, and each entry is computed in constant time from previously computed entries.

Graph-Theoretic Representation

The operation-weight edit distance problem can also be formulated and solved as a shortest path problem on a weighted edit graph, where every edge in the graph is assigned a weight corresponding to the cost of the edit operation it represents (insertion, deletion, or substitution/match). Finding the optimal alignment then corresponds to finding the shortest path from the source to the sink in this weighted graph.

Alphabet-Weight Edit Distance

Definition

Alphabet-weight edit distance is a further and more refined generalisation of edit distance. Here, the weight or score of a substitution is allowed to depend on exactly which pair of characters from the alphabet is involved in the substitution, rather than being a single fixed cost r for all mismatches. For example, replacing the character A with T might be given a different (higher or lower) cost than replacing A with G, reflecting the fact that some substitutions are biologically more likely or less costly than others. Similarly, the cost of inserting or deleting a particular character may also depend on which character of the alphabet is being inserted or deleted.

An important clarification is that in alphabet-weight edit distance, the weight of an operation depends only on which characters are involved, and not on the position at which those characters occur in the string.

Relationship with Operation-Weight Edit Distance

The operation-weight edit distance problem is a special case of the alphabet-weight edit distance problem: if all substitutions are given the same cost r (irrespective of the character pair) and all insertions/deletions are given the same cost d (irrespective of the character), the alphabet-weight problem reduces exactly to the operation-weight problem. The dynamic programming recurrence used for operation-weight edit distance can be modified in a straightforward manner to handle alphabet-dependent weights, simply by replacing the constants d, r, e with functions of the specific characters involved.

Applications in Biology

Protein comparison: When comparing protein sequences, the term “edit distance” almost always refers to the alphabet-weight edit distance computed over the alphabet of amino acids. This is because different amino acid substitutions have very different biological likelihoods and functional consequences.
- The most widely used scoring schemes for amino acids are the PAM (Point Accepted Mutation) matrices, developed by Dayhoff, and the newer BLOSUM (Blocks Substitution Matrix) matrices, developed by the Henikoffs.
- These matrices are technically defined for a maximisation (similarity) problem rather than a minimisation (distance) problem, but the two are closely related.
- A body of mathematical theory has been developed to explain how such scoring schemes should be interpreted, and how they should relate to the underlying biological data and to the type of sequence search being performed.
DNA comparison: When comparing DNA strings, unweighted or operation-weight edit distance is more commonly used, since there are only four nucleotide bases and biological substitution patterns are comparatively simpler. For example, the popular sequence database search program BLAST scores an identity (match) as +5 and a mismatch as −4. Nevertheless, alphabet-weighted edit distance schemes have also been proposed for DNA sequence comparison in certain contexts.

String Similarity

Definition and Motivation

Edit distance is one method of formalising how “related” two strings are. An alternative and often preferred approach, especially in biological applications, is to measure the similarity between two strings rather than their distance. Similarity is preferred in biology for technical reasons connected to local alignment and scoring flexibility.

Scoring Matrix

Let Σ be the alphabet used for the strings S₁ and S₂, and let Σ’ be the alphabet Σ together with an additional special character “_” representing a space (used for insertions/deletions). For any two characters x and y in Σ’, the function s(x, y) denotes the score obtained by aligning character x against character y. This function is generally called the scoring matrix (in general computer science terminology) — though in molecular biology it is more commonly referred to as an “amino acid substitution matrix” or “nucleotide substitution matrix,” since biologists reserve the term “weight matrix” for a different concept altogether.

Value of an Alignment

For a given alignment A of the strings S₁ and S₂, let S₁′ and S₂′ denote the two strings after the necessary spaces have been inserted, and let l denote their common (equal) length under this alignment. The value of alignment A is defined as:

Value(A) = Σ (from i = 1 to l) of s(S₁′(i), S₂′(i))

That is, every position i in the alignment specifies a pair of opposing characters (possibly including a space), and the total value of the alignment is obtained by summing the individual scores contributed at each aligned position.

Numerical Example

Given an alphabet Σ = {a, b, c, d} and a scoring matrix defining values for every pair of characters (including the space character), consider the alignment:

c a c d b d
c a b b d b

Using the given scoring matrix values position by position, the total value of this alignment works out to: 0 + 1 − 2 + 0 + 3 + 3 − 1 = 4

Sign Convention in Similarity Scoring

In similarity-based scoring schemes, it is conventional that:

s(x, y) ≥ 0, when characters x and y match (or are considered similar)
s(x, y) < 0, when characters x and y mismatch (or a space is involved)

With such a convention, the objective becomes to find an alignment with the largest possible value, since a larger value indicates more matches/similarities and fewer mismatches or gaps. This is the opposite optimisation direction compared to edit distance, where the goal is to minimise the total cost.

Definition of String Similarity

Given a pairwise scoring matrix defined over the alphabet Σ’, the similarity of two strings S₁ and S₂ is defined as the value of that alignment A of S₁ and S₂ which maximises the total alignment value. This maximum value is also referred to as the optimal alignment value of S₁ and S₂.

Relationship Between Similarity and Weighted Edit Distance

String similarity is closely related to alphabet-weight edit distance, and in many cases, depending on the specific scoring matrix chosen, one problem can be transformed into the other by suitable adjustment of the sign and scale of the weights. However, an important distinction between similarity-based alignment and weighted edit distance becomes apparent when local alignment is introduced, since local alignment (finding the best-matching substrings rather than aligning entire strings end-to-end) is naturally formulated using similarity scoring rather than distance minimisation. This is why the similarity framework, rather than the edit-distance framework, is generally preferred in biological sequence analysis, particularly for local alignment problems.

Significance of Weighted Edit Distance

It provides a far more realistic and flexible model for real biological mutation processes, since not all mutations are equally likely or equally significant.
It allows domain-specific knowledge (such as known mutation rates between amino acids) to be incorporated directly into the alignment algorithm through appropriate scoring/weight matrices.
It generalises the unweighted edit distance model as a special case, so all algorithms designed for weighted edit distance remain applicable to the simpler unweighted case.
It forms the mathematical foundation for practical bioinformatics tools such as BLAST and for local alignment algorithms used extensively in genome and protein sequence analysis.

Conclusion

Weighted edit distance extends the basic edit distance framework by allowing costs to vary either by operation type (operation-weight edit distance) or by the specific characters involved (alphabet-weight edit distance). Both generalisations can be computed efficiently using dynamic programming in O(nm) time, using recurrences that are simple modifications of the classical edit distance recurrence. In biological applications, alphabet-weight edit distance using matrices such as PAM and BLOSUM is standard for protein comparison, while simpler operation-weight schemes such as those used in BLAST are common for DNA comparison. Closely related to weighted edit distance is the concept of string similarity, which reformulates the problem as a maximisation problem using a scoring matrix, and which is generally the preferred formulation in biological sequence analysis, especially for local alignment.

TRENDING ON HEALTH (TOP)

Fick Method Underestimates Heart Problems in Children After Heart Transplant, Study Finds

For Women on Antidepressants, Creatine Showed a Possible Extra Boost

Did the iPhone Quietly Reshape When and Whether Americans Have Children?

For People Antidepressants Never Helped, a 30-Minute Home Session Is Now FDA-Approved

NOW ON AIR (RBC)

Choices for Gap Weights: Constant, Affine & Arbitrary Gap Weights in Sequence Alignment

Gaps in Sequence Alignment and Their Role in cDNA Matching

Local Alignment: Finding Substrings of High Similarity | Notes