Dynamic Programming Approach to Edit Distance Calculation

Edit distance between two strings is the minimum number of edit operations (insertion, deletion, replacement, match) required to transform one string into another. It is computed using dynamic programming, which has three components: the recurrence relation, the tabular computation, and the traceback. This answer explains the definition of the problem and the derivation and proof of the recurrence relation.

Definition of D(i, j)

For two strings S1 and S2, D(i, j) is defined as the edit distance between the first i characters of S1, that is S1[1..i], and the first j characters of S2, that is S2[1..j]. If S1 has n characters and S2 has m characters, the edit distance between the complete strings is D(n, m). Instead of computing D(n, m) directly, dynamic programming solves D(i, j) for all values of i from 0 to n and j from 0 to m, building up to the final answer.

Base Conditions

D(i, 0) = i, since the only way to convert i characters of S1 into zero characters is to delete all of them.
D(0, j) = j, since j insertions are needed to build S2’s first j characters from an empty string.

The Recurrence Relation

When both i and j are strictly positive:

D(i, j) = min [ D(i−1, j) + 1, D(i, j−1) + 1, D(i−1, j−1) + t(i, j) ]

where t(i, j) = 0 if S1(i) = S2(j), and t(i, j) = 1 if S1(i) ≠ S2(j).

Choices for Gap Weights: Constant, Affine & Arbitrary Gap Weights in Sequence Alignment

Gaps in Sequence Alignment and Their Role in cDNA Matching

Correctness of the Recurrence Relation

Correctness is proved using the idea of an edit transcript, a sequence of symbols (I, D, R, M) representing the operations used. The proof has two parts.

Lemma 1 — D(i, j) must be one of the three values

Considering the last symbol of an optimal transcript transforming S1[1..i] into S2[1..j]:

If it is Insertion (I), the earlier part of the transcript optimally transforms S1[1..i] into S2[1..j−1], so D(i, j) = D(i, j−1) + 1.
If it is Deletion (D), the earlier part transforms S1[1..i−1] into S2[1..j], so D(i, j) = D(i−1, j) + 1.
If it is Replacement (R) or Match (M), the earlier part transforms S1[1..i−1] into S2[1..j−1], giving D(i, j) = D(i−1, j−1) + t(i, j), since t(i, j) unifies both cases (0 for match, 1 for replacement).

As every transcript must end in one of these four symbols, D(i, j) can only take one of these three values.

Lemma 2 — D(i, j) is at most the minimum of the three values

Each of the three values can be shown to correspond to an actual feasible transformation:

D(i, j−1) + 1: optimally transform S1[1..i] to S2[1..j−1], then insert S2(j).
D(i−1, j) + 1: optimally transform S1[1..i−1] to S2[1..j], then delete S1(i).
D(i−1, j−1) + t(i, j): optimally transform S1[1..i−1] to S2[1..j−1], then match or replace the last character as needed.

Since all three are feasible, their minimum is also feasible, so D(i, j) ≤ minimum of the three values.

Theorem

Combining Lemma 1 (D(i, j) must equal one of the three values) and Lemma 2 (D(i, j) is ≤ the minimum of the three values), it follows that D(i, j) is exactly equal to the minimum:

D(i, j) = min [ D(i−1, j) + 1, D(i, j−1) + 1, D(i−1, j−1) + t(i, j) ]

Significance

This recurrence relation, along with the base conditions, allows D(i, j) to be computed using only smaller previously solved values, forming the basis for the tabular computation of the full distance matrix and the traceback to recover the optimal edit transcript.

TRENDING ON HEALTH (TOP)

Fick Method Underestimates Heart Problems in Children After Heart Transplant, Study Finds

For Women on Antidepressants, Creatine Showed a Possible Extra Boost

Did the iPhone Quietly Reshape When and Whether Americans Have Children?

For People Antidepressants Never Helped, a 30-Minute Home Session Is Now FDA-Approved

NOW ON AIR (RBC)

Choices for Gap Weights: Constant, Affine & Arbitrary Gap Weights in Sequence Alignment

Gaps in Sequence Alignment and Their Role in cDNA Matching

Local Alignment: Finding Substrings of High Similarity | Notes

Weighted Edit Distance Explained: Operation-Weight & Alphabet-Weight (Notes)

Dynamic Programming Approach to Edit Distance Calculation

Choices for Gap Weights: Constant, Affine & Arbitrary Gap Weights in Sequence Alignment

Gaps in Sequence Alignment and Their Role in cDNA Matching

Shibasis Rath

Related Posts

Choices for Gap Weights: Constant, Affine & Arbitrary Gap Weights in Sequence Alignment

Gaps in Sequence Alignment and Their Role in cDNA Matching

Local Alignment: Finding Substrings of High Similarity | Notes

Leave a Reply Cancel reply

POPULAR NEWS

Chewing gum releases thousands of microplastic particles directly into your mouth with every piece you chew

New Studys Says Gen Z is the least sexually active young cohort in modern recorded history

Yelling Isn’t Just Yelling: How a Hostile Home Rewires a Child’s Brain for Constant Alert

Is Gen Z the First Generation Less Intelligent Than Their Parents?

72-Hour Fasting Can Reset Your Entire Immune System, USC Study Shows

EDITOR CHOICE‘S

Choices for Gap Weights: Constant, Affine & Arbitrary Gap Weights in Sequence Alignment

Gaps in Sequence Alignment and Their Role in cDNA Matching

Local Alignment: Finding Substrings of High Similarity | Notes

Weighted Edit Distance Explained: Operation-Weight & Alphabet-Weight (Notes)

RathBiotaClan – RBC

Latest Posts

SHIBASIS RATH

Get Us On PlayStore

Welcome Back!

Create New Account!

Retrieve your password

Add New Playlist

TRENDING ON HEALTH (TOP)

NOW ON AIR (RBC)

Dynamic Programming Approach to Edit Distance Calculation

Definition of D(i, j)

Base Conditions

The Recurrence Relation

READ ALSO

Correctness of the Recurrence Relation

Lemma 1 — D(i, j) must be one of the three values

Lemma 2 — D(i, j) is at most the minimum of the three values

Theorem

Significance

Related Posts

Leave a Reply Cancel reply

POPULAR NEWS

EDITOR CHOICE‘S

RathBiotaClan – RBC

Latest Posts

SHIBASIS RATH

Get Us On PlayStore

Welcome Back!

Create New Account!

Retrieve your password

Add New Playlist