Local Alignment: Finding Substrings of High Similarity | Notes

Sequence comparison is one of the fundamental problems in computational biology and string algorithms. While global alignment compares two strings in their entirety, many practical situations demand comparison of only certain regions of two strings that show high similarity, even though the strings as a whole may be quite dissimilar. This requirement gives rise to the concept of local alignment, also known as the local similarity problem. Local alignment identifies a pair of substrings one from each of the two given strings — whose similarity is the maximum possible among all such pairs of substrings. This concept is of great importance in molecular biology, particularly in DNA and protein sequence analysis, where only specific conserved regions of long sequences may be biologically significant.

Definition of Local Alignment Problem

Local Alignment Problem: Given two strings S₁ and S₂, the task is to find substrings α and β of S₁ and S₂ respectively, such that the similarity (that is, the value of the optimal global alignment) between α and β is maximum over all possible pairs of substrings that can be chosen from S₁ and S₂. The value of this optimal solution is denoted by v*.

It is important to note that local alignment is defined in terms of similarity (a value to be maximised) rather than in terms of edit distance (a value to be minimised). If local alignment were defined using edit distance, the optimal pair of substrings under most natural scoring schemes would simply be exact matching substrings, which could be as short as a single character, and this would fail to identify any meaningful region of resemblance. By defining local alignment through a similarity scheme in which matches contribute positive scores while mismatches and spaces (gaps) contribute negative scores, the algorithm is guided towards discovering longer and more biologically meaningful regions of similarity.

Illustrative Example

Consider two strings:

S₁ = pqraxabcstvq
S₂ = xyaxbacsll

Let the scoring scheme assign a value of +2 for a match, −2 for a mismatch, and −1 for a space (gap).

Choices for Gap Weights: Constant, Affine & Arbitrary Gap Weights in Sequence Alignment

Gaps in Sequence Alignment and Their Role in cDNA Matching

Under this scheme, the substrings α = axabcs (from S₁) and β = axbacs (from S₂) yield the following optimal global alignment between themselves:

a x a b c s
a x b a c s

This alignment has a total value of 8, and it can be shown that no other pair of substrings taken one from each of S₁ and S₂ produces a higher similarity value under this scoring scheme. Hence, for this scoring scheme, the local alignment of S₁ and S₂ has optimal value v* = 8, defined by the substring pair (axabcs, axbacs).

Why Local Alignment Is Needed: Biological Justification

Global Alignment and Its Applicability

Global alignment is meaningful when two protein sequences belong to the same protein family and are expected to be similar over their entire length. For example:

Cytochrome c has nearly the same length across different organisms, and a meaningful relationship can be expected across the entire sequence length when comparing two species.
Similarly, proteins of the globin family, such as myoglobin and hemoglobin, show similarity across their complete lengths.

In such cases, where evolutionary history is deduced by comparing whole protein sequences of the same family, global alignment is effective and appropriate.

Situations Demanding Local Alignment

However, in many important biological contexts, local similarity is far more meaningful than global similarity:

Comparison of long anonymous DNA stretches — Only certain internal sections of long DNA strings may actually be related, while the rest of the sequence may show no correspondence at all.
Comparison of protein sequences from different families — Proteins belonging to different families are frequently composed of common structural or functional subunits, known as motifs or domains. Local alignment is the appropriate technique to search for these unknown, embedded subunits. Different proteins are often built from related motifs that form the protein’s structural core, while the surrounding surface-loop regions may differ considerably between proteins.
Homeobox genes — a classic example of conserved domains — Homeobox genes occur in a wide range of species, from fruit flies to frogs to humans, and regulate high-level embryonic development. A single mutation in these genes can transform one body part into another (for instance, causing fruit fly antennae to develop as legs). Although the overall protein sequences encoded by homeobox genes differ greatly between species, one region called the homeodomain — consisting of about sixty amino acids that bind to DNA — is highly conserved. Certain insect and mammalian homeodomains show 50% to 95% identity even without introducing gaps. Since this DNA-binding region is central to gene regulation, it remains highly conserved while the remainder of the protein varies considerably. This is a case where local alignment is clearly superior to global alignment for meaningful comparison.
Conservation of isolated critical residues — In many protein families, only a few isolated amino acids are highly conserved, such as the residues at an enzyme’s active site or those forming the hydrophobic core of a globular protein. Local alignment is more likely to detect such conserved residues than global alignment.
- Example: The family of serine proteases, characterised by a few isolated conserved amino acids.
- Example: The Helix-Turn-Helix (HTH) motif, common in DNA-binding transcription-regulating proteins, where the tenth position is very frequently occupied by glycine while the rest of the motif shows variability.
Chimeric proteins formed through evolution — As emphasised by C. Chothia, existing proteins have evolved not merely through point mutations, insertions, and deletions, but also through combination of genes to produce chimeric proteins, especially large proteins that appeared later in evolution. Such proteins are often built from combinations of a relatively small repertoire of selected protein domains. As Doolittle summarised, one must remain alert to regions of similarity even when embedded within an overall background of dissimilarity — precisely what local alignment is designed to detect.

A Balanced View

Although local alignment is regarded as the more appropriate technique for comparing proteins from different families, it has also been observed that pairs of proteins first identified through strong local similarity often also show extensive global similarity. Some studies further suggest that in certain situations, global alignment may expose important biological commonalities more effectively than local alignment. Thus, the choice between local and global alignment should depend on the biological context of the comparison.

Computational Approach to Local Alignment

Why Not Simply Use Global Alignment?

One might ask why local regions of high similarity cannot simply be identified by first performing a global alignment of the two strings. The answer is that although an optimal global alignment is influenced by regions of high similarity, such local regions can easily get “lost” within the overall optimal global alignment of long strings. Therefore, explicit and direct computation of local similarity is required.

Time Complexity

If S₁ and S₂ have lengths n and m respectively, the local alignment problem can be solved in O(nm) time — the same order of time required for global alignment. This is a remarkable result because there are Θ(n²m²) possible pairs of substrings between the two strings. Even if a global alignment could be computed in constant time for each pair, the overall time would still be Θ(n²m²). Using a naive O(kl) bound for aligning strings of length k and l would give a total time of O(n³m³). The O(nm) algorithm was developed by Temple Smith and Michael Waterman, and is presented below.

The Local Suffix Alignment Problem

To solve the local alignment problem efficiently, a more restricted sub-problem is first defined:

Local Suffix Alignment Problem: Given indices i ≤ n and j ≤ m, find a (possibly empty) suffix α of S₁[1..i] and a (possibly empty) suffix β of S₂[1..j] such that the alignment value V(α, β) is maximum over all such suffix pairs. This maximum value is denoted v(i, j).

Since both suffixes are permitted to be empty, v(i, j) is always greater than or equal to zero.

Example: Using a scoring scheme of +2 for a match and −1 for a mismatch or space, with S₁ = abcxdex and S₂ = xxxcde:

v(3, 4) = 2 (the two c’s match)
v(4, 5) = 1 (cx aligns with cd)
v(5, 5) = 3 (xcd aligns with xcd type alignment)
v(6, 6) = 5 (xcde aligns with xcde type alignment)

Relationship Between Local Alignment and Local Suffix Alignment

Theorem 1: v* = max{v(i, j) : i ≤ n, j ≤ m}.

Proof (in brief): v* ≥ max{v(i, j)} because any optimal solution to the local suffix alignment problem for a given i, j is itself a feasible solution to the general local alignment problem. Conversely, if α, β is the optimal solution to the local alignment problem, ending at positions i* and j*, then α, β also forms a local suffix alignment for the pair (i*, j*). Hence v* ≤ v(i*, j*) ≤ max{v(i, j)}. Both inequalities together establish the theorem.

Theorem 2: If (i′, j′) is the index pair that maximises v(i, j) over all pairs, then the substrings that solve the local suffix alignment problem for (i′, j′) also solve the local alignment problem.

These two theorems together show that solving the local suffix alignment problem for all index pairs and locating the maximum value directly yields the solution to the local alignment problem.

Recurrence Relation for Local Suffix Alignment

Base Case: v(i, 0) = 0 and v(0, j) = 0 for all i, j, since an empty suffix is always a valid choice.

Theorem 11.7.3 (Recurrence): For i > 0 and j > 0:

v(i, j) = max { 0,
               v(i−1, j−1) + s(S₁(i), S₂(j)),
               v(i−1, j) + s(S₁(i), −),
               v(i, j−1) + s(−, S₂(j)) }

Justification: Let α and β be the substrings whose global alignment gives the optimal local suffix alignment for (i, j).

Since α and β may be empty, 0 is always a valid candidate value.
If α is non-empty, then S₁(i) must be aligned either with S₂(j) or with a space.
If β is non-empty, then S₂(j) must be aligned either with S₁(i) or with a space.

Considering these possibilities exhaustively:

If S₁(i) aligns with S₂(j): v(i, j) = v(i−1, j−1) + s(S₁(i), S₂(j))
If S₁(i) aligns with a space: v(i, j) = v(i−1, j) + s(S₁(i), −)
If S₂(j) aligns with a space: v(i, j) = v(i, j−1) + s(−, S₂(j))

Since all cases are covered, v(i, j) must equal one of these four terms, and it must be the maximum among them, since each term corresponds to a feasible alignment. This establishes the recurrence.

Key Distinguishing Feature: The recurrence for local suffix alignment is almost identical to that of global alignment, differing only in the inclusion of the value zero as a candidate. This zero effectively allows the alignment to “restart” whenever the accumulated score becomes negative, so that leading characters that reduce similarity can simply be ignored (treated as an empty suffix) instead of dragging the score down.

Algorithm and Traceback

The method for computing v* consists of the following steps:

Construct the dynamic programming table for v(i, j) for all i ≤ n, j ≤ m using the above recurrence.
Locate the maximum value in the entire table, occurring at some cell (i*, j*).
As usual in dynamic programming, maintain pointers while filling the table to record which term produced each cell’s value.
Trace back the pointers from cell (i*, j*) until a cell with value zero is reached — this marks the start of the optimal local alignment.
The substrings that constitute the optimal local alignment are α = S₁[i′..i*] and β = S₂[j′..j*], where (i′, j′) is the cell at which the traceback terminates.

Time Complexity Analysis

Each cell of the table requires only four comparisons and three arithmetic operations, so filling the entire table requires O(nm) time. The search for the maximum value v* and the subsequent traceback also require O(nm) time. Hence:

Theorem 11.7.4: For two strings S₁ and S₂ of lengths n and m, the local alignment problem can be solved in O(nm) time — the same time complexity as the global alignment problem.

Theorem 11.7.5: All optimal local alignments of two strings are represented in the dynamic programming table for v(i, j), and every one of them can be recovered by tracing pointers back from any cell whose value equals v*.

Terminology: Global versus Local Alignment

Aspect	Global Alignment	Local Alignment
Also known as	Needleman–Wunsch alignment	Smith–Waterman alignment
Original algorithm’s time complexity	Cubic time (rarely used now)	Quadratic time (commonly used)
Objective	Align strings over their entire length	Align only the most similar substrings
Appropriate use	Proteins/sequences from the same family, similar length	Sequences from different families, long DNA stretches, conserved domains/motifs

There is some confusion in biological literature between the problem statement and the solution method named after these authors. “Needleman–Wunsch” today usually refers to the global alignment problem itself (since their original cubic-time solution is seldom used), whereas “Smith–Waterman” is used both for the local alignment problem and for the specific quadratic-time solution method, even though other solution methods to the same problem exist and are sometimes also loosely referred to as “Smith–Waterman.”

Finding Multiple Regions of High Similarity

In practical biological applications, it is often not sufficient to identify only a single best-matching pair of substrings. Instead, it may be necessary to find all or many pairs of substrings whose similarity exceeds a certain threshold. The dynamic programming table constructed for the local suffix alignment problem is commonly reused for this purpose: for any cell (i, j) in the table, a pair of substrings of S₁ and S₂ can be recovered through traceback, having a similarity equal to v(i, j). Thus, a practical approach is to identify all cells in the table whose value exceeds a chosen threshold and to trace back each of them to obtain multiple regions of high similarity. This approach does not guarantee that all significant regions will be found, but it is widely used in practice.

Importance of the Scoring Scheme

The effectiveness and biological meaningfulness of the optimal local alignment depend critically on the scoring scheme chosen:

If matches score +1 and mismatches/spaces score 0, the optimal local alignment reduces to the problem of finding the longest common subsequence.
If mismatches and spaces receive large negative scores while matches score +1, the optimal local alignment reduces to finding the longest common substring.

In most practical applications, neither of these extreme cases yields the local alignment of real biological interest. Careful, application-dependent choice of scoring scheme is required. A crucial general principle is that the average score of the scoring matrix entries must be negative; otherwise, the “local” optimal alignment computed will effectively behave like a global alignment, defeating the purpose of local alignment altogether. A well-developed mathematical theory exists (associated with database search applications) explaining precisely what scoring schemes for local alignment signify and how they should be derived.

Significance of Local Alignment

It allows detection of biologically important conserved regions (domains, motifs, active sites) even when the surrounding sequence shows no similarity.
It is essential for comparing sequences from different families that share only isolated functional or structural regions.
It is computationally as efficient as global alignment, at O(nm) time, despite considering all possible substring pairs.
It underlies major bioinformatics tools and database search methods used for sequence comparison.

Conclusion

Local alignment is a fundamental technique in string algorithms and computational biology for identifying regions of high similarity between two strings, even when the strings as a whole are dissimilar. Defined in terms of maximising similarity between chosen substrings, it is solved efficiently in O(nm) time through the Smith–Waterman dynamic programming algorithm, which closely resembles the global alignment recurrence except for the crucial inclusion of zero, which permits the alignment to restart at any point. Biologically, local alignment is indispensable for detecting conserved domains and motifs — such as the homeodomain of homeobox genes, serine protease active sites, and the Helix-Turn-Helix motif — that persist across otherwise divergent protein sequences. The correct choice of scoring scheme, ensuring a negative average score, is essential to obtaining meaningful local alignments rather than degenerating into either the longest common subsequence, the longest common substring, or an effective global alignment.

TRENDING ON HEALTH (TOP)

Fick Method Underestimates Heart Problems in Children After Heart Transplant, Study Finds

For Women on Antidepressants, Creatine Showed a Possible Extra Boost

Did the iPhone Quietly Reshape When and Whether Americans Have Children?

For People Antidepressants Never Helped, a 30-Minute Home Session Is Now FDA-Approved

NOW ON AIR (RBC)

Choices for Gap Weights: Constant, Affine & Arbitrary Gap Weights in Sequence Alignment

Gaps in Sequence Alignment and Their Role in cDNA Matching