Sequence comparison is one of the fundamental problems in computational biology and string algorithms. While global alignment compares two strings in their entirety, many practical situations demand comparison of only certain regions of two strings that show high similarity, even though the strings as a whole may be quite dissimilar. This requirement gives rise to the concept ofย local alignment, also known as theย local similarity problem. Local alignment identifies a pair of substrings one from each of the two given strings โ whose similarity is the maximum possible among all such pairs of substrings. This concept is of great importance in molecular biology, particularly in DNA and protein sequence analysis, where only specific conserved regions of long sequences may be biologically significant.
Definition of Local Alignment Problem
Local Alignment Problem: Given two strings Sโ and Sโ, the task is to find substrings ฮฑ and ฮฒ of Sโ and Sโ respectively, such that the similarity (that is, the value of the optimal global alignment) between ฮฑ and ฮฒ is maximum over all possible pairs of substrings that can be chosen from Sโ and Sโ. The value of this optimal solution is denoted by v*.
It is important to note that local alignment is defined in terms of similarity (a value to be maximised) rather than in terms of edit distance (a value to be minimised). If local alignment were defined using edit distance, the optimal pair of substrings under most natural scoring schemes would simply be exact matching substrings, which could be as short as a single character, and this would fail to identify any meaningful region of resemblance. By defining local alignment through a similarity scheme in which matches contribute positive scores while mismatches and spaces (gaps) contribute negative scores, the algorithm is guided towards discovering longer and more biologically meaningful regions of similarity.
Illustrative Example
Consider two strings:
- Sโ = pqraxabcstvq
- Sโ = xyaxbacsll
Let the scoring scheme assign a value of +2 for a match, โ2 for a mismatch, and โ1 for a space (gap).
Under this scheme, the substrings ฮฑ = axabcs (from Sโ) and ฮฒ = axbacs (from Sโ) yield the following optimal global alignment between themselves:
a x a b c s
a x b a c s
This alignment has a total value of 8, and it can be shown that no other pair of substrings taken one from each of Sโ and Sโ produces a higher similarity value under this scoring scheme. Hence, for this scoring scheme, the local alignment of Sโ and Sโ has optimal value v* = 8, defined by the substring pair (axabcs, axbacs).
Why Local Alignment Is Needed: Biological Justification
Global Alignment and Its Applicability
Global alignment is meaningful when two protein sequences belong to the same protein family and are expected to be similar over their entire length. For example:
- Cytochrome cย has nearly the same length across different organisms, and a meaningful relationship can be expected across the entire sequence length when comparing two species.
- Similarly, proteins of theย globin family, such as myoglobin and hemoglobin, show similarity across their complete lengths.
In such cases, where evolutionary history is deduced by comparing whole protein sequences of the same family, global alignment is effective and appropriate.
Situations Demanding Local Alignment
However, in many important biological contexts, local similarity is far more meaningful than global similarity:
- Comparison of long anonymous DNA stretchesย โ Only certain internal sections of long DNA strings may actually be related, while the rest of the sequence may show no correspondence at all.
- Comparison of protein sequences from different familiesย โ Proteins belonging to different families are frequently composed of common structural or functional subunits, known asย motifsย orย domains. Local alignment is the appropriate technique to search for these unknown, embedded subunits. Different proteins are often built from related motifs that form the protein’s structural core, while the surrounding surface-loop regions may differ considerably between proteins.
- Homeobox genes โ a classic example of conserved domainsย โ Homeobox genes occur in a wide range of species, from fruit flies to frogs to humans, and regulate high-level embryonic development. A single mutation in these genes can transform one body part into another (for instance, causing fruit fly antennae to develop as legs). Although the overall protein sequences encoded by homeobox genes differ greatly between species, one region called theย homeodomainย โ consisting of about sixty amino acids that bind to DNA โ is highly conserved. Certain insect and mammalian homeodomains show 50% to 95% identity even without introducing gaps. Since this DNA-binding region is central to gene regulation, it remains highly conserved while the remainder of the protein varies considerably. This is a case where local alignment is clearly superior to global alignment for meaningful comparison.
- Conservation of isolated critical residuesย โ In many protein families, only a few isolated amino acids are highly conserved, such as the residues at an enzyme’s active site or those forming the hydrophobic core of a globular protein. Local alignment is more likely to detect such conserved residues than global alignment.
- Example: The family ofย serine proteases, characterised by a few isolated conserved amino acids.
- Example: Theย Helix-Turn-Helix (HTH) motif, common in DNA-binding transcription-regulating proteins, where the tenth position is very frequently occupied by glycine while the rest of the motif shows variability.
- Chimeric proteins formed through evolutionย โ As emphasised by C. Chothia, existing proteins have evolved not merely through point mutations, insertions, and deletions, but also through combination of genes to produce chimeric proteins, especially large proteins that appeared later in evolution. Such proteins are often built from combinations of a relatively small repertoire of selected protein domains. As Doolittle summarised, one must remain alert to regions of similarity even when embedded within an overall background of dissimilarity โ precisely what local alignment is designed to detect.
A Balanced View
Although local alignment is regarded as the more appropriate technique for comparing proteins from different families, it has also been observed that pairs of proteins first identified through strong local similarity often also show extensive global similarity. Some studies further suggest that in certain situations, global alignment may expose important biological commonalities more effectively than local alignment. Thus, the choice between local and global alignment should depend on the biological context of the comparison.
Computational Approach to Local Alignment
Why Not Simply Use Global Alignment?
One might ask why local regions of high similarity cannot simply be identified by first performing a global alignment of the two strings. The answer is that although an optimal global alignment is influenced by regions of high similarity, such local regions can easily get “lost” within the overall optimal global alignment of long strings. Therefore, explicit and direct computation of local similarity is required.
Time Complexity
If Sโ and Sโ have lengths n and m respectively, the local alignment problem can be solved in O(nm) time โ the same order of time required for global alignment. This is a remarkable result because there are ฮ(nยฒmยฒ) possible pairs of substrings between the two strings. Even if a global alignment could be computed in constant time for each pair, the overall time would still be ฮ(nยฒmยฒ). Using a naive O(kl) bound for aligning strings of length k and l would give a total time of O(nยณmยณ). The O(nm) algorithm was developed by Temple Smith and Michael Waterman, and is presented below.
The Local Suffix Alignment Problem
To solve the local alignment problem efficiently, a more restricted sub-problem is first defined:
Local Suffix Alignment Problem: Given indices i โค n and j โค m, find a (possibly empty) suffix ฮฑ of Sโ[1..i] and a (possibly empty) suffix ฮฒ of Sโ[1..j] such that the alignment value V(ฮฑ, ฮฒ) is maximum over all such suffix pairs. This maximum value is denoted v(i, j).
Since both suffixes are permitted to be empty, v(i, j) is always greater than or equal to zero.
Example: Using a scoring scheme of +2 for a match and โ1 for a mismatch or space, with Sโ = abcxdex and Sโ = xxxcde:
- v(3, 4) = 2 (the two c’s match)
- v(4, 5) = 1 (cx aligns with cd)
- v(5, 5) = 3 (xcd aligns with xcd type alignment)
- v(6, 6) = 5 (xcde aligns with xcde type alignment)
Relationship Between Local Alignment and Local Suffix Alignment
Theorem 1:ย v* = max{v(i, j) : i โค n, j โค m}.
Proof (in brief): v* โฅ max{v(i, j)} because any optimal solution to the local suffix alignment problem for a given i, j is itself a feasible solution to the general local alignment problem. Conversely, if ฮฑ, ฮฒ is the optimal solution to the local alignment problem, ending at positions i* and j*, then ฮฑ, ฮฒ also forms a local suffix alignment for the pair (i*, j*). Hence v* โค v(i*, j*) โค max{v(i, j)}. Both inequalities together establish the theorem.
Theorem 2:ย If (iโฒ, jโฒ) is the index pair that maximises v(i, j) over all pairs, then the substrings that solve the local suffix alignment problem for (iโฒ, jโฒ) also solve the local alignment problem.
These two theorems together show that solving the local suffix alignment problem for all index pairs and locating the maximum value directly yields the solution to the local alignment problem.
Recurrence Relation for Local Suffix Alignment
Base Case: v(i, 0) = 0 and v(0, j) = 0 for all i, j, since an empty suffix is always a valid choice.
Theorem 11.7.3 (Recurrence): For i > 0 and j > 0:
v(i, j) = max { 0,
v(iโ1, jโ1) + s(Sโ(i), Sโ(j)),
v(iโ1, j) + s(Sโ(i), โ),
v(i, jโ1) + s(โ, Sโ(j)) }
Justification: Let ฮฑ and ฮฒ be the substrings whose global alignment gives the optimal local suffix alignment for (i, j).
- Since ฮฑ and ฮฒ may be empty, 0 is always a valid candidate value.
- If ฮฑ is non-empty, then Sโ(i) must be aligned either with Sโ(j) or with a space.
- If ฮฒ is non-empty, then Sโ(j) must be aligned either with Sโ(i) or with a space.
Considering these possibilities exhaustively:
- If Sโ(i) aligns with Sโ(j): v(i, j) = v(iโ1, jโ1) + s(Sโ(i), Sโ(j))
- If Sโ(i) aligns with a space: v(i, j) = v(iโ1, j) + s(Sโ(i), โ)
- If Sโ(j) aligns with a space: v(i, j) = v(i, jโ1) + s(โ, Sโ(j))
Since all cases are covered, v(i, j) must equal one of these four terms, and it must be the maximum among them, since each term corresponds to a feasible alignment. This establishes the recurrence.
Key Distinguishing Feature: The recurrence for local suffix alignment is almost identical to that of global alignment, differing only in the inclusion of the value zero as a candidate. This zero effectively allows the alignment to “restart” whenever the accumulated score becomes negative, so that leading characters that reduce similarity can simply be ignored (treated as an empty suffix) instead of dragging the score down.
Algorithm and Traceback
The method for computing v* consists of the following steps:
- Construct the dynamic programming table for v(i, j) for all i โค n, j โค m using the above recurrence.
- Locate the maximum value in the entire table, occurring at some cell (i*, j*).
- As usual in dynamic programming, maintain pointers while filling the table to record which term produced each cell’s value.
- Trace back the pointers from cell (i*, j*) until a cell with valueย zeroย is reached โ this marks the start of the optimal local alignment.
- The substrings that constitute the optimal local alignment are ฮฑ = Sโ[iโฒ..i*] and ฮฒ = Sโ[jโฒ..j*], where (iโฒ, jโฒ) is the cell at which the traceback terminates.
Time Complexity Analysis
Each cell of the table requires only four comparisons and three arithmetic operations, so filling the entire table requires O(nm) time. The search for the maximum value v* and the subsequent traceback also require O(nm) time. Hence:
Theorem 11.7.4: For two strings Sโ and Sโ of lengths n and m, the local alignment problem can be solved in O(nm) time โ the same time complexity as the global alignment problem.
Theorem 11.7.5: All optimal local alignments of two strings are represented in the dynamic programming table for v(i, j), and every one of them can be recovered by tracing pointers back from any cell whose value equals v*.
Terminology: Global versus Local Alignment
| Aspect | Global Alignment | Local Alignment |
|---|---|---|
| Also known as | NeedlemanโWunsch alignment | SmithโWaterman alignment |
| Original algorithm’s time complexity | Cubic time (rarely used now) | Quadratic time (commonly used) |
| Objective | Align strings over their entire length | Align only the most similar substrings |
| Appropriate use | Proteins/sequences from the same family, similar length | Sequences from different families, long DNA stretches, conserved domains/motifs |
There is some confusion in biological literature between the problem statement and the solution method named after these authors. “NeedlemanโWunsch” today usually refers to the global alignment problem itself (since their original cubic-time solution is seldom used), whereas “SmithโWaterman” is used both for the local alignment problem and for the specific quadratic-time solution method, even though other solution methods to the same problem exist and are sometimes also loosely referred to as “SmithโWaterman.”
Finding Multiple Regions of High Similarity
In practical biological applications, it is often not sufficient to identify only a single best-matching pair of substrings. Instead, it may be necessary to find all or many pairs of substrings whose similarity exceeds a certain threshold. The dynamic programming table constructed for the local suffix alignment problem is commonly reused for this purpose: for any cell (i, j) in the table, a pair of substrings of Sโ and Sโ can be recovered through traceback, having a similarity equal to v(i, j). Thus, a practical approach is to identify all cells in the table whose value exceeds a chosen threshold and to trace back each of them to obtain multiple regions of high similarity. This approach does not guarantee that all significant regions will be found, but it is widely used in practice.
Importance of the Scoring Scheme
The effectiveness and biological meaningfulness of the optimal local alignment depend critically on the scoring scheme chosen:
- Ifย matches score +1ย andย mismatches/spaces score 0, the optimal local alignment reduces to the problem of finding theย longest common subsequence.
- Ifย mismatches and spaces receive large negative scoresย while matches score +1, the optimal local alignment reduces to finding theย longest common substring.
In most practical applications, neither of these extreme cases yields the local alignment of real biological interest. Careful, application-dependent choice of scoring scheme is required. A crucial general principle is that the average score of the scoring matrix entries must be negative; otherwise, the “local” optimal alignment computed will effectively behave like a global alignment, defeating the purpose of local alignment altogether. A well-developed mathematical theory exists (associated with database search applications) explaining precisely what scoring schemes for local alignment signify and how they should be derived.
Significance of Local Alignment
- It allows detection of biologically important conserved regions (domains, motifs, active sites) even when the surrounding sequence shows no similarity.
- It is essential for comparing sequences from different families that share only isolated functional or structural regions.
- It is computationally as efficient as global alignment, at O(nm) time, despite considering all possible substring pairs.
- It underlies major bioinformatics tools and database search methods used for sequence comparison.
Conclusion
Local alignment is a fundamental technique in string algorithms and computational biology for identifying regions of high similarity between two strings, even when the strings as a whole are dissimilar. Defined in terms of maximising similarity between chosen substrings, it is solved efficiently in O(nm) time through the SmithโWaterman dynamic programming algorithm, which closely resembles the global alignment recurrence except for the crucial inclusion of zero, which permits the alignment to restart at any point. Biologically, local alignment is indispensable for detecting conserved domains and motifs โ such as the homeodomain of homeobox genes, serine protease active sites, and the Helix-Turn-Helix motif โ that persist across otherwise divergent protein sequences. The correct choice of scoring scheme, ensuring a negative average score, is essential to obtaining meaningful local alignments rather than degenerating into either the longest common subsequence, the longest common substring, or an effective global alignment.










