Gaps in Sequence Alignment and Their Role in cDNA Matching

In the study of string alignment and dynamic programming, the basic elements used to evaluate an alignment are matches, mismatches, and spaces. However, these alone are not sufficient to produce alignments that are biologically meaningful. To overcome this limitation, the concept of a gap is introduced. Gaps allow an alignment to better reflect real biological mutational events and help produce alignments whose structure matches patterns actually observed in living organisms, particularly in DNA and protein sequences.

Definition of a Gap

A gap is defined as any maximal, consecutive run of spaces occurring in a single string of a given alignment. In other words, when several spaces appear one after another (without interruption) in one string of the alignment, that entire stretch is treated as a single gap rather than as several individual spaces.

Certain rules apply while identifying a gap:

If a gap begins before the start of a string, it is bordered on its right side by the first character of that string.
If a gap begins after the end of a string, it is bordered on its left side by the last character of that string.
In all other cases, a gap must be bordered on both sides by actual characters of the string.
A gap can be as small as a single space, or it can extend to cover many consecutive spaces.

It is important to note that if the last space of a gap in one string is immediately followed by a space in the other string, these spaces belong to two separate gaps (one in each string) and are not merged into a single gap, since they occur in different strings. Two spaces are never aligned opposite each other.
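As a minimal sketch of the definition above, assume each row of an alignment is a Python string in which '-' marks a space. That representation, the function name gaps_and_spaces, and the two example rows are choices made here for illustration and do not come from the text. Gaps are then simply the maximal runs of '-' within one row.

```python
from itertools import groupby

def gaps_and_spaces(row, space="-"):
    """Return (number of gaps, number of spaces) in one row of an alignment."""
    # groupby yields one group per maximal run of identical characters,
    # so every run of the space symbol corresponds to exactly one gap.
    run_lengths = [len(list(run)) for ch, run in groupby(row) if ch == space]
    return len(run_lengths), sum(run_lengths)

# Hypothetical alignment rows (same length, never two spaces in one column).
row1 = "AC--GT-A"
row2 = "ACTTG-CA"

print(gaps_and_spaces(row1))  # (2, 3)
print(gaps_and_spaces(row2))  # (1, 1)
```

In this example the space in the sixth column of row2 is immediately followed by a space in the seventh column of row1. Because each row is scanned on its own, the two spaces are counted as two gaps, giving three gaps and four spaces in total.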

Gaps in the Objective Function

In the simplest scoring scheme that accounts for gaps, each gap is given a constant weight, denoted Wg, regardless of how many spaces it contains. Within a gap, each individual space is “free”: aligning a character with a space, or a space with a character, contributes zero to the score. The value of an alignment with k gaps is then the sum of the usual match and mismatch scores minus k × Wg, so Wg acts as a fixed penalty charged once per gap.
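The constant-gap-weight objective can be sketched as follows. This is an illustration under stated assumptions, not a reference implementation: the score is maximized, Wg is treated as a penalty subtracted once per gap, '-' marks a space, and the match score (+1), mismatch score (-1) and default Wg (2) are hypothetical values chosen only for the example.

```python
from itertools import groupby

def count_gaps(row, space="-"):
    """Number of maximal runs of the space symbol in one alignment row."""
    return sum(1 for ch, _ in groupby(row) if ch == space)

def alignment_value(row1, row2, match=1, mismatch=-1, wg=2, space="-"):
    """Sum of match/mismatch scores minus wg for every gap (spaces are free)."""
    if len(row1) != len(row2):
        raise ValueError("alignment rows must have equal length")
    column_score = 0
    for x, y in zip(row1, row2):
        if x == space or y == space:
            continue  # a character opposite a space contributes zero
        column_score += match if x == y else mismatch
    k = count_gaps(row1, space) + count_gaps(row2, space)
    return column_score - k * wg

# Hypothetical alignment: 4 matches, 3 gaps containing 4 spaces.
print(alignment_value("AC--GT-A", "ACTTG-CA"))        # 4 - 3*2 = -2
print(alignment_value("AC--GT-A", "ACTTG-CA", wg=0))  # 4
```

Note that the number of spaces never enters the formula; only the number of gaps k does, which is what distinguishes this scheme from a per-space penalty.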

The value of Wg has a direct influence on the shape of the optimal alignment:

A large value of Wg discourages the formation of many gaps, so the alignment tends to have long, unbroken aligned regions.
A smaller value of Wg permits more gaps, so the alignment can break into many shorter aligned fragments.

Thus, by tuning Wg, one can control how spaces are distributed across the alignment, which in turn determines whether the alignment appears as a few long blocks or many small scattered pieces.

Why Gaps Are Biologically Important

The biological reasoning behind introducing gaps is closely connected to the justification for local alignment. A gap of one string opposite a substring of the other string corresponds to either a deletion of that substring from the first string, or an insertion of that substring into the second string. Such insertion or deletion of an entire substring, especially in DNA, typically happens as the result of a single mutational event, rather than through many small individual mutations.

Several biological mechanisms are responsible for creating gaps of varying sizes:

Unequal crossing-over during meiosis – causes an insertion in one strand and a corresponding deletion in the other.
DNA slippage during replication – occurs when the replication machinery loses its position on the template, slips backward, and repeats a section, producing a duplicated stretch.
Insertion of transposable elements (“jumping genes”) – these mobile genetic elements insert themselves into DNA strings.
Insertion of DNA by retroviruses – viral genetic material becomes inserted into the host DNA.
Translocation of DNA between chromosomes – DNA segments move from one chromosome to another.

These mechanisms explain why long insertions and deletions, appearing as gaps, are common biological phenomena and must be modeled explicitly rather than treated as a series of unrelated single-space mutations.

Gaps and Evolutionary History

When determining the evolutionary relationship between species over long time periods, gaps often carry more informative value than simple substitutions. This is because point mutations (single character substitutions) occur very frequently and rapidly, whereas mutational events that create gaps occur much less frequently. As a result, corresponding genes in two different species may show large differences at the level of individual character substitutions, making it difficult to establish evolutionary relationships using substitution-based similarity alone. However, since large insertions and deletions are rarer events, shared gaps in aligned strings can serve as reliable markers for reconstructing evolutionary history and are sometimes used as “evolutionary characters” while building evolutionary trees.

Gaps and Protein Domains

At the protein level, many proteins are built from combinations of protein domains selected from a limited repertoire. Consequently, two protein sequences may be highly similar over certain stretches but differ where one protein contains a domain absent in the other. Such a region, where one protein has an extra domain, naturally appears as a gap when the two proteins are aligned. In fact, many biologists regard the correct identification of these long (“major”) gaps as the central challenge of protein alignment — once major gaps are correctly placed, the remaining alignment (reflecting ordinary point mutations) becomes relatively straightforward.

cDNA Matching: An Illustration of Gap Importance

Biological Background

In eukaryotic organisms, a gene consists of alternating exons (expressed, protein-coding sequences) and introns (intervening, non-coding sequences). While the number of exons and introns per gene is usually modest, introns can be far longer than exons.

The process of protein synthesis from a eukaryotic gene proceeds as follows:

An RNA transcript is first produced from the gene’s DNA, covering both introns and exons, with base complementation (A→U, T→A, C→G, G→C).
The intron-exon boundaries are identified, and introns are spliced out of the transcript.
The remaining exon sequences are joined together to form messenger RNA (mRNA), which leaves the nucleus and directs protein synthesis.
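The three steps above can be sketched in a few lines. This is a deliberately simplified model: strand orientation is ignored, and the exon coordinates are supplied by hand as hypothetical (start, end) pairs, whereas in a real cell the splicing machinery recognizes the intron-exon boundaries itself. The function names and the toy sequence are illustrative only.

```python
# Base complementation used during transcription: A→U, T→A, C→G, G→C.
COMPLEMENT = str.maketrans("ATCG", "UAGC")

def transcribe(template_dna):
    """Return the RNA transcript of a DNA template (introns and exons alike)."""
    return template_dna.translate(COMPLEMENT)

def splice(transcript, exons):
    """Join the exon pieces of a transcript.

    exons is a list of half-open (start, end) index pairs, assumed known here.
    """
    return "".join(transcript[start:end] for start, end in exons)

template = "TACGGATTCA"          # hypothetical template strand
transcript = transcribe(template)
mrna = splice(transcript, [(0, 3), (7, 10)])  # hypothetical exon boundaries

print(transcript)  # AUGCCUAAGU
print(mrna)        # AUGAGU
```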

Since only a fraction of genes are expressed in any specific cell type, capturing the mRNA of a particular cell allows researchers to determine which genes are active. This captured mRNA is converted into a complementary DNA strand known as cDNA, which unlike the original gene contains only the exon sequences (introns removed). Building comprehensive cDNA libraries for various cell types formed a major component of the Human Genome Project and also led to disputes regarding patenting of cDNA sequences.

The Matching Problem

Once cDNA is obtained, the challenge is to locate the corresponding gene within a long, sequenced stretch of genomic DNA. This becomes a string alignment problem: aligning the shorter cDNA string against the longer genomic DNA string in a manner that correctly reveals the locations of the exons. Since the cDNA lacks introns, the expected alignment consists of a few short regions of very high similarity (matching exons) separated by long gaps (corresponding to introns). Some mismatches and spaces may occur within the matching regions due to sequencing errors, but these should form only a small percentage of each region.

Why an Objective Function Without Gaps Fails

If the scoring scheme includes only matches, mismatches, and spaces (with no explicit gap term), the resulting alignment does not correctly capture the exon structure. The reasoning is as follows:

A low (or zero) space penalty is required, because a high one would force the cDNA characters to stay close together instead of allowing the long gaps that correspond to introns.
A high mismatch penalty is required, since only a small number of mismatches (due to sequencing errors) are expected.
Given these settings, but with no gap term, the optimal alignment tends to become simply the longest common subsequence (LCS) between the cDNA and the long genomic string.
Since DNA has only four possible characters and introns are long, this LCS is likely to match nearly all characters of the cDNA, producing a higher score than the true, biologically correct exon-matching alignment (which would leave a few mismatches due to sequencing errors).
However, this LCS-based alignment scatters the cDNA thinly across the entire genomic sequence instead of correctly picking out the compact exon regions, thereby failing to reflect true biology.
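A small experiment makes this failure concrete. The sketch below uses a standard dynamic program for the length of the longest common subsequence; the toy sequences (two 6-character exons, a 16-character intron, and a cDNA with one simulated sequencing error) are invented for illustration and are far shorter than real data.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of a and b (row-by-row DP)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            if x == y:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

# Hypothetical toy data.
exon1, intron, exon2 = "ACGTAC", "GTTTACAGGTCCATAG", "TTGCAA"
genomic = exon1 + intron + exon2
cdna = "ACGTAG" + "TTGCAA"   # exon1 + exon2 with one error (C read as G)

# Matches obtained when the cDNA is laid exactly over the two exons.
exon_matches = sum(x == y for x, y in zip(cdna, exon1 + exon2))

print(len(cdna), exon_matches, lcs_length(cdna, genomic))  # 12 11 12
```

The biologically correct placement matches 11 of the 12 cDNA characters, while the LCS matches all 12 by borrowing characters scattered through the intron. With free spaces and no gap term, the scattered alignment therefore scores higher than the correct one.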

The Solution: Adding a Gap Term

By introducing a constant gap weight Wg into the objective function for each gap present, and tuning Wg appropriately, the optimal alignment can be made to correctly cut the cDNA into segments matching its exons in the longer DNA sequence. This demonstrates the practical necessity of the gap concept for solving real biological alignment problems.
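One way to compute an optimal alignment under a constant gap weight is a three-table dynamic program that tracks how the last column of the alignment ends. The sketch below is an assumption-laden illustration rather than the method of any particular tool: it performs a global alignment, treats Wg as a penalty charged when a gap is opened, lets every further space in that gap be free, and uses hypothetical scores (match +1, mismatch -3, Wg 4) and toy sequences.

```python
NEG = float("-inf")

def align_constant_gap(s1, s2, match=1, mismatch=-3, wg=4):
    """Global alignment maximising column scores minus wg per gap.

    Returns (value, row1, row2), with '-' marking spaces.
    """
    n, m = len(s1), len(s2)
    # Best value for s1[:i] versus s2[:j] when the last column is
    #   G: s1[i-1] opposite s2[j-1]
    #   E: a space in s1 opposite s2[j-1]
    #   F: s1[i-1] opposite a space in s2
    G = [[NEG] * (m + 1) for _ in range(n + 1)]
    E = [[NEG] * (m + 1) for _ in range(n + 1)]
    F = [[NEG] * (m + 1) for _ in range(n + 1)]
    G[0][0] = 0                      # empty alignment
    for j in range(1, m + 1):
        E[0][j] = -wg                # a single leading gap in s1
    for i in range(1, n + 1):
        F[i][0] = -wg                # a single leading gap in s2
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if s1[i - 1] == s2[j - 1] else mismatch
            G[i][j] = max(G[i - 1][j - 1], E[i - 1][j - 1], F[i - 1][j - 1]) + s
            # Extending an open gap is free; opening a new one costs wg.
            E[i][j] = max(E[i][j - 1], max(G[i][j - 1], F[i][j - 1]) - wg)
            F[i][j] = max(F[i - 1][j], max(G[i - 1][j], E[i - 1][j]) - wg)

    tables = {"G": G, "E": E, "F": F}

    def best_state(i, j, states):
        return max(states, key=lambda st: tables[st][i][j])

    # Trace back from the bottom-right corner to recover one optimal alignment.
    i, j = n, m
    state = best_state(i, j, "GEF")
    value = tables[state][i][j]
    row1, row2 = [], []
    while i > 0 or j > 0:
        if state == "G":
            row1.append(s1[i - 1])
            row2.append(s2[j - 1])
            i, j = i - 1, j - 1
            state = best_state(i, j, "GEF")
        elif state == "E":
            row1.append("-")
            row2.append(s2[j - 1])
            if E[i][j] != E[i][j - 1]:   # the gap was opened in this column
                state = best_state(i, j - 1, "GF")
            j -= 1
        else:
            row1.append(s1[i - 1])
            row2.append("-")
            if F[i][j] != F[i - 1][j]:   # the gap was opened in this column
                state = best_state(i - 1, j, "GE")
            i -= 1
    return value, "".join(reversed(row1)), "".join(reversed(row2))

# Hypothetical toy data: two exons separated by a 16-character intron.
genomic = "ACGTAC" + "GTTTACAGGTCCATAG" + "TTGCAA"
value, cdna_row, genomic_row = align_constant_gap("ACGTACTTGCAA", genomic)
print(value)
print(cdna_row)
print(genomic_row)
```

For this toy input the cDNA row comes out as the first exon, one gap of 16 spaces, and the second exon, so the single gap marks the intron. Setting wg=0 removes the gap term and lets the alignment scatter again, which is one way to see the effect of tuning Wg. The tables use memory proportional to the product of the two lengths, so this sketch is only suitable for short sequences.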

Processed Pseudogenes

A more complex variant of this problem involves pseudogenes — near-copies of functional genes that have mutated enough to lose their function, and which are common in eukaryotes. Pseudogenes are believed to arise through gene duplication followed by mutation, representing either failed trial genes or potential future genes. A pseudogene typically retains both introns and exons of its original gene and may be located far from, or even on a different chromosome than, its parent gene.

A more specific and interesting case is the processed pseudogene, which contains only exon sequences (introns already removed). It is formed when an mRNA is reverse-transcribed back into DNA by the enzyme reverse transcriptase and inserted at a random location in the genome. Locating processed pseudogenes is similar to cDNA matching but more difficult, because no cDNA sequence is available to align against; it requires repeat-finding methods, local alignment techniques, and careful gap-weight selection.

Caveat: Alternative Practical Approach

Although the gap-weighted alignment model is theoretically important, in practice, cDNA and pseudogene matching problems are often approached using local alignment without explicit gap weighting. Local alignment algorithms can identify multiple highly similar substring pairs (not just the single best pair). In the cDNA/pseudogene context, these individual highly similar pairs typically correspond to the exons themselves, and the complete match between cDNA and the gene can be reconstructed by piecing together several non-overlapping local alignments. This method is, in fact, the more commonly used approach in practical applications.

Conclusion

The introduction of gaps as a distinct construct in sequence alignment allows the alignment model to reflect real biological mutational events such as insertions, deletions, transposon activity, and DNA slippage, which typically occur as single events affecting long stretches of sequence rather than as many independent single-character changes. By assigning an appropriate gap weight (Wg) in the objective function, alignments can be guided to correctly represent structures such as exon-intron boundaries in cDNA matching, evolutionary relationships through shared gaps, and protein domain differences. This makes the concept of gaps essential for producing alignments that are both computationally optimal and biologically realistic.
