Choices for Gap Weights: Constant, Affine & Arbitrary Gap Weights in Sequence Alignment

In alignment problems, the use of gaps in the objective function helps in finding alignments that satisfy an expected shape, as seen in examples like cDNA matching. However, the effectiveness of the gap concept depends critically on how gaps are weighted. There are four general types of gap weights: constant, affine, convex, and arbitrary.

1. Constant Gap Weight

This is the simplest choice of gap weight. Here, each individual space is free, and each gap (regardless of the number of spaces in it) is given a fixed weight of Wg.

If Wm and Wms denote the weights for matches and mismatches respectively, the operator-weight version of the problem is stated as:

Find an alignment A to maximize [Wm(# matches) − Wms(# mismatches) − Wg(# gaps)].

If alphabet-dependent weights are used instead (a similarity function for matches/mismatches), the objective becomes maximizing the sum of similarity scores s(x, y) over aligned characters, minus Wg times the number of gaps, where s(x, −) = s(−, x) = 0 for every character x.

Gaps in Sequence Alignment and Their Role in cDNA Matching

Local Alignment: Finding Substrings of High Similarity | Notes

2. Affine Gap Weight

This is a generalization of the constant gap model, obtained by adding a weight Ws for each space within the gap.

Wg is called the gap initiation weight — it represents the cost of starting a gap.
Ws is called the gap extension weight — it represents the cost of extending the gap by one space.

The operator-weight version of the objective becomes:

Find an alignment to maximize [Wm(# matches) − Wms(# mismatches) − Wg(# gaps) − Ws(# spaces)].

This model is called the affine gap weight model because the weight contributed by a single gap of length q is given by the affine function:

Wg + qWs

The constant gap weight model is simply the affine model with Ws = 0. (The affine model is sometimes also called the linear weight model.)

The alphabet-weight version again sets s(x, −) = s(−, x) = 0, with the objective of maximizing the sum of similarity scores minus Wg(# gaps) minus Ws(# spaces).

The affine gap weight model is the most commonly used gap model in molecular biology literature, although there is considerable disagreement about the correct values of Wg and Ws (and also Wm and Wms). For example, the widely used search program FASTA uses default settings of Wg = 10 and Ws = 2 for aligning amino acid strings.

3. Convex Gap Weight

It has been suggested that certain biological phenomena are better modeled by a gap weight function where each additional space in a gap contributes less to the total weight than the preceding space — that is, a function with a negative second derivative. Such a function is called convex (some call it concave), as opposed to affine.

An example of such a function is:

Wg + logₑ q

where q is the length of the gap. Some biologists have further suggested that a gap function which initially increases to a maximum value and then decreases to near zero would best reflect a combination of different biological phenomena responsible for insertion or deletion of DNA.

4. Arbitrary Gap Weight

This is the most general gap weight model, where the weight of a gap is an arbitrary function w(q) of its length q. The constant, affine, and convex weight models are all special cases of this arbitrary weight model.

Time Bounds for Gap Choices

The time needed to solve the alignment problem optimally depends on the type of gap weight chosen:

If w(q) is a totally arbitrary function of gap length, the optimal alignment can be found in O(nm² + n²m) time, where n and m (m ≥ n) are the lengths of the two strings.
If w(q) is convex, the time can be reduced to O(nm log m).
If w(q) is affine (or constant), the time bound reduces to O(nm) — the same time bound as for alignment without any gap concept at all.

Recurrences for Arbitrary Gap Weights

To align strings S1 and S2, prefixes S1[1..i] and S2[1..j] are considered. Any alignment of these two prefixes falls into one of three types:

Alignments where S1(i) is aligned to a character strictly to the left of S2(j) — the alignment ends with a gap in S1.
Alignments where S1(i) is aligned strictly to the right of S2(j) — the alignment ends with a gap in S2.
Alignments where S1(i) and S2(j) are aligned directly opposite each other (whether matching or mismatching).

These three cases cover all possibilities. Accordingly:

E(i, j) = maximum value of any type 1 alignment
F(i, j) = maximum value of any type 2 alignment
G(i, j) = maximum value of any type 3 alignment
V(i, j) = maximum of E(i, j), F(i, j), G(i, j)

The recurrences are:

V(i, j) = max[E(i, j), F(i, j), G(i, j)]

G(i, j) = V(i−1, j−1) + s(S1(i), S2(j))

E(i, j) = max₀≤k<j [V(i, k) − w(j − k)]

F(i, j) = max₀≤k<i [V(k, j) − w(i − k)]

Base cases — if all spaces (including end spaces) are counted in the objective function, the optimal value is found in cell (n, m), with:

V(i, 0) = −w(i), V(0, j) = −w(j) E(i, 0) = −w(i), F(0, j) = −w(j)

Here G(0,0) = 0, but G(i, j) is undefined when exactly one of i or j is zero, and V(0,0) is naturally taken as zero.

When end gaps are free, the optimal alignment value is the maximum value found in row n or column m, with base case V(0, j) = V(i, 0) = 0.

Time Analysis: Filling the (n+1) × (m+1) table row by row, evaluating G(i, j) requires examining one cell, evaluating E(i, j) requires examining j cells of row i, and evaluating F(i, j) requires examining i cells of column j. This gives O(m²) work per row for E-values and O(n²) work per column for F-values, leading to an overall time bound of O(nm² + n²m).

This increase over the gap-free case (O(nm)) occurs because V(i, j) now depends on looking j cells to the left and i cells above, rather than just the three adjacent cells.

Recurrences for Affine (and Constant) Gap Weights

The key insight for greater efficiency in the affine case is that the increase in gap weight contributed by each additional space is a constant Ws, independent of the gap’s length so far — that is, w(q+1) − w(q) = Ws for any gap length q > 0. This differs from the arbitrary case, where no such fixed relationship exists between w(q) and w(q+1).

Because of this, when evaluating E(i, j) or F(i, j), it is not necessary to track exactly where the gap began — only whether a gap has already started or a new one is being initiated.

Base cases — when end gaps are included:

V(i, 0) = E(i, 0) = −Wg − iWs

V(0, j) = F(0, j) = −Wg − jWs

When end gaps are free: V(i, 0) = V(0, j) = 0.

General recurrences:

V(i, j) = max[E(i, j), F(i, j), G(i, j)]

G(i, j) = V(i−1, j−1) + Wm (score depending on match/mismatch)

E(i, j) = max[E(i, j−1), V(i, j−1) − Wg] − Ws

F(i, j) = max[F(i−1, j), V(i−1, j) − Wg] − Ws

Explanation of E(i, j): Since S1(i) is aligned to the left of S2(j), either:

S1(i) is exactly one place to the left of S2(j), so a new gap begins opposite S2(j), giving E(i, j) = V(i, j−1) − Wg − Ws, or
S1(i) is to the left of S2(j−1), so the same gap already in progress is now opposite S2(j) as well, giving E(i, j) = E(i, j−1) − Ws.

F(i, j) is explained analogously, and G(i, j) simply represents aligning S1(i) directly opposite S2(j).

The optimal alignment value is found in cell (n, m) if end spaces are counted; otherwise it is the maximum value in row n or column m.

Note: It may be asked why V(i, j−1), and not G(i, j−1), is used in the recurrence for E(i, j). Using G(i, j−1) − Wg would miss alignments having a gap in S2 ending opposite character i of S1, followed immediately by a gap in S1. Although the fully expanded recurrence — max[E(i, j−1), G(i, j−1) − Wg, V(i, j−1) − Wg] − Ws — would also be correct, the middle term is redundant since it is already covered by V(i, j−1) − Wg.

Time Analysis: Since V(i, j), E(i, j), F(i, j), and G(i, j) at each cell are each evaluated using only a constant number of references to previously computed values, arithmetic operations, and comparisons, the entire (n+1) × (m+1) table can be filled in O(nm) time — the same as the gap-free alignment model.

Conclusion

Thus, gap weighting in alignment problems ranges from the simplest constant model to the most general arbitrary model, with affine weights forming the standard choice in practice due to their balance between biological interpretability and computational efficiency. While arbitrary and convex gap weights allow more flexible modeling of biological insertion/deletion phenomena, they come at increased computational cost, whereas the affine (and constant) model achieves the same O(nm) time bound as alignment without gaps at all.

TRENDING ON HEALTH (TOP)

Fick Method Underestimates Heart Problems in Children After Heart Transplant, Study Finds

For Women on Antidepressants, Creatine Showed a Possible Extra Boost

Did the iPhone Quietly Reshape When and Whether Americans Have Children?

For People Antidepressants Never Helped, a 30-Minute Home Session Is Now FDA-Approved

NOW ON AIR (RBC)