Restriction endonucleases are enzymes that recognize and cleave specific short sequences in DNA, known as restriction sites. Predicting the number and distribution of these sites along a DNA molecule is an important problem in molecular biology and bioinformatics. To approach this mathematically, a statistical model of the DNA sequence is required. The simplest and most widely used model treats the DNA sequence as a string of independently and identically distributed (iid) letters, and from this foundation, probability theory can be applied to estimate the occurrence of restriction sites.
1. The iid Model for DNA Sequences
When a DNA sample is analyzed, certain properties are typically known — the organism of origin, base composition (%G+C content), and approximate molecular weight. However, detailed sequence information may be unavailable. In such cases, the DNA sequence is modeled as a string of iid letters, meaning each nucleotide position is assumed to be occupied by one of the four bases (A, T, G, C) independently, with probabilities determined by the known base composition of the DNA.
This is the simplest possible model and serves as the starting point for analysis of restriction site distributions.
2. Probability of a Restriction Site
Let the DNA sequence be of length n, and let the recognition sequence of the restriction endonuclease have length t (commonly 4, 6, or 8 base pairs). A random variable X_i is defined at each position i as follows:
- X_i = 1, if position i is the start of a restriction site
- X_i = 0, if it is not
The probability that any position is the start of a restriction site is denoted p. Under the iid model, the bases at successive positions are independent, so p is calculated using the multiplication rule:
p = P(base₁) × P(base₂) × … × P(base_t)
Example: For the enzyme EcoRI, the recognition sequence is 5′-GAATTC-3′. If all four bases are equally frequent (each with probability 0.25), then:
p = (0.25)⁶ ≈ 0.00024
This value of p is very small, a fact that has significant consequences for the probability distribution used to model site counts.
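The multiplication rule above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `site_probability` and the dictionary `uniform` are hypothetical names introduced here, and the base probabilities are the uniform values assumed in the EcoRI example.

```python
from math import prod

def site_probability(site, base_probs):
    """Probability that a given position starts `site` under the iid model.

    Multiplies the probabilities of the individual bases, which is valid
    only because the model assumes independence between positions.
    """
    return prod(base_probs[base] for base in site)

# Uniform base composition, as assumed in the EcoRI example above.
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

p_ecori = site_probability("GAATTC", uniform)
print(f"p = {p_ecori:.6f}")
```

Passing a different dictionary of base probabilities (for example, one derived from a known %G+C content) gives the corresponding p for a non-uniform composition.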
3. Total Number of Restriction Sites
The total number of restriction sites, N, in a DNA molecule of length n is given by:
N = X₁ + X₂ + … + X_m, where m = n − (t − 1)
The sum runs to m rather than n because a site of length t cannot begin in the last (t − 1) positions of the molecule. However, since t is much smaller than n, this end effect is negligible, and for simplicity m ≈ n is used.
If the X_i were truly independent, N would follow a binomial distribution with parameters n and p, giving:
- Expected number of sites: E(N) = np
- Variance: Var(N) = np(1 − p)
In practice, X_i and X_j are not strictly independent when |i − j| < t, because their recognition windows overlap and share bases. Despite this, the binomial approximation performs well in most practical cases.
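As a rough sketch of the definitions in this section, the code below counts sites by summing the indicators X_i over the m = n − (t − 1) valid start positions, and computes the binomial mean and variance. The function names and the toy sequence are hypothetical and chosen only for illustration.

```python
def count_sites(sequence, site):
    """Return N, the sum of indicators X_i over the m = n - (t - 1) start positions."""
    t = len(site)
    m = len(sequence) - (t - 1)
    # X_i = 1 when the window of length t starting at i equals the site.
    return sum(1 for i in range(m) if sequence[i:i + t] == site)

def binomial_mean_var(n, p):
    """Mean and variance of N under the binomial approximation (m taken as n)."""
    return n * p, n * p * (1 - p)

# Toy sequence with two EcoRI sites (illustrative only, not real data).
toy = "TTGAATTCAGGAATTCA"
print(count_sites(toy, "GAATTC"))

mean, var = binomial_mean_var(10_000, 0.25**6)
print(f"E(N) = {mean:.3f}, Var(N) = {var:.3f}")
```

Because every start position is checked, overlapping occurrences are counted separately, which matches the definition of X_i at each position.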
4. Validation with Experimental Data (Bacteriophage Lambda)
The iid model can be tested by comparing predicted site counts with observed counts from real DNA sequences. For bacteriophage lambda (48,502 bp), with observed base frequencies p_A = p_T = 0.2507 and p_C = p_G = 0.2493, predictions were computed for 10 four-base-pair palindromic recognition sequences and compared with counts from the actual sequence (GenBank file NC_001416).
Key findings:
- For most enzymes (e.g., MseI, NlaIII), the observed number of sites was close to the predicted value (~190), which is consistent with the iid model.
- For a few enzymes, the deviation exceeded three standard deviations (SD ≈ 14): BfaI had only 13 observed sites against roughly 190 predicted, and HpaII had 328 observed. This suggests that the BfaI site is under-represented and the HpaII site over-represented in the lambda genome.
- Such deviations may reflect biological factors such as DNA repair mechanisms or methylation patterns specific to the organism.
This comparison demonstrates that while the iid model is a simplification, it provides reliable predictions for most restriction enzymes and serves as a useful null model.
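The comparison above can be reproduced approximately as follows. This is a sketch under stated assumptions: the genome length, base frequencies, and observed counts are the values quoted in this section, while the recognition sequences CTAG (BfaI) and CCGG (HpaII) come from standard enzyme catalogs rather than from the text, and the function names are hypothetical.

```python
from math import prod, sqrt

N_LAMBDA = 48502  # length of bacteriophage lambda in bp, as quoted above
BASE_PROBS = {"A": 0.2507, "T": 0.2507, "C": 0.2493, "G": 0.2493}

def predicted_count(site, n=N_LAMBDA, base_probs=BASE_PROBS):
    """Binomial mean and standard deviation of the site count under the iid model."""
    p = prod(base_probs[b] for b in site)
    m = n - (len(site) - 1)  # number of valid start positions
    return m * p, sqrt(m * p * (1 - p))

def z_score(observed, site):
    """Number of standard deviations between observed and predicted counts."""
    mean, sd = predicted_count(site)
    return (observed - mean) / sd

# Observed counts from the text; recognition sequences are an outside assumption.
observed = {"BfaI": ("CTAG", 13), "HpaII": ("CCGG", 328)}
for enzyme, (site, count) in observed.items():
    mean, sd = predicted_count(site)
    print(f"{enzyme}: predicted {mean:.1f} +/- {sd:.1f}, "
          f"observed {count}, z = {z_score(count, site):.1f}")
```

A z-score far outside the range −3 to 3 flags a site as a candidate for under- or over-representation relative to the iid null model.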
5. The Poisson Approximation to the Binomial
When n is large and p is small (as is the case for restriction sites), computing exact binomial probabilities becomes cumbersome. In such situations, the binomial distribution is well approximated by the Poisson distribution.
Derivation
Starting from the binomial probability formula:
P(N = j) = [n! / ((n−j)! j!)] × p^j × (1−p)^(n−j)
Setting λ = np and using the approximations valid when j ≪ n and p ≪ 1:
- n(n−1)…(n−j+1) ≈ nʲ
- (1−p)^(n−j) = (1−p)^n / (1−p)^j, where (1−p)^j ≈ 1
- (1−p)^n = (1 − λ/n)^n → e^(−λ) as n → ∞
The binomial probability simplifies to:
P(N = j) ≈ (λʲ / j!) × e^(−λ), j = 0, 1, 2, …
This is the Poisson distribution with parameter λ = np.
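A quick numerical check can make the quality of this approximation concrete. The sketch below compares the exact binomial probabilities with the Poisson probabilities for an illustrative choice of n = 10,000 and p = (0.25)⁶; these values and the function names are assumptions made for the example.

```python
from math import comb, exp, factorial

def binomial_pmf(j, n, p):
    """Exact binomial probability P(N = j)."""
    return comb(n, j) * p**j * (1 - p)**(n - j)

def poisson_pmf(j, lam):
    """Poisson probability P(N = j) with parameter lam."""
    return lam**j / factorial(j) * exp(-lam)

n, p = 10_000, 0.25**6  # illustrative values: large n, small p
lam = n * p

for j in range(6):
    print(f"j={j}: binomial {binomial_pmf(j, n, p):.6f}, "
          f"Poisson {poisson_pmf(j, lam):.6f}")

# Largest pointwise difference over the first few values of j.
max_gap = max(abs(binomial_pmf(j, n, p) - poisson_pmf(j, lam)) for j in range(11))
print(f"max gap = {max_gap:.2e}")
```

Repeating the comparison with a larger p or a smaller n shows the two distributions drifting apart, which is the regime where the approximation should not be used.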
Properties of the Poisson Distribution
- Mean: E(N) = λ
- Variance: Var(N) = λ
- Mean and variance are equal, which is a defining feature of the Poisson distribution.
Worked Example
For EcoRI with p = 0.00024 on a DNA molecule of length n = 10,000:
λ = np = 10,000 × 0.00024 = 2.4
P(N ≤ 2) = P(N=0) + P(N=1) + P(N=2) = e^(−2.4)[1 + 2.4 + (2.4²/2)] ≈ 0.5697
Interpretation: Under the iid model with uniform base frequencies, about 57% of DNA molecules of this length are expected to be cut by EcoRI at two or fewer sites. The same value can be computed using the R command ppois(2, 2.4).
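For readers working outside R, the same cumulative probability can be computed with a short Python function. This is a minimal sketch that mirrors what ppois(2, 2.4) returns; the name `poisson_cdf` is hypothetical.

```python
from math import exp, factorial

def poisson_cdf(k, lam):
    """P(N <= k) for a Poisson distribution with parameter lam."""
    return exp(-lam) * sum(lam**j / factorial(j) for j in range(k + 1))

# Worked example: EcoRI on a 10,000 bp molecule, lambda = 2.4.
print(round(poisson_cdf(2, 2.4), 4))  # 0.5697
```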
6. The Poisson Process
The Poisson distribution can be generalized into a Poisson process, which models the occurrence of events (such as restriction sites) along a continuous line (the DNA molecule) at a constant rate μ.
The probability of observing k events in an interval of length l is:
P(k events in (x, x+l)) = e^(−μl) × (μl)^k / k!
Key properties of the Poisson process:
- Events occur uniformly and independently along the molecule.
- For disjoint intervals of lengths l₁ and l₂, the event counts are independent, and their total follows the same formula with length l₁ + l₂.
- The mean number of events is length × rate = μl.
- The concept extends naturally to two-dimensional (area) or three-dimensional (volume) processes. For example, lightning strikes per unit area in a region can be modeled as a Poisson process.
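The interval formula and the additivity property for disjoint intervals can be checked numerically. In the sketch below, the rate μ = 0.00024 per base pair and the interval lengths are illustrative assumptions, and the function names are hypothetical. The second function convolves the counts of two disjoint intervals, treating them as independent.

```python
from math import exp, factorial

def prob_k_events(k, rate, length):
    """P(k events in an interval of the given length) for a Poisson process."""
    mean = rate * length  # mean number of events = rate x length
    return exp(-mean) * mean**k / factorial(k)

def prob_k_events_two_intervals(k, rate, l1, l2):
    """P(k events in total) across two disjoint intervals, by convolution."""
    return sum(
        prob_k_events(i, rate, l1) * prob_k_events(k - i, rate, l2)
        for i in range(k + 1)
    )

rate = 0.00024  # assumed site rate per bp (the EcoRI value of p used earlier)
split = prob_k_events_two_intervals(3, rate, 4_000, 6_000)
whole = prob_k_events(3, rate, 10_000)
print(f"two intervals: {split:.6f}, single interval: {whole:.6f}")
```

The two printed values agree up to floating-point error, which illustrates that only the total length matters, not how it is divided.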
Summary
| Concept | Key Formula / Result |
|---|---|
| Restriction site probability | p = product of individual base probabilities |
| Total sites | N = X₁ + X₂ + … + X_m, with m = n − (t − 1) ≈ n; approximately Binomial(n, p) |
| Expected sites | E(N) = np |
| Variance | Var(N) = np(1−p) |
| Poisson approximation | P(N=j) ≈ (λʲ/j!) e^(−λ), λ = np |
| Poisson process | P(k in length l) = e^(−μl)(μl)^k / k! |
The iid model, combined with the Poisson approximation, provides a mathematically tractable and experimentally validated framework for predicting restriction site distributions in DNA. This approach forms the basis for more advanced analyses of fragment length distributions and sequence word statistics in computational biology.










