Restriction enzymes cleave DNA at specific recognition sequences known as restriction sites. When a DNA molecule is subjected to restriction digestion, it is fragmented into pieces of varying lengths called restriction fragments. Understanding the statistical distribution of these fragment lengths is essential in molecular biology, genomics, and sequence analysis. A probabilistic model based on the Poisson process allows prediction of fragment length distributions, which can then be verified against experimental data and validated through simulation.
The Poisson Process Model for Restriction Sites
The occurrence of restriction sites along a DNA sequence is modeled as a Poisson process. Under this model, restriction sites are assumed to arise independently and at random positions along the sequence, with a constant rate of λ sites per base pair (bp).
The probability of observing exactly k restriction sites within a sequence interval of length l bp is given by:
This is the classical Poisson probability mass function, where the expected number of sites in the interval is λl. This model is mathematically tractable and serves as the basis for deriving the fragment length distribution.
Derivation of the Fragment Length Distribution
To determine the distribution of restriction fragment lengths, consider a restriction site located at position y along the DNA molecule. The fragment beginning at y has a length greater than x if and only if there are no restriction sites within the interval (y, y + x). From the Poisson model, this probability is:
The cumulative distribution function (CDF) of fragment length X is therefore:
Differentiating the CDF with respect to x yields the probability density function (PDF):
This is the exponential distribution with parameter λ. The mean fragment length is 1/λ. This result is logically consistent: the higher the restriction site density (large λ), the shorter the expected fragment lengths, and vice versa.
Key Properties of the Exponential Distribution for Fragment Lengths
The following properties are derived directly from the exponential model and are important for exam answers:
- The distribution has a monotone decreasing density, meaning short fragments are always more probable than long fragments.
- The mean fragment length is 1/λ.
- The model predicts that the probability of a fragment exceeding length d decreases exponentially: P(X > d) = e^{−λd}.
- The exponential distribution is memoryless, implying that the length of the remaining fragment beyond any point does not depend on how far we have already traveled along the DNA.
Application to Real Data: AluI Digestion of Bacteriophage Lambda DNA
The model is applied to a specific biological example: AluI restriction enzyme digestion of bacteriophage lambda DNA (recognition sequence: AGCT). The relevant parameters and predictions are summarized below.
Table: Model Parameters for AluI Digestion of Bacteriophage Lambda DNA
| Parameter | Value |
|---|---|
| Genome length (n) | 48,502 bp |
| Probability of recognition site per position (p) | 0.003906 |
| Expected number of sites (λ = np) | 189.45 |
| Normalized fragment length variable (x = d/n, d = 1000 bp) | 0.0206 |
| P(fragment > 1000 bp) = e^{−λx} | e^{−3.903} ≈ 0.020 |
| Total fragments observed | 143 |
| Expected fragments > 1000 bp (143 × 0.020) | ≈ 2.9 |
| Observed fragments > 1000 bp | 10 |
The histogram of actual fragment lengths from the AluI digestion shows the characteristic pattern predicted by the exponential model: a high frequency of short fragments and a declining frequency for longer fragments. However, the observed number of fragments exceeding 1000 bp (10 fragments) is substantially higher than the model prediction of approximately 2.9. This discrepancy suggests that the simple Poisson model does not perfectly describe the distribution of longer restriction fragments in bacteriophage lambda DNA, possibly due to non-random base composition along the genome.
Simulation of Restriction Fragment Lengths
Since real sequences may deviate from idealized probabilistic models, simulation provides a powerful complementary approach to study fragment length distributions.
The iid Sequence Model
In the independent and identically distributed (iid) sequence model, each nucleotide position is assumed to be occupied by one of the four bases (A, C, G, T) independently and with equal probability:
Under this model, a sequence of 48,500 positions (close to the length of lambda DNA) is generated computationally and then scanned for restriction sites.
Steps in the Simulation (R-Based Approach)
The simulation proceeds through the following stages:
- Sequence generation: A random DNA sequence of 48,500 nucleotides is sampled from the set {A=1, C=2, G=3, T=4} with uniform probabilities using the
sample()function in R. - Restriction site identification: A custom function (
rsite) scans the sequence for all positions where the AluI recognition sequence (AGCT = [1, 3, 2, 4]) begins. Each complete match is recorded by its starting position. - Fragment length calculation: Once restriction site positions are identified, fragment lengths are computed by subtracting consecutive site positions. For a linear molecule of length N with sites at positions recorded in a map vector, a function (
flengthr) calculates:- The left-end piece (from the molecule start to the first restriction site)
- Intervening fragments (differences between consecutive sites)
- The right-end piece (from the last site to the end of the molecule)
- Verification: The results are verified by confirming that: (a) the number of fragments equals the number of sites plus one (n sites in a linear molecule generate n+1 fragments), and (b) the sum of all fragment lengths equals the total molecule length.
Table: Comparison of Theoretical Prediction, Simulation Results, and Observed Data
| Feature | Theoretical Model | Simulated Sequence | Observed (Lambda DNA) |
|---|---|---|---|
| Sequence length (bp) | 48,502 | 48,500 | 48,502 |
| Expected number of AluI sites | 190 | 184 | 143 |
| Number of fragments | 191 | 185 | 144 |
| Fragment distribution shape | Exponential | Approximately exponential | Approximately exponential |
| Fragments > 1000 bp (expected) | ~3.8 | ~3–5 (varies per run) | 10 |
| Maximum fragment length | Variable | 1,475 bp | >1000 bp (several) |
| Minimum fragment length | Variable | 5 bp | — |
The simulation result of 184 sites (versus the theoretical expectation of 190) demonstrates natural random variability. A single simulation run represents one possible outcome; to accurately estimate the expected number of sites, multiple simulation runs should be performed and the mean computed.
Comparison of Real and Simulated Fragment Length Distributions
A comparison between the simulated and real lambda DNA fragment distributions reveals the following:
- Both distributions show the highest frequency of short fragments, consistent with the exponential model.
- The simulated distribution shows a more regular exponential decline compared to the real distribution.
- Lambda DNA contains significantly more large fragments than predicted by either the exponential model or the simulation. This is reflected in the different y-axis scales between the two histograms, where the real data histogram has a lower count at short lengths but a heavier tail at long lengths.
- This observation suggests that restriction sites in actual genomes are not distributed as randomly as the iid model assumes. Base compositional biases, evolutionary conservation of sequences, and genomic architecture can all cause deviations from the random Poisson model.
Statistical Testing for Goodness-of-Fit
To formally test whether the observed distribution of restriction fragment lengths for lambda DNA differs significantly from the exponential model, a chi-squared (χ²) goodness-of-fit test can be applied. The procedure is as follows:
- Divide the range of fragment lengths into a series of defined bins (e.g., 0–500 bp, 500–1000 bp, 1000–1500 bp, etc.).
- Use the exponential density function to calculate the expected number of fragments in each bin under the model.
- Compare the expected frequencies with the observed frequencies from the lambda DNA data using the chi-squared statistic.
- A statistically significant χ² value would indicate that the real fragment distribution deviates significantly from the exponential model, confirming the visual impression that the longer fragments are more frequent than predicted.
Significance and Limitations of the Model
The Poisson/exponential model for restriction fragment lengths is valuable because it provides a mathematically precise, closed-form prediction that can be directly compared to experimental data. Its key assumption is that restriction sites occur independently and at random with a uniform rate across the genome. While this approximation works reasonably well for many practical purposes, the model has notable limitations:
- Real DNA sequences are not random — they have specific compositional biases (GC content, codon usage, regulatory motifs) that create non-uniform distributions of restriction sites.
- Longer fragments are systematically underestimated, as demonstrated in the lambda DNA example.
- The iid model assumes equal base frequencies (0.25 each), which is rarely the case in real genomes.
Despite these limitations, the exponential distribution model provides an excellent first approximation and a useful baseline against which real data can be evaluated. More complex probabilistic models, including non-homogeneous Poisson processes or Markov chain sequence models, can be employed when greater accuracy is required.
Conclusion
Restriction fragment length distributions arise naturally from the Poisson model of restriction site occurrence and follow an exponential distribution with parameter λ (the site density per bp). The mean fragment length is 1/λ, and the probability of a fragment exceeding length d is e^{−λd}. Application to bacteriophage lambda DNA digested with AluI confirms the overall exponential shape, though real genomes show more large fragments than predicted. Simulation using the iid model and R validates the theoretical predictions for a random sequence and highlights deviations in real biological data. Chi-squared testing provides the formal statistical framework for assessing goodness-of-fit between the theoretical model and observed data.










