The analysis of large genomes usually begins by breaking them into smaller pieces because the genomes of free-living (non-viral) organisms are much larger than the insert sizes that most cloning vectors can accept. Microbial genomes are generally larger than 0.5 × 10⁶ bp and mammalian genomes exceed 10⁹ bp, while cloning vectors such as lambda or cosmid vectors carry inserts of the order of 10⁴ bp and BAC or YAC vectors carry 10⁵ to 10⁶ bp. The genome is therefore represented by a genomic library, which is a collection of clones containing the genome in the form of overlapping inserts in a specified vector. The process of assembling these overlapping clones into continuous segments is called building contigs.
Determination of Number of Clones Required and Concept of Coverage
The number of clones (N) required in a genomic library depends on three parameters: genome length G (in bp), insert length L (in bp), and the probability f_c that any chosen base pair is represented in the library.
The probability that a particular base is recovered in one clone is L/G. Therefore, the probability that it is not covered by one clone is 1 – L/G. After N independent clones, the probability that a base is not covered is (1 – L/G)^N. Hence:
1 – f_c = (1 – L/G)^N
Taking logarithms of both sides and solving for N:

N = log(1 – f_c) / log(1 – L/G)
Example: For the E. coli genome (G = 4.6 × 10⁶ bp) cloned in cosmids (L = 4 × 10⁴ bp) at f_c = 0.95, N ≈ 343 clones. The total DNA represented is 343 × 4 × 10⁴ ≈ 1.4 × 10⁷ bp, about three times the genome size. Some portions are represented several times and others not at all, but on average each position is covered approximately three times.
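The calculation can be reproduced with a few lines of Python. This is a minimal sketch; the function name `clones_required` is only illustrative, and it returns the unrounded value of N so that the rounding step stays visible.

```python
import math


def clones_required(genome_bp: float, insert_bp: float, f_c: float) -> float:
    """Return N = log(1 - f_c) / log(1 - L/G), unrounded."""
    if not 0.0 < f_c < 1.0:
        raise ValueError("f_c must lie strictly between 0 and 1")
    if not 0.0 < insert_bp < genome_bp:
        raise ValueError("insert length must be positive and shorter than the genome")
    return math.log(1.0 - f_c) / math.log(1.0 - insert_bp / genome_bp)


# E. coli in cosmids, using the values from the example above
n_exact = clones_required(4.6e6, 4e4, 0.95)
n_clones = round(n_exact)              # the text quotes the rounded value
coverage = n_clones * 4e4 / 4.6e6      # total cloned DNA divided by genome length

print(f"N = {n_exact:.1f} clones, coverage c = {coverage:.2f}")
```

The result is about 343 clones with a coverage just under 3. When planning a real library one would normally round N up rather than to the nearest integer, so that the target probability is met or exceeded.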
Coverage (c) is the average number of times any base pair is contained in the inserts of the library and is defined as:
c = NL / G
This can be related to the probability by the Poisson approximation:
1 – f_c ≈ e^{-c} or f_c = 1 – e^{-c}
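The approximation follows from the exact expression when the insert is much shorter than the genome (L ≪ G). The intermediate step, written out under that assumption, is:

```latex
\begin{aligned}
1 - f_c &= \left(1 - \frac{L}{G}\right)^{N}
         = \exp\!\left[\,N \ln\!\left(1 - \frac{L}{G}\right)\right] \\
        &\approx \exp\!\left(-\frac{NL}{G}\right) = e^{-c},
\qquad \text{using } \ln(1 - x) \approx -x \text{ for } x = L/G \ll 1 .
\end{aligned}
```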
The coverage required for different probabilities is:
| Coverage (c) | Probability (f_c) |
|---|---|
| 1 | 0.632 |
| 2 | 0.865 |
| 3 | 0.950 |
| 4 | 0.982 |
| 5 | 0.993 |
For 99% probability, about five genome equivalents are needed (c = −ln 0.01 ≈ 4.6). Mammalian genomes require nearly half a million lambda clones for reasonable completeness, and more for 99% coverage. Handling such libraries is labour-intensive and requires robotics and databases. At high coverage, picking further random clones becomes an inefficient way to close the remaining gaps, so directed experimental strategies are adopted instead.
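The table values, and the coverage needed for a chosen probability, follow directly from f_c = 1 – e^{-c}. A short sketch (the function names are illustrative only):

```python
import math


def fraction_covered(c: float) -> float:
    """Poisson approximation: probability that a base pair is in the library."""
    return 1.0 - math.exp(-c)


def coverage_needed(f_c: float) -> float:
    """Invert f_c = 1 - e^(-c) to get the coverage for a target probability."""
    if not 0.0 < f_c < 1.0:
        raise ValueError("f_c must lie strictly between 0 and 1")
    return -math.log(1.0 - f_c)


for c in range(1, 6):
    print(f"c = {c}: f_c = {fraction_covered(c):.3f}")

print(f"coverage for f_c = 0.99: {coverage_needed(0.99):.2f}")
```

The last line gives a coverage of about 4.6, which is why roughly five genome equivalents are quoted for 99% probability.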
Building Restriction Maps and Contigs from Cloned Fragments
Cloning disrupts the natural ordering of DNA segments. Inserts are generated by shearing or restriction digestion and are usually size-selected before ligation (lambda vectors accept 9–20 kb, cosmids ~40 kb). The genomic positions of clones are initially unknown, so the complete restriction map is built piece by piece using the “bottom-up” approach.
Restriction maps of individual clones are determined by partial (incomplete) digestion. If two clones share some of the same restriction sites, and therefore produce restriction fragments of identical length, they are inferred to overlap. The maps are joined at the overlap and extended on both sides to form a contig: a genome segment represented by two or more overlapping clones.

In a large genomic library, nearly all clonable portions of the genome are represented, often many times (heterochromatin is difficult to clone). The goal is to place each clone in its correct genomic location. From clones with multiple overlaps (e.g., at coverage c ≈ 5), a subset of minimally overlapping clones that still spans the whole genome is selected to reduce redundancy. The path through these clones is called the minimal tiling path.
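Once clones have been placed on a common coordinate system, choosing a tiling path is an interval-covering problem. The sketch below uses the standard greedy method (at each step, take the clone that starts inside the region already covered and reaches furthest), which yields a path with the fewest clones. It is an illustration under simplifying assumptions, not the procedure of any particular mapping project: the clone names and coordinates are made up, positions are assumed to be known exactly, and clones that merely touch are treated as joined, whereas a real map would require a detectable overlap.

```python
def minimal_tiling_path(clones, region_start, region_end):
    """Greedy interval cover.

    clones: iterable of (name, start, end) tuples on a shared coordinate system.
    Returns the names of a smallest set of clones spanning the region, in order.
    """
    ordered = sorted(clones, key=lambda clone: clone[1])  # sort by start position
    path = []
    covered_to = region_start
    i = 0
    while covered_to < region_end:
        best = None
        # Scan every clone that starts at or before the covered position
        # and remember the one that extends furthest to the right.
        while i < len(ordered) and ordered[i][1] <= covered_to:
            if best is None or ordered[i][2] > best[2]:
                best = ordered[i]
            i += 1
        if best is None or best[2] <= covered_to:
            raise ValueError(f"gap in clone coverage at position {covered_to}")
        path.append(best[0])
        covered_to = best[2]
    return path


# Hypothetical clones (name, start, end) in kb along a 120 kb region
example_clones = [
    ("A", 0, 40), ("B", 10, 45), ("C", 30, 80),
    ("D", 42, 85), ("E", 60, 100), ("F", 83, 120),
]
tiling = minimal_tiling_path(example_clones, 0, 120)
print(tiling)
```

For these six clones the path is A, C, E, F; clones B and D are redundant because each lies within the span of its neighbours on the path.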
Two cloned inserts may overlap enough for the overlap to be detected, overlap too little for it to be detected, or not overlap at all. A clone with no detected neighbours is a singleton. Contigs and singletons, including clones whose overlaps went undetected, are collectively called islands.
Progress in Contig Assembly and Statistical Analysis
Contig assembly depends on recognising overlaps, and detection is imperfect. Some overlaps are too short to detect (shorter than the average restriction fragment), some are missed because of experimental error in measuring fragment lengths, and some apparent matches are spurious because unrelated regions can yield fragments of similar size. Matches among several fragments are therefore required before an overlap is declared.
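The idea of requiring several matching fragments, each within a measurement tolerance, can be sketched as follows. This is a deliberately simplified illustration: it compares unordered lists of fragment lengths, whereas map-based methods also use the order of the sites. The function names, the 3% tolerance, the threshold of three shared fragments, and the fragment lengths are all assumptions chosen for the example.

```python
def count_shared_fragments(frags_a, frags_b, tolerance=0.03):
    """Count one-to-one length matches between two fragment lists.

    Two fragments match if their lengths differ by no more than
    `tolerance` (a fraction) of the longer one.
    """
    a, b = sorted(frags_a), sorted(frags_b)
    i = j = shared = 0
    while i < len(a) and j < len(b):
        if abs(a[i] - b[j]) <= tolerance * max(a[i], b[j]):
            shared += 1          # accept the match and consume both fragments
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1               # a[i] is too short to match anything left in b
        else:
            j += 1               # b[j] is too short to match anything left in a
    return shared


def clones_overlap(frags_a, frags_b, min_shared=3, tolerance=0.03):
    """Declare an overlap only if at least `min_shared` fragments match."""
    return count_shared_fragments(frags_a, frags_b, tolerance) >= min_shared


# Hypothetical fragment lengths in bp
clone_a = [1200, 2500, 3100, 4800, 650]
clone_b = [3120, 4750, 640, 7200, 900]
clone_c = [1500, 2100, 5600, 8300]

print(clones_overlap(clone_a, clone_b))  # three fragments agree within 3%
print(clones_overlap(clone_a, clone_c))  # no fragments agree
```

Raising `min_shared` reduces false overlaps from chance matches but also raises the minimum detectable overlap, which is the trade-off captured by the parameter θ in the analysis that follows.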
Parameters used for analysis:
- N = number of clones
- L = insert length
- G = genome length
- Ω = minimum overlap required for detection
- θ = Ω / L (fractional overlap)
- c = NL / G (coverage)
Overlaps are detected only if they exceed Ω = θL. The expected number of islands (Γ) is derived from the Poisson distribution by considering the positions of clone ends:
Γ = (G/L) × (c × e^{-(1-θ)c})
Because N = cG/L, this is equivalent to Γ = N × e^{-(1-θ)c}.
The curve of number of islands versus coverage shows that the number first rises (new singletons dominate), then falls as contigs grow and merge by overlapping ends. New clones falling inside existing contigs do not increase island number. Eventually, the number approaches 1 (complete genome in one contig), but the final gap closure is slow because the probability of clones landing in small gaps decreases.
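The rise and fall of the curve can be tabulated from the island formula. Setting the derivative of c × e^{-(1-θ)c} to zero places the maximum at c = 1/(1 – θ). In the sketch below the function name is illustrative and the values of G, L and θ are only example inputs.

```python
import math


def expected_islands(genome_bp: float, insert_bp: float,
                     coverage: float, theta: float) -> float:
    """Expected number of islands: (G/L) * c * exp(-(1 - theta) * c)."""
    return (genome_bp / insert_bp) * coverage * math.exp(-(1.0 - theta) * coverage)


# Example inputs (assumed for illustration)
G, L, THETA = 4.7e6, 1.55e4, 0.19
peak_coverage = 1.0 / (1.0 - THETA)   # where the island count is largest

for c in (0.25, 0.5, 1.0, peak_coverage, 2.0, 3.0, 5.0, 8.0):
    print(f"c = {c:4.2f}: expected islands = {expected_islands(G, L, c, THETA):6.1f}")
```

The island count climbs until the coverage reaches about 1.2 for θ = 0.19 and declines from there. The decline flattens at high coverage, which is the slow gap closure described above.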
The expected number of singletons is:
Number of singletons = N × e^{-2(1-θ)c}
Reality Check Example (Kohara et al., 1987): A complete restriction map of E. coli was produced using 1025 lambda clones (L = 1.55 × 10⁴ bp, G = 4.7 × 10⁶ bp). Coverage c = 3.38 and θ ≈ 0.19 (based on six consecutive sites from eight enzymes with ~32 sites per insert).
Calculated: ~66.3 islands and ~4.3 singletons. Experimental: 70 islands, of which 7 were singletons.
The close agreement between theory and experiment supports the model. Similar early success was reported by Olson et al. (1986).
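The calculated figures can be checked by substituting the quoted parameters into the island and singleton formulas. A minimal sketch, using only the numbers given for this example:

```python
import math

# Parameters quoted for the Kohara et al. (1987) E. coli map
N = 1025          # number of lambda clones
L = 1.55e4        # insert length, bp
G = 4.7e6         # genome length, bp
THETA = 0.19      # minimum detectable overlap as a fraction of L

c = N * L / G                                         # coverage
islands = N * math.exp(-(1.0 - THETA) * c)            # same as (G/L) * c * e^{-(1-theta)c}
singletons = N * math.exp(-2.0 * (1.0 - THETA) * c)   # clones with no detected overlap

print(f"coverage c = {c:.2f}")
print(f"expected islands = {islands:.1f}")
print(f"expected singletons = {singletons:.1f}")
```

The island formula is written here with N in place of (G/L) × c; the two forms are identical because c = NL/G.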

Conclusion
Building contigs from cloned genome fragments is a methodical process that starts with proper library construction and coverage calculation, proceeds through restriction mapping and overlap detection, and is guided by Poisson statistics to monitor assembly progress. It efficiently converts random clones into ordered contigs and a minimal tiling path, providing a complete physical map of the genome. This strategy is highly effective for large-scale genome analysis.