Building Contigs from Cloned Genome Fragments: Coverage, Assembly & Statistical Analysis

The analysis of large genomes usually begins by breaking them into smaller pieces because the genomes of free-living (non-viral) organisms are much larger than the insert sizes that most cloning vectors can accept. Microbial genomes are generally larger than 0.5 × 10⁶ bp and mammalian genomes exceed 10⁹ bp, while cloning vectors such as lambda or cosmid vectors carry inserts of the order of 10⁴ bp and BAC or YAC vectors carry 10⁵ to 10⁶ bp. The genome is therefore represented by a genomic library, which is a collection of clones containing the genome in the form of overlapping inserts in a specified vector. The process of assembling these overlapping clones into continuous segments is called building contigs.

Determination of Number of Clones Required and Concept of Coverage

The number of clones (N) required in a genomic library depends on three parameters: genome length G (in bp), insert length L (in bp), and the probability f_c that any chosen base pair is represented in the library.

The probability that a particular base is recovered in one clone is L/G. Therefore, the probability that it is not covered by one clone is 1 – L/G. After N independent clones, the probability that a base is not covered is (1 – L/G)^N. Hence:

1 – f_c = (1 – L/G)^N

Solving for N (after taking logarithms of both sides):

N = log(1 – f_c) / log(1 – L/G)

**Figure:** Coverage of a genome of length G by a genomic library containing clones with insert length L. Some genomic regions (shaded) are represented multiple times, while others are represented only once or are absent from the library. Genome coverage refers to the average number of times each nucleotide position is represented by cloned inserts. The broken lines mark the boundaries of the three contigs, which are continuous, gap-free assemblies formed from overlapping clones.

Example: For the E. coli genome (G = 4.6 × 10⁶ bp) cloned in cosmids (L = 4 × 10⁴ bp) at f_c = 0.95, N = 343 clones. The total DNA represented is 343 × 4 × 10⁴ ≈ 1.4 × 10⁷ bp, which is about three times the genome size. Some portions are represented multiple times and others not at all, but on average each position is covered approximately three times.
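The calculation in this example can be reproduced with a few lines of Python. This is a minimal sketch; the function name `clones_required` is chosen here for illustration and does not come from the source.

```python
import math

def clones_required(genome_bp: float, insert_bp: float, f_c: float) -> float:
    """Return N = log(1 - f_c) / log(1 - L/G), without rounding."""
    if not 0.0 < f_c < 1.0:
        raise ValueError("f_c must lie strictly between 0 and 1")
    if not 0.0 < insert_bp < genome_bp:
        raise ValueError("insert length must be positive and shorter than the genome")
    return math.log(1.0 - f_c) / math.log(1.0 - insert_bp / genome_bp)

# E. coli genome in cosmids, 95% probability that a given base is represented
n = clones_required(4.6e6, 4e4, 0.95)
print(f"N = {n:.1f} clones")                       # about 343
print(f"coverage = {round(n) * 4e4 / 4.6e6:.2f}")  # about 3 genome equivalents
```

The function returns the unrounded value so that the caller can decide how to round it.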

Coverage (c) is the average number of times any base pair is contained in the inserts of the library and is defined as:

c = NL / G

This can be related to the probability by the Poisson approximation:

1 – f_c ≈ e^{-c} or f_c = 1 – e^{-c}
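The approximation follows from the exact expression when the insert is much shorter than the genome. A short derivation sketch, using only the symbols already defined and the standard expansion ln(1 – x) ≈ –x for small x:

```latex
\begin{align*}
1 - f_c &= \left(1 - \frac{L}{G}\right)^{N}
         = \exp\!\left[\,N \ln\!\left(1 - \frac{L}{G}\right)\right] \\
        &\approx \exp\!\left(-\frac{NL}{G}\right) = e^{-c},
        \qquad \text{since } \ln(1 - x) \approx -x \text{ for } x = \frac{L}{G} \ll 1 .
\end{align*}
```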

The coverage required for different probabilities is:

Coverage (c)	Probability (f_c)
1	0.632
2	0.865
3	0.950
4	0.982
5	0.993
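The table can be regenerated, and inverted to find the coverage needed for a target probability, with the sketch below. The function names are illustrative assumptions, not taken from the source.

```python
import math

def fraction_covered(c: float) -> float:
    """Poisson approximation: f_c = 1 - e^(-c)."""
    return 1.0 - math.exp(-c)

def coverage_needed(f_c: float) -> float:
    """Inverse relation: c = -ln(1 - f_c)."""
    if not 0.0 <= f_c < 1.0:
        raise ValueError("f_c must satisfy 0 <= f_c < 1")
    return -math.log(1.0 - f_c)

# Reproduce the coverage/probability table
for c in range(1, 6):
    print(c, f"{fraction_covered(c):.3f}")

# Coverage needed for a 99% chance that a given base is represented
print(f"{coverage_needed(0.99):.2f}")  # 4.61 genome equivalents
```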

For 99% probability, about five genome equivalents are needed (c = –ln 0.01 ≈ 4.6). A mammalian genome requires nearly half a million lambda clones for reasonable completeness, and more for 99% coverage. Handling such libraries is labour-intensive and requires robotics and databases. At high coverage, adding more random clones becomes an inefficient way to close the remaining gaps, so directed experimental strategies are adopted instead.

Building Restriction Maps and Contigs from Cloned Fragments

Cloning disrupts the natural ordering of DNA segments. Inserts are generated by shearing or restriction digestion and are usually size-selected before ligation (lambda vectors accept 9–20 kb, cosmids ~40 kb). The genomic positions of clones are initially unknown, so the complete restriction map is built piece by piece using the “bottom-up” approach.

Restriction maps of individual clones are determined by partial (incomplete) digestion. If two clones share some of the same restriction sites, and therefore produce restriction fragments of identical length, they overlap. The maps are joined at the overlap and extended on both sides to form a contig, which is a genome segment represented by two or more overlapping clones.
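The idea of declaring an overlap from shared fragment lengths can be sketched as follows. This is a deliberately simplified illustration, not the procedure of any particular mapping project: it compares unordered fragment lengths only, whereas real maps also use the order of sites. The function names, the 3% size tolerance, and the threshold of three shared fragments are all assumptions made for the example.

```python
def shared_fragments(frags_a, frags_b, tol=0.03):
    """Count fragments of clone A that match a distinct fragment of clone B.

    Two lengths match when they differ by at most `tol` (relative),
    which stands in for gel measurement error (assumed value).
    """
    remaining = sorted(frags_b)
    shared = 0
    for length in sorted(frags_a):
        for i, other in enumerate(remaining):
            if abs(length - other) <= tol * max(length, other):
                shared += 1
                del remaining[i]  # each fragment of B may be matched only once
                break
    return shared

def clones_overlap(frags_a, frags_b, min_shared=3, tol=0.03):
    """Declare an overlap only when several fragments agree (assumed threshold)."""
    return shared_fragments(frags_a, frags_b, tol) >= min_shared

# Hypothetical fragment lengths in bp
clone_x = [1200, 3400, 5100, 800, 2600]
clone_y = [5100, 810, 2600, 4300, 1500]
clone_z = [900, 7000, 2100]

print(clones_overlap(clone_x, clone_y))  # True: three fragments match
print(clones_overlap(clone_x, clone_z))  # False: no fragments match
```

Requiring several matching fragments rather than one reduces the chance that similar-sized fragments from unrelated regions are mistaken for an overlap.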

**Figure:** **Mapping a large genome by assembling overlapping representative clones.** **(A)** Clones X and Y are identified as overlapping because they share common restriction fragments, allowing their individual restriction maps to be merged into a single contig. **(B)** Cloned inserts may overlap sufficiently for the overlap to be detected (clones 1 and 2), overlap too little for reliable detection (clones 3 and 4), or exist as isolated **singletons** (clone 5) with no overlap information from neighboring clones. Contigs, singletons, and selected clones with undetected overlaps together form genomic **islands**. **(C)** After overlapping clones are assembled across an entire genome or genomic region, multiple clones typically cover each position (average coverage **c ≈ 5**). To eliminate redundancy, a set of minimally overlapping clones (shown by heavy lines) is selected while maintaining complete genome coverage. The ordered sequence of these clones (indicated by arrows) is known as the **minimal tiling path**.

In a large genomic library, nearly all clonable portions of the genome are represented, often many times (heterochromatin is difficult to clone). The goal is to place each clone in its correct genomic location. Once the clones are ordered, most positions are covered by several clones (e.g., at coverage c ≈ 5). To reduce this redundancy, a subset of minimally overlapping clones that still spans the whole genome is selected. The ordered path through these clones is called the minimal tiling path.
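Once clones have been placed on a common coordinate system, choosing a small spanning subset is an interval-covering problem. The sketch below uses the standard greedy method (at each step, take the clone that starts within the region already covered and reaches furthest), which gives a cover with the fewest clones. It is an illustration under simplifying assumptions: clone positions are known exactly, and no minimum overlap between neighbours is enforced, although a real tiling path would normally require one. The function name and the clone coordinates are hypothetical.

```python
def minimal_tiling_path(clones, region_start, region_end):
    """Greedy interval cover.

    clones: iterable of (name, start, end) in map coordinates.
    Returns clone names, in order, covering [region_start, region_end].
    Raises ValueError if the clones leave a gap.
    """
    ordered = sorted(clones, key=lambda clone: clone[1])
    path = []
    covered_to = region_start
    i = 0
    while covered_to < region_end:
        best = None
        # Among clones starting inside the covered region, keep the one reaching furthest.
        while i < len(ordered) and ordered[i][1] <= covered_to:
            if best is None or ordered[i][2] > best[2]:
                best = ordered[i]
            i += 1
        if best is None or best[2] <= covered_to:
            raise ValueError(f"gap in clone coverage at position {covered_to}")
        path.append(best[0])
        covered_to = best[2]
    return path

# Hypothetical clone positions in kb
clones = [("A", 0, 40), ("B", 10, 55), ("C", 30, 80),
          ("D", 50, 95), ("E", 70, 120), ("F", 90, 130)]
print(minimal_tiling_path(clones, 0, 130))  # ['A', 'C', 'E', 'F']
```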

Cloned inserts may overlap enough for the overlap to be detected, overlap too little for detection, or exist as singletons with no detectable neighbours. Contigs, singletons, and clones whose overlaps went undetected are collectively counted as islands.

Progress in Contig Assembly and Statistical Analysis

Contig assembly depends on recognising overlaps. True overlaps can be missed when they are too short to detect (shorter than the average restriction fragment) or when fragment lengths are measured inaccurately. Conversely, unrelated regions can produce fragments of similar size and suggest an overlap that does not exist. Matches among several fragments are therefore required before an overlap is declared.

Parameters used for analysis:

N = number of clones
L = insert length
G = genome length
Ω = minimum overlap required for detection
θ = Ω / L (fractional overlap)
c = NL / G (coverage)

Overlaps are detected only if they exceed Ω = θL. The expected number of islands (Γ) is derived from the Poisson distribution by considering the positions of clone ends:

Γ = (G/L) × c × e^{-(1-θ)c} = N × e^{-(1-θ)c}

The curve of number of islands versus coverage shows that the number first rises (new singletons dominate), then falls as contigs grow and merge by overlapping ends. New clones falling inside existing contigs do not increase island number. Eventually, the number approaches 1 (complete genome in one contig), but the final gap closure is slow because the probability of clones landing in small gaps decreases.
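The position of the maximum of this curve can be found by differentiating the island formula with respect to c. This is a short worked step, assuming the expression for Γ given above:

```latex
\begin{align*}
\Gamma(c) &= \frac{G}{L}\, c\, e^{-(1-\theta)c}, \\
\frac{d\Gamma}{dc} &= \frac{G}{L}\, e^{-(1-\theta)c}\,\bigl[\,1 - (1-\theta)\,c\,\bigr] = 0
  \quad\Longrightarrow\quad c^{*} = \frac{1}{1-\theta}, \\
\Gamma_{\max} &= \Gamma(c^{*}) = \frac{G}{L}\cdot\frac{1}{(1-\theta)\,e} .
\end{align*}
```

Under this model the number of islands peaks at a coverage slightly above 1 and declines beyond it, and a stricter overlap requirement (larger θ) shifts the peak to higher coverage.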

The expected number of singletons is:

Number of singletons = N × e^{-2(1-θ)c}
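Both formulas can be checked against a small Monte Carlo simulation. The sketch below makes simplifying assumptions that are not stated in the source: a linear genome, inserts of identical length, clone positions drawn uniformly at random, and illustrative parameter values (G = 10⁶ bp, L = 10⁴ bp, N = 300, θ = 0.2). The function names are hypothetical.

```python
import math
import random

def simulate_islands(n_clones, insert_bp, genome_bp, theta, rng):
    """Place clones at random and return (islands, singletons) for one library."""
    starts = sorted(rng.uniform(0, genome_bp - insert_bp) for _ in range(n_clones))
    # Adjacent clones overlap by L - (distance between starts); the overlap is
    # detectable only if it is at least theta * L.
    max_step = (1 - theta) * insert_bp
    joined = [b - a <= max_step for a, b in zip(starts, starts[1:])]
    islands = 1 + joined.count(False)
    singletons = 0
    for i in range(n_clones):
        left = i > 0 and joined[i - 1]
        right = i < n_clones - 1 and joined[i]
        if not left and not right:
            singletons += 1
    return islands, singletons

def mean_counts(trials, n_clones, insert_bp, genome_bp, theta, seed=0):
    """Average island and singleton counts over many simulated libraries."""
    rng = random.Random(seed)
    total_islands = total_singletons = 0
    for _ in range(trials):
        isl, sing = simulate_islands(n_clones, insert_bp, genome_bp, theta, rng)
        total_islands += isl
        total_singletons += sing
    return total_islands / trials, total_singletons / trials

N, L, G, THETA = 300, 1e4, 1e6, 0.2   # illustrative values only
c = N * L / G
sim_islands, sim_singletons = mean_counts(200, N, L, G, THETA)
print("islands:    simulated", round(sim_islands, 1),
      "formula", round(N * math.exp(-(1 - THETA) * c), 1))
print("singletons: simulated", round(sim_singletons, 1),
      "formula", round(N * math.exp(-2 * (1 - THETA) * c), 1))
```

Because all inserts have the same length in this sketch, checking only neighbouring clones (after sorting by start position) is enough to decide which clones belong to the same island.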

Reality Check Example (Kohara et al., 1987): A complete restriction map of E. coli was produced using 1025 lambda clones (L = 1.55 × 10⁴ bp, G = 4.7 × 10⁶ bp). Coverage c = 3.38 and θ ≈ 0.19 (based on six consecutive sites from eight enzymes with ~32 sites per insert).

Calculated: ~66.3 islands and ~4.3 singletons. Experimental: 70 islands, of which 7 were singletons.
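The calculated values can be reproduced directly from the two formulas and the parameters quoted above. A minimal sketch, with function names chosen for illustration:

```python
import math

def expected_islands(n_clones, insert_bp, genome_bp, theta):
    """Gamma = (G/L) * c * exp(-(1 - theta) * c), which equals N * exp(-(1 - theta) * c)."""
    c = n_clones * insert_bp / genome_bp
    return n_clones * math.exp(-(1 - theta) * c)

def expected_singletons(n_clones, insert_bp, genome_bp, theta):
    """Expected singletons = N * exp(-2 * (1 - theta) * c)."""
    c = n_clones * insert_bp / genome_bp
    return n_clones * math.exp(-2 * (1 - theta) * c)

# Parameters quoted for the E. coli lambda library
N, L, G, THETA = 1025, 1.55e4, 4.7e6, 0.19
print(f"coverage   = {N * L / G:.2f}")                            # 3.38
print(f"islands    = {expected_islands(N, L, G, THETA):.1f}")     # 66.3
print(f"singletons = {expected_singletons(N, L, G, THETA):.1f}")  # 4.3
```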

The close agreement between the predicted and observed numbers supports the model. Similar early success was reported by Olson et al. (1986).

**Figure:** Expected number of genomic **islands** as a function of genome coverage for different values of the fractional overlap (θ) required to reliably detect overlaps between cloned DNA fragments. Increasing genome coverage generally reduces the number of islands by improving clone connectivity, whereas higher overlap detection thresholds (θ) increase the number of islands because fewer clone overlaps are recognized.

Conclusion

Building contigs from cloned genome fragments starts with library construction and a coverage calculation, proceeds through restriction mapping and overlap detection, and uses Poisson statistics to monitor assembly progress. The result is a set of ordered contigs and a minimal tiling path that together provide a physical map of the genome. Random clones account for most of this map efficiently, while the last few gaps are better closed by directed strategies.
