This module introduces key concepts in bioinformatics, focusing on Multiple Sequence Alignment (MSA) and Expressed Sequence Tags (ESTs). You will learn how biological sequences are aligned to identify conserved regions, infer evolutionary relationships, and predict structure and function. The content also covers major MSA methods, tools like ClustalW and T-Coffee, and explains how ESTs are generated and used for gene discovery, genome annotation, and functional genomics.
MULTIPLE SEQUENCE ALIGNMENT (MSA)
Multiple Sequence Alignment (MSA) is a technique used to align three or more biological sequences (such as DNA, RNA, or protein sequences) in order to identify regions of similarity that may indicate functional, structural, or evolutionary relationships between the sequences.
The basic concept of MSA involves arranging the sequences in a way that maximizes the number of matched characters (nucleotides or amino acids) while minimizing the number of gaps (insertions or deletions) introduced to achieve the alignment.
The resulting alignment of MSA can be used for two purposes:
- To find regions of similarity shared by all of the sequences, which define a conserved consensus pattern or domain.
- If the alignment is particularly strong, to use the aligned positions to derive the possible evolutionary relationships among the sequences.
Optimal MSA: The alignment, out of all possible alignments of a group of sequences, that has the maximum number of conserved columns and the minimum number of variable and gap-containing columns.
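The column-based notion of an optimal alignment can be illustrated with a minimal Python sketch. The function name `classify_columns` and the use of "-" as the gap character are assumptions made for this example, not part of any particular tool.

```python
def classify_columns(alignment):
    """Count conserved, variable, and gap-containing columns.

    `alignment` is a list of equal-length strings; "-" marks a gap.
    """
    counts = {"conserved": 0, "variable": 0, "gapped": 0}
    for column in zip(*alignment):
        if "-" in column:
            counts["gapped"] += 1        # at least one gap in this column
        elif len(set(column)) == 1:
            counts["conserved"] += 1     # every sequence has the same residue
        else:
            counts["variable"] += 1      # no gaps, but residues differ
    return counts


example = ["ACG-T", "ACGAT", "ATG-T"]
print(classify_columns(example))
```

Under this simple criterion, one alignment is preferred over another when it has more conserved columns and fewer variable or gapped ones.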
There are several factors which affect the analysis of conserved regions in a set of sequences such as:
- Number of sequences included in the analysis
- The ratio of the number of almost identical sequences to the number of more distantly related sequences.
Multiple alignments are computationally much more difficult than pairwise alignments.
The primary goals of MSA include:
- Identification of homologous sequences
- Detection of conserved regions
- Structural prediction
- Functional inference
- Phylogenetic analysis
- Identification of conserved domains
- Identification of conserved regions in promoters
Methods of MSA
1. Progressive Alignment Method
The progressive method, also known as the tree-based algorithm, assembles a multiple alignment step by step from pairwise similarities, which is where the name comes from. It was one of the first algorithms for constructing MSAs and remains the most commonly used approach.
Progressive alignment is a heuristic: it is fast and can align thousands of sequences, but it is not guaranteed to find the optimal alignment. The aim is to obtain the MSA with the highest possible score.
The basic idea behind progressive alignment is to align the closely related sequences first and then use those alignments as a reference to align more distantly related sequences. This process is repeated until all the sequences in the MSA are aligned. Clustal and T-Coffee are two well-known progressive alignment programs.
Steps:
- Pairwise Alignment: The process begins by aligning pairs of sequences to create a similarity score matrix, which indicates the similarity between each pair of sequences. This is typically done using algorithms such as Needleman-Wunsch or Smith-Waterman for global or local alignment, respectively.
- Guide Tree Construction: Based on the similarity score matrix, a guide tree (also known as a guide phylogenetic tree or guide alignment tree) is constructed. The guide tree represents the evolutionary relationships between sequences and serves as a guide for the progressive alignment process. Common methods for guide tree construction include neighbor-joining and UPGMA (Unweighted Pair Group Method with Arithmetic Mean).
- Progressive Alignment: The progressive alignment algorithm uses the guide tree to align sequences in a step-by-step manner, starting from the most closely related sequences and gradually adding more distant sequences to the alignment. The process involves the following steps:
- Initialization: The two most closely related sequences are aligned based on the guide tree.
- Extension: The alignment is progressively extended to include additional sequences, with gaps introduced to maximize alignment scores based on the guide tree.
- Profile Construction: As sequences are added to the alignment, a profile is constructed for each column in the alignment, representing the frequencies of each residue at that position.
- Profile-Profile Alignment: When aligning a new sequence to the existing alignment, the profile of the new sequence is compared to the profiles of the aligned sequences to find the best alignment.
- Refinement: After the progressive alignment is completed, it can be refined with iterative methods to improve accuracy, such as the refinement stages used by MAFFT and MUSCLE.
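The first step above, pairwise alignment feeding a distance matrix, can be sketched as follows. This is a minimal Needleman-Wunsch implementation with a simple match/mismatch/linear-gap scheme; the scoring values and the identity-based distance are assumptions chosen for illustration, not the parameters of any particular program.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment of two sequences; returns (aligned_a, aligned_b, score)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill the dynamic programming table.
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    i, j = len(a), len(b)
    out_a, out_b = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[-1][-1]


def distance_matrix(seqs):
    """Pairwise distances defined here as 1 - fractional identity (an assumption)."""
    n = len(seqs)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            aln_a, aln_b, _ = needleman_wunsch(seqs[i], seqs[j])
            identity = sum(x == y for x, y in zip(aln_a, aln_b)) / len(aln_a)
            dist[i][j] = dist[j][i] = 1.0 - identity
    return dist


print(needleman_wunsch("ACGT", "AGT"))
```

The resulting matrix is the input for guide tree construction; the closest pair in it is the natural starting point for the progressive stage.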
[Figure: Progressive alignment diagram]
2. Iterative Refinement Method
Iterative methods are a class of algorithms used in multiple sequence alignment (MSA) to improve the accuracy of alignments by iteratively refining the initial alignment. This method is particularly useful for aligning distantly related sequences where traditional progressive alignment methods may not be accurate.
Iterative refinement is a common way to improve the quality of an alignment. The iterative method was first developed by Barton and Sternberg in 1987. It builds on the progressive alignment method and tries to correct the errors that progressive alignment produces.
In this method an initial alignment is generated (by the progressive method), then one sequence or a set of sequences is taken out and realigned to a profile of the remaining sequences. If this step improves the alignment score, the new alignment is kept, and another sequence is chosen for removal and realignment. This is repeated until the score no longer improves.
Iterative refinement is a key step in the process of MSA, as it allows for the improvement of the alignment and identification of the most accurate and biologically relevant alignment of the sequences.
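The remove-and-realign loop can be sketched in Python as below. This is a toy under stated assumptions: the realignment step (`realign_by_shifting`, a hypothetical helper) only slides the ungapped sequence along the alignment with terminal gaps, whereas real tools realign against a profile, and the score is a simple sum-of-pairs with match +1, mismatch -1, gap -1.

```python
from itertools import combinations


def sp_score(alignment, match=1, mismatch=-1, gap=-1):
    """Simple sum-of-pairs score over all columns (gap-gap pairs score 0)."""
    total = 0
    for column in zip(*alignment):
        for a, b in combinations(column, 2):
            if a == "-" and b == "-":
                continue
            if a == "-" or b == "-":
                total += gap
            else:
                total += match if a == b else mismatch
    return total


def realign_by_shifting(rest, seq, index):
    """Toy realignment: try every offset for `seq` and keep the best-scoring one."""
    width = len(rest[0])
    best = None
    for offset in range(width - len(seq) + 1):
        row = "-" * offset + seq + "-" * (width - len(seq) - offset)
        candidate = rest[:index] + [row] + rest[index:]
        if best is None or sp_score(candidate) > sp_score(best):
            best = candidate
    return best


def iterative_refine(alignment, max_rounds=10):
    """Remove each sequence in turn, realign it, and keep strict improvements."""
    best = list(alignment)
    best_score = sp_score(best)
    for _ in range(max_rounds):
        improved = False
        for i in range(len(best)):
            removed = best[i].replace("-", "")
            rest = best[:i] + best[i + 1:]
            candidate = realign_by_shifting(rest, removed, i)
            candidate_score = sp_score(candidate)
            if candidate_score > best_score:
                best, best_score = candidate, candidate_score
                improved = True
        if not improved:      # stop when a full pass brings no gain
            break
    return best, best_score


print(iterative_refine(["ACGT--", "ACGT--", "--ACGT"]))
```

Accepting only strict improvements guarantees that the loop terminates, at the cost of possibly stopping in a local optimum.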
Steps:
- Initialization: The iterative process begins with an initial alignment, which can be generated using a progressive alignment algorithm or any other alignment method.
- Profile Construction: A profile is constructed from the initial alignment, representing the frequencies of each residue at each position in the alignment. Profiles capture the conservation patterns in the alignment and are used to guide the alignment process.
- Sequence Realignment: Sequences are aligned to the current alignment using the profile constructed in the previous step. This step may involve aligning sequences that were not included in the initial alignment or realigning sequences that were poorly aligned in the initial alignment.
- Scoring and Evaluation: After realignment, the quality of the alignment is evaluated using a scoring function. Common scoring functions include sum-of-pairs scores, which add up the substitution and gap scores of every pair of aligned residues in each column, and column scores, which measure the conservation of columns in the alignment.
- Iterative Refinement: The realignment and evaluation steps are repeated iteratively until a stopping criterion is met. This criterion may be a maximum number of iterations, a convergence threshold for alignment scores, or a maximum change in the alignment from one iteration to the next.
- Consensus Alignment: Finally, a consensus alignment is generated from the iterations, typically by taking the most common residue at each position in the alignment or by using a probabilistic model to combine the alignments.
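The profile construction and consensus steps listed above can be sketched as a per-column frequency table. The function names are illustrative assumptions; real profiles usually also apply sequence weighting and pseudocounts, which are omitted here.

```python
from collections import Counter


def build_profile(alignment):
    """Return one {residue: frequency} dict per alignment column (gaps included)."""
    profile = []
    for column in zip(*alignment):
        counts = Counter(column)
        profile.append({res: n / len(column) for res, n in counts.items()})
    return profile


def consensus(alignment):
    """Most common character in each column (ties go to the first one seen)."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*alignment))


alignment = ["ACGT", "ACGA", "AGGT"]
print(build_profile(alignment)[1])
print(consensus(alignment))
```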
[Figure: Iterative refinement diagram]
3. Sum of Pair Method
The sum-of-pairs (SP) score of an MSA is the sum of the scores of all pairwise alignments induced by that MSA. The score can be optimized by dynamic programming, but instead of aligning two sequences at a time, three or more sequences must be aligned simultaneously, which becomes computationally expensive as the number of sequences grows.
The sum-of-pairs method serves as a scoring function for evaluating the quality of a multiple alignment. It is based on the idea that the overall quality of an alignment is the sum of the scores of the pairwise alignments of each pair of sequences in it. A higher sum-of-pairs score indicates a better alignment.
Steps of Sum of Pair Method:
- Select scoring matrix
- Calculate the score for each pair of aligned residues
- Sum the scores over all pairs of sequences and all columns of the MSA
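These three steps can be sketched in Python. The toy scoring matrix (match +1, mismatch -1) and the gap penalty of -2 are assumptions for illustration; in practice a matrix such as BLOSUM or PAM would be selected for proteins.

```python
from itertools import combinations


def toy_matrix(alphabet, match=1, mismatch=-1):
    """Step 1: build a simple substitution matrix as a dict of residue pairs."""
    return {(a, b): (match if a == b else mismatch)
            for a in alphabet for b in alphabet}


def sum_of_pairs(alignment, substitution, gap_penalty=-2):
    """Steps 2 and 3: score every residue pair in every column and add them up."""
    total = 0
    for column in zip(*alignment):
        for a, b in combinations(column, 2):
            if a == "-" and b == "-":
                continue                  # two gaps facing each other score 0
            if a == "-" or b == "-":
                total += gap_penalty      # residue aligned against a gap
            else:
                total += substitution[a, b]
    return total


matrix = toy_matrix("ACGT")
print(sum_of_pairs(["AC-", "ACG", "AT-"], matrix))
```

In the example, the first column contributes 3 (three matching pairs), the second -1, and the third -4, for a total of -2.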
Tools in MSA
ClustalW
ClustalW is one of the earliest and most widely used progressive alignment programs. Introduced in 1994, it is an improvement on the original CLUSTAL program. It combines the progressive approach with a series of heuristics to improve alignment quality and efficiency: sequences are first aligned in pairs, and the resulting alignments are then progressively combined into a final multiple alignment.
ClustalW has been widely used for aligning (3 or more) nucleotide and protein sequences, essential for studying evolutionary relationships and identifying conserved regions across a set of related sequences. It is fast and efficient, but may not always produce the most accurate alignments.
Working of ClustalW:
- First it performs all possible pairwise alignments between each pair of sequences.
- Calculates the distance between each pair of sequences based on these isolated pairwise alignments.
- Generates a distance matrix.
- Generates a neighbour-joining guide tree from these pairwise distances.
- This guide tree gives the order in which the progressive alignment will be carried out.
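The guide tree step can be sketched with a compact neighbour-joining routine that returns only the join order, since that order is what drives the progressive alignment. This is a simplified illustration, not ClustalW's own code: branch lengths and sequence weights are omitted, and the five-taxon distance matrix is a made-up example.

```python
def neighbor_joining(names, dist):
    """Return the list of joins (as nested tuples) chosen by neighbour-joining."""
    nodes = list(names)
    d = {}
    for i, a in enumerate(nodes):
        for j, b in enumerate(nodes):
            d[a, b] = dist[i][j]

    joins = []
    while len(nodes) > 2:
        n = len(nodes)
        total = {a: sum(d[a, k] for k in nodes) for a in nodes}
        # Pick the pair minimising Q(a, b) = (n - 2) * d(a, b) - total(a) - total(b).
        pairs = [(a, b) for i, a in enumerate(nodes) for b in nodes[i + 1:]]
        f, g = min(pairs, key=lambda p: (n - 2) * d[p] - total[p[0]] - total[p[1]])
        new = (f, g)
        joins.append(new)
        # Distance from the new internal node to every remaining node.
        for k in nodes:
            if k != f and k != g:
                d[new, k] = d[k, new] = (d[f, k] + d[g, k] - d[f, g]) / 2
        d[new, new] = 0
        nodes = [k for k in nodes if k != f and k != g] + [new]

    joins.append(tuple(nodes))  # the last two nodes are joined at the root
    return joins


names = ["a", "b", "c", "d", "e"]
dist = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
for step in neighbor_joining(names, dist):
    print(step)
```

Each printed tuple is one merge; aligning sequences or sub-alignments in that order reproduces the progressive schedule described above. Ties in Q are broken by enumeration order here, which is an arbitrary choice.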
Limitations:
- ClustalW may not always produce the most accurate alignments, especially for highly divergent sequences.
- ClustalW uses a progressive alignment approach, which can propagate errors made in the early stages to the final alignment.
- ClustalW can be slow for large datasets. It doesn’t scale well with the increasing number of sequences or sequence length, making it less suitable for very large datasets.
- Compared to newer alignment tools, ClustalW lacks advanced features like iterative refinement, which can improve alignment quality by re-aligning sequences multiple times.
T-Coffee
T-Coffee stands for Tree-based Consistency Objective Function for Alignment Evaluation. It is a progressive alignment algorithm that uses a consistency-based approach: it first aligns sequences in a pairwise manner, then follows a guide tree to construct the final multiple sequence alignment, checking at each step how consistent the alignment is with the pairwise alignments. Combining information from pairwise and multiple alignments in this way enhances the accuracy of the final alignment. It can produce highly accurate alignments and is suitable for aligning diverse sequences, including those with complex evolutionary relationships, although very large sequence sets are costly (see the limitations below).
T-Coffee has two main features:
- First, it uses heterogeneous data sources to generate multiple alignments. The data from these sources are provided to T-Coffee via a library of pairwise alignments. Thus, T-Coffee can compute MSAs using a library that was generated from a mixture of local and global pairwise alignments.
- Second, it carries out progressive alignment in a way that allows it to consider the alignment between all of the pairs during the generation of the MSA. This gives it the speed of a traditional progressive alignment but with far less tendency to misalign.
T-Coffee Algorithm Steps:
- Create the pairwise alignments
- Calculate the similarity matrix
- Build the multiple progressive alignment following the tree, but taking into account the information from the pairwise alignments
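How the pairwise information is taken into account can be sketched with a simplified version of library extension, the consistency idea behind T-Coffee: the weight of a residue pair is increased whenever a third sequence supports it. The data layout (residues as `(sequence_name, position)` tuples, weights in a dict) and the function name are assumptions for this example, and the weights are made-up numbers.

```python
from collections import defaultdict


def extend_library(primary):
    """Add triplet support to each residue pair.

    `primary` maps (residue_x, residue_y) -> weight, where a residue is a
    (sequence_name, position) tuple and each pair is listed once.
    """
    partners = defaultdict(dict)
    for (x, y), w in primary.items():
        partners[x][y] = w
        partners[y][x] = w

    extended = defaultdict(float)
    for (x, y), w in primary.items():
        extended[frozenset((x, y))] += w          # direct evidence

    # Every residue z that is aligned to both x and y adds min(W(x,z), W(z,y)).
    for z, linked in partners.items():
        items = list(linked.items())
        for i, (x, wx) in enumerate(items):
            for y, wy in items[i + 1:]:
                if x[0] != y[0]:                  # x and y must be in different sequences
                    extended[frozenset((x, y))] += min(wx, wy)
    return dict(extended)


primary = {
    (("A", 0), ("B", 0)): 88,
    (("A", 0), ("C", 0)): 77,
    (("C", 0), ("B", 0)): 100,
}
for pair, weight in extend_library(primary).items():
    print(sorted(pair), weight)
```

The extended weights, rather than a generic substitution matrix, are then used to score residue pairs during the progressive stage, which is why early mistakes are less likely.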
Limitations:
- Computational Intensity: T-Coffee can be computationally intensive, especially when aligning a large number of sequences. This can lead to longer processing times.
- Memory Usage: The tool requires significant memory resources, which can be a limitation for users with less powerful hardware.
- Sequence Limit: Some T-Coffee web servers accept at most 500 sequences or a maximum file size of 1 MB. This might be restrictive for very large datasets.
- Complexity: The tool has a steep learning curve due to its numerous options and parameters, which can be overwhelming for beginners.
- Platform Dependency: While T-Coffee runs on Unix-like platforms, it does not natively support Windows without additional software like Cygwin.
Clustal Omega
Clustal Omega is an advanced and highly efficient tool for multiple sequence alignment. It is the successor to ClustalW and Clustal X and was designed to overcome many of their limitations, offering improved speed and scalability. It uses a series of heuristics and techniques such as k-means clustering to improve alignment quality and handle large datasets efficiently.
Algorithm:
- Initial Pairwise Alignments: Clustal Omega starts by performing fast pairwise alignments using a k-tuple heuristic, which is similar to the approach used by other sequence alignment tools. This step involves finding matching k-tuples (subsequences of length k) between pairs of sequences to quickly estimate their similarity.
- Distance Matrix Calculation: Based on the initial pairwise alignments, Clustal Omega calculates a distance matrix that represents the evolutionary distances between all pairs of sequences. This matrix is essential for the subsequent steps of the algorithm.
- Guide Tree Construction: Clustal Omega constructs a guide tree using methods like UPGMA (Unweighted Pair Group Method with Arithmetic Mean) or Neighbor-Joining. This tree guides the progressive alignment process, determining the order in which sequences are aligned.
- Sequence Alignment Using HHalign: One of the distinguishing features of Clustal Omega is its use of profile hidden Markov models (HMMs). Sequences are clustered and each cluster is represented by a profile HMM. Clustal Omega employs HHalign, a part of the HH-suite, to align these profile HMMs. This method enhances the alignment quality, particularly for large and diverse datasets, by effectively capturing the statistical properties of sequence profiles.
- Progressive Alignment: The guide tree is used to progressively align sequences and profiles. Clustal Omega aligns profiles iteratively, starting with the most similar sequences and gradually incorporating more divergent ones.
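The k-tuple idea in the first step can be illustrated with a short sketch. The distance below (one minus the fraction of shared k-tuples) is a simplified stand-in chosen for this example and is not Clustal Omega's exact formula; the function names are likewise assumptions.

```python
def ktuples(seq, k):
    """Set of all subsequences of length k in `seq`."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def ktuple_distance(a, b, k=3):
    """Rough distance: 1 - (shared k-tuples / k-tuples in the smaller set)."""
    ka, kb = ktuples(a, k), ktuples(b, k)
    if not ka or not kb:
        return 1.0                       # a sequence shorter than k shares nothing
    return 1.0 - len(ka & kb) / min(len(ka), len(kb))


print(ktuple_distance("ACGTAC", "ACGTTT"))
print(ktuple_distance("ACGTAC", "ACGTAC"))
```

Because no dynamic programming is involved, such a measure is cheap to compute for every pair, which is the point of using k-tuples before the more expensive alignment stages.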
MAFFT
MAFFT stands for Multiple Alignment using Fast Fourier Transform. It is popular for its speed and accuracy. It uses an iterative refinement approach, starting with an initial progressive alignment and then refining it iteratively. It is particularly effective for aligning large sets of sequences and handling divergent sequences.
MUSCLE
MUSCLE stands for Multiple Sequence Comparison by Log-Expectation. It is a progressive alignment algorithm known for its speed and accuracy. It uses a progressive alignment approach along with a refinement stage to improve alignment quality.
BLOCKS
BLOCKS is used to identify and align blocks of conserved sequence within a set of sequences being aligned. Blocks are short, multiply aligned, ungapped segments corresponding to the most highly conserved regions of proteins. They are often referred to as “sequence blocks” or homology blocks. The associated tools can compare a protein sequence against a database of protein blocks, retrieve blocks, and create new blocks.
A common tool for finding such conserved segments is BLAST, which is fast and efficient at identifying sequence similarity between biological sequences. BLAST searches a database of sequences for matches to a given query sequence and aligns the query to each match, which highlights conserved stretches of sequence. BLAST itself produces pairwise local alignments rather than a multiple alignment.
Applications of MSA
- Phylogenetic Analysis: MSA is essential for reconstructing evolutionary relationships between species or genes. By aligning homologous sequences from different organisms, researchers can infer the evolutionary history and relatedness of species.
- Functional Annotation: MSA helps in annotating the function of genes and proteins by identifying conserved regions. Conserved amino acid residues often indicate functional importance, such as active sites in enzymes or binding sites in proteins.
- Structural Biology: MSA is used in predicting protein structures by identifying conserved regions that may correspond to structural motifs or domains. Aligning protein sequences can reveal evolutionary constraints that influence protein folding and structure.
- Drug Discovery: MSA is used in comparative genomics to identify potential drug targets. Conserved regions in proteins that are specific to pathogens can be targeted for drug development.
- Sequence Motif Discovery: MSA is used to identify sequence motifs, which are short, conserved sequences that are important for protein function or regulation. Motifs can be regulatory elements, protein binding sites, or structural features.
- Evolutionary Studies: MSA is used to study the evolution of genes and proteins by comparing sequences from different species. It can reveal patterns of conservation and divergence that are important for understanding evolutionary processes.
- Functional Genomics: MSA is used in functional genomics to compare gene expression patterns across species. Aligning sequences can help identify conserved regulatory elements that control gene expression.
- Extrapolation: It is used to check whether an uncharacterised sequence is really a member of a protein family or not.
- Protein Family: A multiple sequence alignment can help us to decide whether our protein is a member of a known protein family or not.
- DNA Regulatory Elements: We can use MSA to locate DNA regulatory elements such as binding sites.
- PCR Analysis: A good MSA can help us to identify the least degenerate portions of a protein family, so that primers can be designed to find new members by PCR.
EXPRESSED SEQUENCE TAGS (ESTs)
Expressed sequence tags (ESTs) are short DNA sequences (200 to 500 nucleotides) derived from complementary DNA (cDNA), which is made from messenger RNA (mRNA) molecules. They can be used to identify genes that are being expressed in a cell at a particular time. Because they come from processed mRNA, ESTs represent portions of expressed genes without the introns. They can be used for gene discovery, gene mapping, transcriptome analysis and microarray design, and as markers for genome mapping. EST sequencing is a quick method that gives information about the diversity of genes expressed.
The steps involved in generating and utilizing ESTs include:
- mRNA Isolation: Extracting mRNA from the target tissue or cells. mRNA represents the transcribed and processed form of genes and is used as a template for cDNA synthesis.
- cDNA Synthesis: Reverse transcription of mRNA into complementary DNA (cDNA) using reverse transcriptase. This process converts the RNA into a stable DNA form, preserving the gene information.
- cDNA Cloning: The cDNA is then cloned into vectors, typically plasmids, creating a library of cDNA clones. Each clone represents a segment of the expressed genes in the original tissue.
- Single-Pass Sequencing: DNA sequencing is performed on each cDNA clone, generating short sequences known as Expressed Sequence Tags (ESTs). These tags are typically a few hundred base pairs long and represent a snapshot of the expressed genes.
- EST Analysis: ESTs are then analyzed to identify potential genes and their functions. EST data can be compared to known sequences in databases to determine the likely identity and function of the expressed genes.
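The sequence-level effect of the cDNA synthesis and single-pass sequencing steps can be sketched in Python. This is a toy model under stated assumptions: the function names and the default read length of 300 bases are illustrative, and sequencing errors, vector sequence, and poly(A) tails are ignored.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_transcribe(mrna):
    """First-strand cDNA (written 5' to 3') made from an mRNA sequence."""
    dna = mrna.upper().replace("U", "T")
    return dna.translate(COMPLEMENT)[::-1]


def single_pass_reads(cdna_insert, read_length=300):
    """Simulate one read from each end of a cloned cDNA insert.

    The 5' EST is the start of the insert; the 3' EST is read from the
    opposite strand, so it is the start of the reverse complement.
    """
    five_prime = cdna_insert[:read_length]
    three_prime = cdna_insert.translate(COMPLEMENT)[::-1][:read_length]
    return five_prime, three_prime


print(reverse_transcribe("AUGGCC"))
print(single_pass_reads("ATGGCCTTAA", read_length=4))
```

Since each clone is sequenced only once and only from its ends, an EST usually covers part of a transcript rather than its full length.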
[Figure: EST generation diagram]
Importance of ESTs
Expressed Sequence Tags (ESTs) hold several important roles in genomics and molecular biology research:
- Gene Discovery: ESTs are a powerful tool for discovering new genes. By sequencing and analyzing ESTs, researchers can identify novel genes and gain insights into the genetic makeup of an organism.
- Functional Annotation: ESTs contribute to the functional annotation of genomes. They provide valuable information about the expressed genes, their structures, and potential functions. This aids in understanding the biological processes and pathways within an organism.
- Tissue-Specific Expression: ESTs help identify genes that are specifically expressed in certain tissues or under particular conditions. This information is crucial for understanding the regulation of gene expression and the specialization of tissues.
- Marker Development: ESTs can be used to develop molecular markers, such as microsatellites or single nucleotide polymorphisms (SNPs). These markers are essential for genetic mapping, population genetics, and marker-assisted breeding in agriculture.
- Comparative Genomics: By comparing ESTs across different species, researchers can identify conserved genes and infer evolutionary relationships. This comparative genomics approach provides insights into the shared functions and evolutionary history of genes.
- Drug Discovery: ESTs contribute to the identification of potential drug targets. Understanding the genes expressed in specific disease-related tissues can guide the development of therapeutic interventions.
- Diagnostic and Prognostic Markers: ESTs associated with diseases or specific conditions can serve as diagnostic or prognostic markers. They can help identify patterns of gene expression associated with particular diseases, aiding in early detection or predicting disease outcomes.
- Genome Annotation: ESTs play a crucial role in annotating genomic sequences. They provide experimental evidence for the existence and structure of genes, improving the accuracy of genome annotations.
- Evolutionary Studies: Analysis of ESTs from different species helps in studying the evolution of genes and gene families. Understanding how gene expression has evolved contributes to our knowledge of the diversity of life.



