Unit 1 - Biological Database Management (804) (IMSC BIOINFORMATICS) Full Notes

Unit 4 – Biological Database Management (804) (IMSC BIOINFORMATICS) Full Notes

33 min read 7,055 words

This module introduces the core concepts of sequence alignment and its central role in bioinformatics. You will learn how DNA, RNA, and protein sequences are compared to identify similarities, conserved regions, and evolutionary relationships. The content explores different types of alignments, scoring matrices such as PAM and BLOSUM, and widely used tools like BLAST and FASTA. It also covers advanced algorithms including Smithโ€“Waterman and PSI-BLAST, along with concepts like PSSM, helping you understand how sequence analysis supports functional prediction, evolutionary studies, and modern biological research.

SEQUENCE ALIGNMENT

Sequence Alignment refers to the process of comparing two or more sequences to find out the regions of higher similarity or evolutionary conserved regions between the query sequences.

In bioinformatics, a sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences.

Sequence alignment helps in understanding evolutionary relationships, identifying conserved regions, and predicting functional elements.

Sequence Alignment and Analysis is the core of many Bioinformatics applications and analyses and to extract the substantial amount of information from these sequence alignments special skills are required from the analytics.

Sequence analysis refers to computational analysis of the sequences of amino acids or nucleotide residues to extract the knowledge about the properties, structure, biological functions and evolutionary diversity of proteins or DNA/RNA molecules, respectively.

In bioinformatics Sequence alignment is done for:

  1. To check the relatedness of two or more sequence.
  2. To check the function of the sequence.
  3. To check the mutation of the sequence.
  4. To find the conserved regions of a given sequence.
  5. To determine the evolutionary relationship.
  6. To Predict new members of gene family.
  7. It can be used to predict the location and function of protein-coding and transcription regulation in genomic DNA.
  8. To find structurally or functionally similar regions within proteins.

Types of Alignments

On the basis of numbers of sequences to be aligned, sequence alignment is of two types:

Pairwise Sequence Alignment (Alignment of two sequences) โ€” The side by side arrangement of a pair of sequence to find out the relatedness among them is known as pairwise sequence alignment. Pairwise sequence alignment methods are used to find the best-matching piecewise (local) or global alignments of two query sequences. The main objective of Pairwise alignment is to obtain the highest possible score which indicates the degree of similarity between the two sequences. The optimality of the alignment is based on the score (scoring scheme).

Multiple Sequence Alignment (Alignment of more than 3 sequences) โ€” It is the process to aligning three or more sequences to find out the relatedness among them. Multiple sequence alignments are used to identify conserved sequence regions and to construct phylogenetic trees, which help us understand the functional and evolutionary relationships between different species or groups of organisms.

On the basis of alignment algorithms, it can be divided into:

Local Alignment โ€” It finds local regions with the highest level of similarity between the two sequences. In local alignment the length of sequences may not equal. It does not count all the matches in sequences. It only searches for the conserved sequence only. Local alignment is suitable for aligning more divergent sequences or distantly related sequences. It aims in searching similarities in large sequences and identifying conserved regions (Domains & Motifs) within two proteins. The Smithโ€“Waterman algorithm is a general local alignment method based on dynamic programming.

Global Alignment โ€” In global alignment, an attempt is made to align the entire sequence (end to end alignment). If two sequences have approximately the same length and are quite similar, they are suitable for global alignment. It counts all the matches of characters. It attempts to align the maximum of the entire sequence. It is suitable for aligning two closely related sequences. Usually done for comparing homologous genes. A general global alignment technique is the Needlemanโ€“Wunsch algorithm, which is based on dynamic programming.

Applications of Sequence Alignment

  1. Sequence alignment can identify unknown sequences by comparing them with already known sequences in databases.
  2. Sequence alignment is also used to identify conserved sequence patterns and motifs, which helps to characterize the functions of the sequences.
  3. Sequence alignment can also produce phylogenetic trees and obtain information about the evolutionary relationship between the sequences aligned.
  4. Sequence alignment can also predict proteins’ secondary and tertiary structures. It can also predict gene locations and new members of gene families.
  5. Sequence alignment can also be used to develop degenerate PCR primers by analyzing multiple related sequences.
  6. Sequence alignment assists in identifying potential drug targets by comparing protein sequences of disease-related genes. It helps in understanding the functional implications of genetic variations and mutations associated with diseases. Sequence alignment is also used in virtual screening, where it aids in identifying drug candidates by aligning them with target protein sequences.
  7. Sequence alignment is utilized in personalized medicine to analyze an individual’s genetic variation and identify disease-causing mutations or genetic predispositions.

SCORING MATRICES

  • The scoring matrix are also called substitution matrix.
  • Scoring matrices are the scores for matches, penalty for mismatches and gaps represented in a tabular form.
  • The main aim of scoring matrix is to obtain the optimal alignment between biological sequences.
  • Optimal alignment helps to detect homology relationship between the biological sequence.
  • The scoring matrix used for finding optimal alignment between protein sequences are complex than that of scoring matrix used for nucleotide sequence alignment.
  • For DNA sequence it is a 4×4 matrix and for protein sequence it is a 20×20 matrix where diagonal of scoring matrix represents the score for matches and other non-diagonal elements represents the score for mismatches.

Example of DNA Scoring Matrix:

  • In the first matrix (Identity matrix), only matches are given a positive score of 3 and all others are -1.
  • In the second matrix (Transition-transversion matrix), it shows different penalties for transition and transversion mutations.
Identity Matrix
ATGC
A3-1-1-1
T-13-1-1
G-1-13-1
C-1-1-13
Transition-Transversion Matrix
ATGC
A3-2-1-2
T-23-1-1
G-1-13-2
C-2-1-23

The two well-known families of scoring matrix are:

  1. PAM (Point Accepted Mutation)
  2. BLOSUM (Block Substitution Matrix)

The choice of scoring matrix can affect the outcome of an alignment.

Scoring matrices are essential for:

  • Identifying the conserved regions of proteins
  • Predicting the function of unknown proteins
  • Understanding the evolutionary history of genes

Applications of Scoring Matrix in Sequence Alignment

Pairwise Sequence Alignment โ€” Scoring matrices are extensively used in pairwise sequence alignment algorithms like Smith-Waterman and Needleman-Wunsch. These algorithms compare two sequences and assign scores based on the similarity of amino acids (in proteins) or nucleotides (in DNA/RNA) using the values defined in the scoring matrix. The scores ultimately determine the best alignment between the sequences.

Database Searches (e.g., BLAST) โ€” Tools like BLAST (Basic Local Alignment Search Tool) use scoring matrices to identify similar sequences in databases. By assigning scores to alignments based on the sequence similarity, BLAST can efficiently identify homologous sequences by comparing query sequences against a database of known sequences.

Detecting Evolutionary Relationships โ€” Scoring matrices reflect evolutionary relationships between sequences. Matrices like BLOSUM (Blocks Substitution Matrix) and PAM (Point Accepted Mutation) are constructed based on observed frequencies of amino acid substitutions in related sequences. They can therefore be used to infer evolutionary distances and relationships among sequences.

Understanding Sequence Conservation โ€” Scoring matrices quantify the degree of conservation at each position in aligned sequences. Highly positive scores in the matrix indicate conserved positions (important for structure or function), whereas lower scores suggest regions that are more tolerant to substitutions. This helps in identifying functionally important regions in proteins.

Optimal Alignment Selection โ€” Scoring matrices aid in selecting the most optimal sequence alignments by assigning scores to possible alignments and selecting the one(s) with the highest score. This is crucial for tasks such as identifying conserved domains or motifs across related sequences.

Quantifying Similarity and Homology โ€” Scoring matrices provide a quantitative measure of similarity and homology between sequences. Higher scores generally indicate stronger similarity and potential homology, whereas lower scores suggest more distant relationships.


BLOSUM MATRICES

  • It stands for Block Substitution Matrix.
  • The BLOSUM matrix was developed by Henikoff and Henikoff in the early 1990s.
  • They created the matrix by examining a large number of related protein sequences and calculating the frequency at which different amino acids were substituted for each other.
  • The Blosum approach was introduced by Henikoff and Henikoff in 1992.
  • It is a substitution matrix used for sequence alignment for proteins.
  • The BLOSUM matrix is a square matrix where each cell represents a substitution score. The score in each cell indicates how often one amino acid is substituted for another in related proteins. The higher the score, the more likely the substitution occurs.
  • Blosum are based on local alignments.
  • It is used to score alignments between evolutionary diverged protein sequences.
  • All matrices are calculated directly; no extrapolations are used.
  • Blosum is derived from sequence blocks โ†’ sequence block represents a portion of ungapped region of MSA โ†’ these regions represent highly conserved segments among a group of proteins.
  • First the MSA is performed using related sequences; from the final MSA, highly conserved regions are collected which are the ungapped regions of MSA (called as blocks). These blocks from the initial dataset are used for construction of BLOSUM Matrix.
  • The aim is to observe alignment pairs of different amino acids in the related protein families and derive amino acid substitutions which occur too frequently in homologous sequences.

In BLOSUM:

  • High number โ†’ closely related sequence
  • Low number โ†’ distant sequence

Different versions of the BLOSUM matrix exist, such as BLOSUM30, BLOSUM45, BLOSUM62, BLOSUM80, etc. The number refers to the percentage identity cutoff used to generate the matrix. For example, BLOSUM62 is generated from sequences sharing at least 62% identity.

The BLOSUM matrix is primarily used in sequence alignment algorithms like BLAST and FASTA which are crucial for comparing protein or DNA sequences.

  • BLOSUM80: 80% similarity (closely related)
  • BLOSUM62: 62% โ€” default matrix of BLAST
  • BLOSUM45: 45% similarity (distantly related)

BLOSUM-62 is the most popular for best general optimal alignment.

Hence BLOSUM acts as a substitution matrix for alignment of distantly related sequences, while BLOSUM80 is suitable for alignment of closely related sequences.

Steps to Construct BLOSUM Matrix

Step 1 โ€” Data Filtering: Sequences beyond the similarity level are filtered out from the block. E.g. BLOSUM80 โ†’ all sequences beyond 80% are filtered out. After filtering, any randomly chosen pair from the block will have at most 80% identity. For this step, a unitary matrix is used where a score of 1 is given to identity/match and a score of 0 is given for mismatch.

Step 2: In the second step, relative frequencies of all the amino acids present in the block are calculated.

Let Fi = relative frequency of amino acid i

Fi = Total number of amino acid (i) present in block / Total number of amino acids present in the block

The total number of amino acids present in the block can be calculated by multiplying the number of rows with the number of columns of the block.

Step 3: Total number of observed alignments between all pairs of amino acids are calculated. Then their relative frequencies are calculated by dividing each possible alignment pair.

Let F(ij) denotes relative frequency of an alignment between amino acid i and j:

F(ij) = total number of times that alignment pair is observed in the data / m ร— nC2

Where m is the total number of columns in the block and n is the number of sequences in the column.

Step 4: As the number of observed alignments of an alignment pair ij directly depends upon their relative frequency, in this step the expected alignment between any amino acid is calculated according to their relative frequency by:

  • Eij = 2Fi ร— Fj, if i โ‰  j
  • Eij = Fi ร— Fj, if i = j

Step 5: In this step we are testing two hypotheses โ€” the probability that 2 amino acids aligned just by chance, versus the probability that they aligned due to a homology relationship between 2 proteins.

Odd ratio, Rij = Observed frequency / Expected alignment frequency = Fij / Eij

Where i and j are any amino acid from the 20 standard amino acids.

  • If the observed frequency of alignment between amino acids is higher than their expected frequency of alignment, this implies that there is strong similarity between these 2 amino acids. Usually this substitution occurs between amino acids belonging to the same functional group.
  • However, if the observed alignment is much less than the expected alignment, it implies that 2 amino acids are not compatible with each other (e.g. substitution between aliphatic and charged amino acids).

The log odd ratio = logโ‚‚(Rij)

BLOSUM score for alignment between amino acid i and j:

Sij = 2 ร— Rij (rounded off to the nearest whole digit)


PAM

  • It stands for Point Accepted Mutation, also known as Percentage of Acceptable Point Mutation per 100.
  • It was introduced by Margaret Dayhoff in 1979.
  • PAM is based on global alignment (evolutionary model).
  • An accepted mutation is a mutation that occurred and was positively selected by the environment.

Evolutionary time โ†’ measured in % mutation

  • E.g. PAM 1 = 1% mutation = One mutation per 100 residues.
  • In PAM, each column and row represents one of the twenty standard amino acids.
  • A PAM is a relative measure of evolutionary distance:
    • PAM1 = 1 accepted mutation per 100 amino acids
    • PAM250 = 250 mutations per 100 amino acids, so 2.5 mutations per amino acid
  • PAM is derived from global alignment of closely related protein sequences.

Relation between PAM and BLOSUM

  • BLOSUM matrix with a higher number and PAM matrix with low numbers are both designed for comparisons of closely related sequences.
  • BLOSUM matrix with a low number and PAM matrix with high numbers are designed for comparisons of distantly related proteins.

Code for the PAM vs BLOSUM divergence diagram:

Bioinformatics Alignment Logic

Evolutionary Divergence Scale

PAM 1
BLOSUM 80
PAM 120
BLOSUM 62
PAM 250
BLOSUM 45
โ—€
โ–ถ
Close Relatives
Increasing Divergence
Distant Relatives

PAM vs BLOSUM Comparison Table

Bioinformatics Utility

Substitution Matrix Comparison

PAMBLOSUM
Based on mutational model of evolution.Based on multiple alignment of blocks.
Designed to track the evolutionary origin.Designed to find proteins conserved domains.
PAM matrices are based on global alignments of closely related sequences.BLOSUM matrices are based on local alignments.
The PAM 1 is calculated from comparison of sequences with no more than 1% divergence.BLOSUM62 is calculated from comparisons of sequences with at least 62% identity in the blocks.
Other PAM matrices are extrapolated from PAM1.All BLOSUM matrices are based on observed alignments. They are not extrapolated.
Higher the number of PAM, more divergent species are present.Lower the number of BLOSUM, more divergent species are present.
It is built from a small amount of data.It is built from a vast amount of data.
It performs better for finding global alignments and remote homologs.It is better for finding local alignments.

BLAST (Basic Local Alignment Search Tool)

  • BLAST is a greedy algorithm that was developed by Altschul et al.
  • The BLAST algorithm is a widely used tool for comparing biological sequences.
  • It is a sequence similarity search program that is used to compare a query sequence with a sequence database and finds the similarity between them.
  • It is similar to FASTA but more efficient.
  • BLAST performs “Local” alignments.
  • BLAST is a heuristic method which means that it is a dynamic programming algorithm that is faster, efficient but relatively less sensitive.
  • For BLAST, any sequence has a query sequence and a target sequence/database. The query sequence is the sequence for which we want to find out the similarity, and the target sequence is a sequence/database against which the query sequence is aligned.
  • BLAST returns the output in the form of hit tables that are arranged in decreasing order of matched accession numbers along with their titles, query coverage, sequence identity, score, and an E-value in separate columns.
  • The reliability of the compared sequences is assessed by E-value.
  • BLAST is an improvisation over FASTA in the sense that it is faster, more sensitive, more statistically significant, and easy to use.
  • BLAST can be used to infer functional and evolutionary relationships between sequences as well as help identify members of gene families.
  • BLAST is an open-source program and anyone can download and change the program code.

BLAST performs the alignment in 3 basic steps:

  1. First, BLAST applies a word search in which it removes the higher complexity regions and then looks for short stretches of a fixed length of the query sequence.
  2. Secondly, BLAST identifies the exact word matches from the database. Those words which have scored equal to or greater than the threshold (S) are taken for alignment. These obtained alignments are called “Hits”.
  3. Lastly, BLAST extends the alignment in both directions as an ungapped alignment that stops at the maximum score and inserts a gap.

Working of BLAST

BLAST identifies homologous sequences using a heuristic method which initially finds short matches between two sequences; thus, the method does not take the entire sequence space into account. After an initial match, BLAST attempts to start local alignments from these initial matches. This also means that BLAST does not guarantee the optimal alignment, thus some sequence hits may be missed. In order to find optimal alignments, the Smith-Waterman algorithm should be used.

The BLAST algorithm works as follows:

  1. Seeding โ€” Prepare a list of short, fixed-length segments (words) from the query.
  2. Searching โ€” Find highly similar or exact matches for each word.
  3. Extension โ€” Extend each match to a larger match.
  4. Evaluation โ€” Evaluate the results using E-values.

BLAST Algorithm (Detailed Steps)

  1. The first step is to create a lookup table or list of words from the query sequence (also called seeding). BLAST takes the query sequence and breaks it into short segments called words. For protein sequences, each word is usually three amino acids long, and for DNA sequences, each word is usually eleven nucleotides long.
  2. The second step is to search a database of known sequences to find any sequences that contain the same words as the query sequence. This identifies database sequences containing the matching words.
  3. BLAST then scores the similarity of the matching words. The matching of words is scored by a given substitution matrix. If a word is above a certain threshold, it is considered a match.
  4. Two commonly used substitution matrices for protein sequences are PAM (Percent Accepted Mutations) and BLOSUM (Blocks Substitution Matrix). For nucleotide sequences, the scoring matrix is based on match-mismatch scoring.
  5. The fourth step involves pairwise alignment by extending the words in both directions while counting the alignment score using the same substitution matrix. If the score drops below a certain threshold due to differences or mismatches, the alignment stops. The resulting aligned segment pair without gaps is called the High-Scoring Segment Pair (HSP).

BLAST also calculates a statistical significance value for each alignment called the E-value (Expect value). The E-value represents the probability of obtaining a sequence match by random chance. A lower E-value indicates that the sequence match is less likely to be a result of random occurrence. Hence, the lower the E-value, the higher the level of significance.

Types of BLAST

BLASTN โ€” Compares a nucleotide query sequence to a nucleotide sequence database.

BLASTP โ€” Compares a protein query sequence to a protein sequence database.

BLASTX โ€” Compares a nucleotide query sequence to a protein sequence database by translating the query sequence into its six possible reading frames and aligning them with the protein sequences.

TBLASTN โ€” Compares a protein query sequence to a nucleotide sequence database by translating the nucleotide sequences in all six reading frames and aligning them with the protein sequence.

TBLASTX โ€” Compares a nucleotide query sequence to a nucleotide sequence database by translating the query sequence in all six reading frames and aligning them with the nucleotide sequences.

PSI-BLAST (Position-Specific Iterative BLAST) โ€” Very sensitive and usually used for protein similarity search. It is a program based on the BLAST 2.0 algorithm designed to detect relationships between the query and members of the database not necessarily detectable by standard BLAST searches. The query sequence is taken and subjected to BLASTP which results in the formation of a multiple sequence alignment (MSA) of the most similar sequences.

PHI-BLAST (Pattern-Hit Initiated BLAST) โ€” A search program that combines matching of regular expressions with local alignments surrounding the match. The program uses an input sequence and a defined pattern to query a protein database. The pattern is defined in PROSITE format and is used as the seed for the alignment.

Advantages of BLAST

  • BLAST is fast and efficient, making it possible to handle large databases of sequences.
  • It is a flexible and versatile tool as it can be used to search for similarities in both nucleotide and protein sequences.
  • It is highly sensitive, which allows the identification of even small similarities between sequences.
  • It aims to identify regions of local similarity between the query sequence and the database sequence, rather than attempting to align the entire sequences.
  • It has a user-friendly interface that makes it easy to input query sequences and interpret the results.

Limitations of BLAST

  • Sensitivity to Sequence Divergence โ€” BLAST may have difficulty identifying very distantly related sequences due to their low sequence similarity.
  • False Positives โ€” BLAST may generate false-positive matches due to random similarities or sequence motifs that are not functionally significant.
  • Statistical Significance โ€” Determining the statistical significance of BLAST matches can be challenging.
  • Alignment Quality โ€” BLAST primarily focuses on local alignments, which may result in alignments with variable quality. Gaps and mismatches are allowed, potentially leading to alignments that are not biologically meaningful.
  • Database Bias โ€” Biases in the database can impact the accuracy and interpretation of results, particularly when analyzing sequences from non-model organisms or poorly characterized species.
  • Resource Requirements โ€” BLAST searches can be computationally intensive, especially when analyzing large datasets or running complex algorithms.

Applications of BLAST

  1. Sequence Similarity Search โ€” BLAST is primarily used to search for similar sequences in large databases, allowing researchers to identify homologous sequences and infer functional and evolutionary relationships.
  2. Functional Annotation โ€” It helps annotate newly sequenced genes or proteins by comparing them with sequences of known function.
  3. Identification of Conserved Domains โ€” BLAST can identify conserved protein domains or motifs shared among different proteins, aiding in functional and structural predictions.
  4. Genomic and Metagenomic Analysis โ€” BLAST is used to analyze whole genomes or metagenomic datasets to study genetic variations, gene function, and microbial diversity.
  5. Drug Discovery โ€” BLAST can identify potential drug targets by comparing pathogen sequences with host sequences to find differences that could be exploited for drug development.
  6. Evolutionary Studies โ€” BLAST is used in phylogenetic analysis to study evolutionary relationships by comparing sequences across different species.
  7. Primer Design โ€” BLAST can be utilized to design primers for PCR experiments. By aligning primers against a database of sequences, BLAST facilitates the identification of potential binding sites and the evaluation of specificity and cross-reactivity.
  8. Comparative Genomics โ€” BLAST is widely used to compare and analyze the genomes of various organisms. It facilitates the identification of conserved regions, gene families, gene duplications, and genome rearrangements.

FASTA

  • FASTA stands for “fast-all” or “FastA”.
  • It was the first database similarity search tool developed, preceding the development of BLAST.
  • FASTA is another sequence alignment tool which is used to search similarities between sequences of DNA and proteins.
  • FASTA uses a “hashing” strategy to find matches for a short stretch of identical residues with a length of k. The string of residues is known as k-tuples or ktups, which are equivalent to words in BLAST, but are normally shorter than the words.
  • Typically, a ktup is composed of two residues for protein sequences and six residues for DNA sequences.
  • The query sequence is broken down into sequence patterns or words known as k-tuples and the target sequences are searched for these k-tuples in order to find the similarities between the two.
  • FASTA is a fine tool for similarity searches.

FASTA Programs

FASTA was originally developed for comparing protein sequences. The original program was referred to as FASTP. There are now different FASTA programs available:

FASTA โ€” Compares a DNA query sequence against a database of DNA sequences or a protein query sequence against a database of protein sequences using the FASTA algorithm.

SSEARCH โ€” Performs protein-protein or DNA-DNA comparisons using the Smith-Waterman algorithm.

GGSEARCH / GLSEARCH โ€” Works using a global alignment algorithm (GGSEARCH) or a combination of global and local alignment algorithms (GLSEARCH) to compare protein and nucleotide sequences.

FASTX / FASTY โ€” Compares a DNA sequence and a database of protein sequences by translating the DNA sequence into three frames and allowing gaps and frameshifts.

TFASTX / TFASTY โ€” Compares a protein sequence and a database of DNA sequences. The DNA sequence is translated in six frames โ€” three in the forward direction and three in the reverse direction.

FASTF / TFASTF โ€” Compares mixed peptide sequences against protein (FASTF) or translated DNA (TFASTF) databases.

FASTS / TFASTS โ€” Compares a set of short peptide fragments against the protein (FASTS) or translated DNA (TFASTS) databases.

Working of FASTA

Step 1: Identifying Regions (Hashing Step) The first step is identifying regions with high similarity by creating a lookup table for the query sequence. The query sequence is first broken down into smaller words known as k-tuples (ktup). When the ktup value is increased, the number of background word hits is reduced, enhancing the overall search speed. k-tuple is usually 2 for proteins and 6 for nucleotide sequences. Similar regions are represented as diagonals in a two-dimensional matrix. The ten regions with the highest density of word matches are the high-similarity regions, and these best ten diagonals are saved.

Step 2: Re-Scoring The ten best diagonals are rescored using suitable scoring matrices. For protein, BLOSUM50 or PAM matrix is used; for DNA sequences, the identity matrix is used. A subregion with the highest score is identified for each of the rescanned diagonal regions. These high-scoring subregions within the diagonals are called initial regions.

Step 3: Joining Threshold Next, a score cutoff or the joining threshold is applied that excludes segments unlikely to be part of the final alignment. The library sequences are ranked based on their initial scores. The regions with initial scores above the pre-set threshold are selected and checked to see if they can be joined together. This step introduces gaps between the diagonals while applying gap penalties.

Step 4: Final Alignment Finally, the gapped alignment is refined to produce the final alignment using the banded Smith-Waterman algorithm, which is a dynamic programming algorithm that calculates the optimal score (opt) for alignment. This score is used for statistical calculations.

Applications of FASTA

  1. FASTA can be used in sequence alignment to identify regions of similarity. This is useful for identifying conserved regions in DNA or protein sequences, which can help to identify functional domains or motifs and provide insights into the biological function of the sequence.
  2. FASTA can be used to search large databases of sequences to find matches to a given query sequence. This helps to identify homologous sequences, which can help to predict the function of a newly identified sequence.
  3. FASTA can construct phylogenetic trees by aligning sequences from different species and identifying evolutionary relationships between them.

BLAST vs FASTA Comparison Table

Sequence Alignment Algorithms

BLAST vs. FASTA Technical Profile

BLASTFASTA
BLAST stands for Basic Local Alignment Search Tool.FASTA is short for “fast-all” or “FastA”.
BLAST is an algorithm for comparing primary biological sequence information like nucleotide or amino acid sequences.FASTA is a DNA and protein sequence alignment software package.
It is hosted by NCBI.It is hosted by EBI.
BLAST uses local sequence alignment.FASTA uses local sequence alignment first and then extends the similarity search to global alignment.
BLAST searches similarities in local alignment by comparing individual residues in the two sequences.FASTA searches similarities in local alignments by comparing sequence patterns or words.
BLAST uses a substitution matrix to find matching words.FASTA identifies identical matching words using the hashing procedure.
BLAST is better for similarity searching in closely matched or locally optimal sequences.FASTA is better for similarity searching in less similar sequences.
BLAST works best for protein searches.FASTA works best for nucleotide searches.
In BLAST, gaps between query and target sequences are not allowed.In FASTA, gaps are allowed.
BLAST is more speedy than FASTA.FASTA is less speedy when compared to BLAST.
BLAST is a sensitive bioinformatics tool.FASTA is more sensitive than BLAST.

SMITH-WATERMAN ALGORITHM

  • The Smith-Waterman algorithm is a dynamic programming algorithm used for local sequence alignment.
  • Dynamic programming is used for optimal alignment of two sequences. It finds the alignment in a more quantitative way by giving some scores for matches and mismatches (Scoring matrices), rather than only applying dots. By searching the highest scores in the matrix, alignment can be accurately obtained.
  • The Dynamic Programming solves the original problem by dividing the problem into smaller independent sub-problems.
  • Smith-Waterman algorithm was first proposed by Temple F. Smith and Michael S. Waterman in 1981.
  • Used to find optimal local alignment โ†’ we search for conserved sequences โ†’ do not count all matches โ†’ suitable for aligning more distantly related (divergent) sequences โ†’ used for finding out conserved patterns of DNA.
  • This algorithm is used for determining nucleotide or protein sequences.
  • It is built on the idea of comparing segments of all possible lengths between the sequences to identify the best local alignment.

Basic Steps of Smith-Waterman Algorithm

1) Initialization

  • The first approach is to create a matrix with M+1 columns and N+1 rows, where M and N correspond to the size of the sequences to be aligned.
  • The first row and first column are filled with zero.

2) Matrix Fill-up

The recurrence relation is:

M(i,j) = max of:

  • M(i-1, j-1) + S(i, j)
  • M(i, j-1) + g
  • M(i-1, j) + g
  • 0

Where:

  • M(i,j) is the cell of the i-th row and j-th column
  • S(i,j) is the score of aligning the i-th character of sequence 1 and j-th character of sequence 2
  • Unlike the Needleman-Wunsch algorithm, negative scores are not allowed. If the value comes out negative, it is replaced by 0.

3) Traceback and Sequence Alignment

  • This is the final step for the appropriate alignment. First, one needs to find out the maximum score obtained in the entire matrix for the local alignment of the sequences. It is possible that the maximum score can be present in more than one cell; in that case there may be two or more alignments.
  • The traceback follows the path of the highest scores, and when a zero score is encountered, the alignment stops.
  • Unlike the Needleman-Wunsch algorithm, the traceback step may not always start from the rightmost lower cell.
  • The traceback path provides the optimal local alignment of the two sequences, including matches, mismatches, and gaps.

Tools and Applications of Smith-Waterman Algorithm

  1. EMBOSS โ€” A C++ programmed tool, available as both an online and offline tool, which facilitates pairwise local alignment using the Smith-Waterman algorithm. The key term of the algorithm/program is “water”.
  2. BLAST โ€” An online as well as offline tool, which uses the Smith-Waterman algorithm for the alignment process.
  3. BioEdit โ€” A GUI offline tool, which has the facility to align a pair of sequences using the Smith-Waterman algorithm.
  4. EBI-EMBL โ€” Provides an online pairwise alignment tool to align a pair of sequences using the Smith-Waterman algorithm.

PSI-BLAST

  • Position-Specific Iterated (PSI) BLAST is very sensitive and usually used for protein similarity search.
  • PSI-BLAST (Position-Specific Iterated BLAST) is an iterative search algorithm used to find distant homologs of a protein sequence.
  • PSI-BLAST is a program based on the BLAST 2.0 algorithm that is designed to detect relationships between the query and members of the database not necessarily detectable by standard BLAST searches.
  • It works by iteratively building a Position-Specific Scoring Matrix (PSSM) based on the alignments found in previous iterations.

Working of PSI-BLAST

1. Initialization โ€” PSI-BLAST starts with a regular BLAST search using the query sequence against a protein sequence database (e.g., NCBI’s non-redundant protein database). It generates an initial list of similar sequences (hits) along with their alignment scores.

2. Building the PSSM โ€” The initial hits are used to build a PSSM, which is a matrix where each column represents a position in the aligned sequences, and each row represents an amino acid residue. The PSSM contains information about the conservation of amino acids at each position based on the alignments found in the initial search.

3. Iterative Searching โ€” The PSSM is used to perform another round of database searching. This time, the search is more sensitive because the PSSM considers the conservation of amino acids at each position. New hits are identified based on the PSSM, and their alignments are used to update the PSSM.

4. Convergence โ€” The iterative process continues for a predefined number of iterations or until convergence criteria are met (e.g., no new hits found). The final PSSM represents a refined model of the query sequence, incorporating information from multiple iterations of database searching.

5. Scoring and Reporting โ€” The final hits are scored based on the alignment with the PSSM, and statistical measures (e.g., E-value, bit score) are calculated to assess the significance of the matches. PSI-BLAST reports the hits that pass a predefined significance threshold, along with their alignments and scores.

Advantages of PSI-BLAST over Regular BLAST

  1. It can detect remote homologs.
  2. Identification of weakly conserved domains.
  3. Reducing false positives.
  4. Profile-based search.
  5. Database expansion.
  6. Enhanced sensitivity โ€” PSI-BLAST can often find more significant matches than regular BLAST, especially for sequences with low similarity, by using the PSSM to refine the search for similar sequences in subsequent iterations.

POSITION SPECIFIC SCORING MATRIX (PSSM)

  • A Position-Specific Scoring Matrix (PSSM), also known as a position weight matrix (PWM) or position-specific weight matrix (PSWM), is a commonly used representation of motifs (patterns) in biological sequences.
  • PSSM represents the likelihood of each nucleotide or amino acid occurring at each position in a sequence motif.
  • It is constructed from a set of aligned sequences that are thought to be functionally related.
  • They are widely used in tasks such as motif discovery, sequence alignment, and the prediction of binding sites.

Steps to Construct PSSM

  1. First, collect all the homologous sequences.
  2. Perform multiple sequence alignment of all the homologous sequences.
  3. A basic Position Frequency Matrix is created by counting the occurrences of each nucleotide or amino acid at each position in the aligned sequences.
  4. The Position Frequency Matrix is then converted into a Position Probability Matrix by normalizing the counts, dividing each count by the total number of sequences.
  5. Finally, construct the PSSM. The Position Probability Matrix (PPM) can be transformed into a position weight matrix by taking the logarithm of the probabilities, often incorporating pseudo-counts to avoid zero probabilities.

Example of PSSM Construction

Given sequences:

ATGGCATA
CTGGGTAT
ACGGCTTA
GTGGCAAT

Code for the Relative Frequency Table:

Profile Scoring Matrix

Motif Probability Analysis

PositionATGC
10.500.250.25
200.50.750.25
30010
40011
5000.250.75
60.50.500
70.50.500
80.50.500

Uses of PSSM

  1. Transcription Factor Binding Site Prediction โ€” PSSMs are extensively used to predict transcription factor binding sites in genomic sequences. By scoring different regions of DNA, PSSMs can identify potential binding sites where transcription factors are likely to bind, thereby regulating gene expression.
  2. Protein Function Prediction โ€” In proteins, PSSMs can be used to predict functional sites such as:
    • Active Sites: Identifying regions where enzymatic reactions occur.
    • Protein-Protein Interaction Sites: Locating regions involved in binding other proteins.
    • Post-Translational Modification Sites: Predicting sites for phosphorylation, glycosylation, etc.
  3. Motif Discovery โ€” PSSMs can be derived from known sequences to discover new motifs within uncharacterized sequences. Tools like MEME use PSSMs to identify conserved patterns across multiple sequences, suggesting functional or regulatory elements.
  4. Comparative Genomics โ€” By comparing PSSMs across different species, researchers can identify conserved motifs that might play crucial roles in regulatory processes. This helps in understanding evolutionary conservation and functional importance of these motifs.
  5. Sequence Alignment โ€” PSSMs are used in multiple sequence alignment algorithms to score alignments based on the presence of motifs. This aids in aligning sequences more accurately by considering the biological significance of specific regions.
  6. Functional Annotation of Genomes โ€” PSSMs contribute to the annotation of genomes by identifying regions that are likely to have functional significance. This includes identifying coding regions, regulatory elements, and other genomic features.
  7. Enhancer and Promoter Analysis โ€” PSSMs help in identifying enhancer and promoter regions in DNA, which are crucial for the regulation of gene transcription. This is particularly important in understanding the regulatory architecture of genes.
  8. Mutation Analysis โ€” In disease research, PSSMs can be used to analyze the impact of mutations. By comparing the scores of wild-type and mutant sequences, researchers can predict whether a mutation is likely to disrupt a functional or binding site.
  9. RNA Structure and Function Prediction โ€” Similar to DNA and proteins, PSSMs can be applied to RNA sequences to predict functional sites, such as ribosome binding sites or regions involved in splicing.
  10. Regulatory Element Identification โ€” PSSMs are used to identify regulatory elements in DNA sequences by matching known motifs.