This module provides an overview of the National Center for Biotechnology Information (NCBI) and its role in bioinformatics. You will learn about the NCBI data model, including how biological data such as DNA, RNA, and protein sequences are structured, stored, and interconnected. The content covers major databases such as GenBank, the genome and protein databases, and the Entrez system for data retrieval. It also explains DNA sequence databases, genome mapping concepts, and the distinction between genetic and physical mapping. Additionally, you will learn how to submit nucleotide sequence data to GenBank using tools such as BankIt and Sequin, and get an overview of the NCBI resources widely used in bioinformatics.
NCBI DATA MODEL
- The NCBI (National Center for Biotechnology Information) provides a comprehensive data model for organizing biological and biomedical data.
- This data model forms the foundation for various NCBI databases and resources, including GenBank, PubMed, and others. While the specific details may vary depending on the database or resource, here’s a general overview of the NCBI data model.
Bioinformatics Data Types
NCBI databases house a wide range of bioinformatics data types, including:
- Nucleotide sequences (DNA, RNA)
- Protein sequences
- Genomic annotations
- Taxonomic information
- Biological pathways
- Molecular structures
- Literature references
- Biological samples and experiments
Hierarchical Organization
- The NCBI data model often follows a hierarchical structure to organize biological data.
- For example, genomic data may be organized into chromosomes, genes, transcripts, and proteins.
- Taxonomic data may be organized into kingdoms, phyla, classes, orders, families, genera, and species.
Unique Identifiers
- Each entity in the NCBI data model is typically assigned a unique identifier.
- These identifiers facilitate cross-referencing and linking between different databases and resources.
- Examples include GenBank accession numbers for nucleotide sequences and PubMed identifiers (PMIDs) for literature references.
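As a small illustration of working with such identifiers in code, the sketch below splits a sequence identifier written as accession.version into its two parts. The helper name `split_accession` and the example identifier are hypothetical and chosen only for illustration.

```python
from typing import NamedTuple, Optional


class SeqId(NamedTuple):
    accession: str          # stable identifier for the record
    version: Optional[int]  # None when no ".version" suffix is present


def split_accession(identifier: str) -> SeqId:
    """Split 'ACCESSION.VERSION' into its parts (hypothetical helper)."""
    accession, dot, version = identifier.strip().partition(".")
    if not accession:
        raise ValueError("empty accession")
    if dot and not version.isdigit():
        raise ValueError(f"version must be an integer: {identifier!r}")
    return SeqId(accession, int(version) if dot else None)


if __name__ == "__main__":
    # Placeholder identifier, used only to show the shape of the result.
    print(split_accession("AB000001.2"))
```

Keeping the accession and the version separate makes it easy to ask for "any version of this record" or "exactly this sequence" when cross-referencing.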
Metadata and Annotations
- NCBI databases include extensive metadata and annotations associated with biological data.
- This metadata may include information about sequence features, experimental conditions, sample characteristics, publication details, and more.
- Annotations provide context and additional information to aid in data interpretation and analysis.
Standardized Formats
- NCBI databases often use standardized formats for data representation, such as FASTA for sequences and XML or JSON for structured data.
- Standardized formats enable interoperability and compatibility with bioinformatics tools and resources.
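To show why a standardized format helps tool interoperability, here is a minimal sketch of a FASTA reader: a header line starting with ">" followed by one or more sequence lines. The function name `parse_fasta` and the toy records are assumptions made for this example, not part of any NCBI tool.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {header: sequence} for FASTA-formatted lines."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is None:
            raise ValueError("sequence data before the first '>' header")
        else:
            records[header] += line  # sequences may span several lines
    return records


# Toy input invented for illustration.
EXAMPLE = """>seq1 toy nucleotide record
ATGGCC
TAA
>seq2 toy protein record
MA
"""

if __name__ == "__main__":
    for name, seq in parse_fasta(EXAMPLE.splitlines()).items():
        print(name, len(seq))
```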
Search and Retrieval
- NCBI databases provide powerful search and retrieval capabilities to access biological data.
- Users can query databases using keywords, sequence patterns, accession numbers, taxonomy terms, and other criteria.
- Search results are returned in a structured format, allowing users to filter, sort, and analyze the data.
Integration with Tools and Resources
NCBI databases are integrated with a wide range of bioinformatics tools, resources, and services.
This integration enables users to perform various analyses, visualize data, compare sequences, and access related information seamlessly.
STRUCTURE DATABASE
Structure databases are specialized repositories that store three-dimensional structural information on biological macromolecules such as proteins and nucleic acids.
This information includes shape, conformation, and interaction sites.
Examples of structure databases include PDB, MMDB, SCOP, and CATH.
They also offer visualization tools and algorithms for structural comparison and analysis.
They are essential for:
- Structural Biology Research – Enabling researchers to study the three-dimensional structures of bio-molecules to understand their function, interaction, and evolution.
- Drug Design and Development – Facilitating structure-based drug design by providing insights into the molecular targets for drug binding.
- Function Prediction – Assisting in predicting the functions of newly discovered proteins based on structural homology to known proteins.
Features and Functions of Structure Databases
- Hierarchical Classification – Databases like CATH and SCOP provide hierarchical classification of structures to understand their evolutionary relationships.
- Visualization Tools – Offer tools for visualizing three-dimensional structures to analyze molecular interactions and conformations.
- Advanced Search Options – Facilitate efficient retrieval of structures based on various parameters like sequence, name, or function.
- Regular Updates and Annotations – Continuously updated with new structures and provide detailed annotations about each entry.
PDB
- PDB (Protein Data Bank) was the first structure database.
- It was established at Brookhaven National Laboratory (BNL) by Walter Hamilton in 1971.
- PDB was later managed by the Research Collaboratory for Structural Bioinformatics (RCSB).
- The RCSB consisted of researchers from Rutgers University, the San Diego Supercomputer Center at the University of California, and the National Institute of Standards and Technology.
- PDB is a primary structure database for biomolecules such as nucleic acids, proteins, carbohydrates, and their complexes.
- PDB contains 3D structures of proteins determined by X-ray crystallography and nuclear magnetic resonance (NMR) studies.
- PDB no longer accepts theoretical structure models.
- Protein structure data can be deposited in the PDB using the web-based AutoDep Input Tool (ADIT).
- ADIT allows researchers to deposit their experimentally determined structures.
- As of the latest update cited in this module, PDB contains 200,708 structures.
- PDB stores structure data in a CIF-based format called mmCIF (macromolecular Crystallographic Information File).
- Protein structures from PDB can be displayed with molecular graphics programs such as RasMol, Chime, PyMOL, and Cn3D.
- URL: http://www.rcsb.org/pdb/
MMDB
- MMDB stands for Molecular Modeling Database.
- It provides information about the 3D structures of biological macromolecules, including their chemical composition, sequence, and interactions.
- It is a secondary structure database for biomolecules at NCBI and is part of NCBI's Entrez system of integrated databases.
- The database contains structures from experimental sources such as X-ray crystallography and NMR spectroscopy as well as computational predictions.
- MMDB data are derived from PDB, and records can be accessed by PDB ID.
- MMDB records fill in information that is missing from the PDB file, such as sequence residue information and missing atomic coordinates.
- MMDB provides a visualization tool called Cn3D for viewing its structures.
- It is also used for molecular docking studies, drug design, and molecular dynamics simulations.
- Data in MMDB are stored with the .Cn3 file extension.
SCOP
- SCOP stands for Structural Classification of Proteins.
- It offers a detailed and comprehensive description of the structural and evolutionary relationships between all proteins whose structure is known.
- It contains a detailed hierarchical classification of the protein structures available in the PDB.
- It provides open access to its classified structures and is a valuable resource for studying evolutionary relationships.
- It is constructed primarily through visual comparison and manual grouping of structures from PDB.
- It was started by the Laboratory of Molecular Biology, MRC, Cambridge, UK, with the purpose of classifying protein 3D structures in a hierarchical scheme of structural classes.
- It is maintained manually, and all protein structures in the PDB are classified.
- The database is updated regularly.
- The database takes the information present in PDB and adds a layer of analysis.
- Proteins are arranged according to their evolutionary, functional, and structural relationships, the basic unit being the protein domain.
- The hierarchy levels of SCOP, from broadest to most specific, are class, fold, superfamily, and family.
Class – SCOP has seven classes, based on secondary structure content and size.
Fold – The next level of the SCOP hierarchy is the fold. Structures are defined as having a common fold if they have the same major secondary structures in a similar arrangement and with the same topological connections. Being grouped in the same fold does not mean that the proteins evolved from a common ancestor. Folds are grouped under five main structural classes:
- All alpha
- All beta
- Alpha/beta (α/β)
- Alpha and beta (α+β)
- Multidomain folds
Superfamily – Two proteins are assigned to a common superfamily if they belong to the same class and the same fold and their sequence and structural similarities suggest that they may share a common evolutionary origin.
Family – The last level of the SCOP hierarchy is the family. Two proteins belong to a common family if there is a clear evolutionary relationship between them, shown by similar function, similar structure, and sequence similarity.
CATH
- CATH stands for Class, Architecture, Topology, Homology.
- It is a protein structure classification database.
- It is a manually curated database providing a hierarchical classification of protein domain structures.
- It classifies protein structures retrieved from PDB into a hierarchical framework.
- It allows free access to its classified structures and offers tools for structural analysis.
- CATH helps researchers understand the functional and evolutionary relationships among proteins and their underlying 3D structures.
- CATH has four major levels:
Class (C) – The broadest level of classification, representing different structural and functional groups of proteins.
- Class is assigned automatically to more than 90% of the proteins in the database.
- It recognises three major classes: mainly-alpha, mainly-beta, and alpha-beta.
Architecture (A) – This level describes the overall shape of the domain structure. The overall shape is derived from the orientations of the secondary structures but ignores the connectivity between them.
Topology (T) – This level describes the detailed arrangement of individual secondary structural elements within a protein architecture.
- This level of the hierarchy is the 'fold' level.
- Structures are grouped into fold families after considering both the overall shape and the connectivity of the secondary structures.
- For example, if two proteins belong to the same class and architecture, and their connectivity and folding pattern are the same, they belong to the same topology.
Homologous superfamily (H) – This level groups proteins with highly similar structures and functions. If two proteins belong to the same class, architecture, and topology and are thought to have evolved from the same ancestral protein, which can be supported by a high level of sequence similarity, they belong to the same homologous superfamily.
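The four-level hierarchy can be sketched in code. CATH identifies a homologous superfamily with a dotted four-number code, one number per level (C.A.T.H), with class numbers 1, 2, and 3 standing for mainly-alpha, mainly-beta, and alpha-beta. The dotted code is background knowledge rather than something stated in this module, and the helper names `parse_cath` and `deepest_shared_level` are hypothetical.

```python
from typing import NamedTuple

# Class numbers for the three major classes named in the text.
CLASS_NAMES = {1: "mainly-alpha", 2: "mainly-beta", 3: "alpha-beta"}

LEVELS = ["none", "class", "architecture", "topology", "homologous superfamily"]


class CathCode(NamedTuple):
    c: int  # Class
    a: int  # Architecture
    t: int  # Topology
    h: int  # Homologous superfamily


def parse_cath(code: str) -> CathCode:
    """Parse a dotted C.A.T.H code such as '3.40.50.720'."""
    parts = code.split(".")
    if len(parts) != 4 or not all(p.isdigit() for p in parts):
        raise ValueError(f"expected four dot-separated integers: {code!r}")
    return CathCode(*(int(p) for p in parts))


def deepest_shared_level(x: CathCode, y: CathCode) -> str:
    """Name the most specific level at which two domains are grouped together."""
    depth = 0
    for left, right in zip(x, y):
        if left != right:
            break
        depth += 1
    return LEVELS[depth]


if __name__ == "__main__":
    d1, d2 = parse_cath("3.40.50.720"), parse_cath("3.40.50.300")
    print(CLASS_NAMES[d1.c], deepest_shared_level(d1, d2))
```

Two domains that agree on C, A, and T but differ on H share a topology (fold) without being placed in the same homologous superfamily, which mirrors the distinction drawn above.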
GenBank
- GenBank is NCBI's primary nucleotide sequence database and part of the International Nucleotide Sequence Database Collaboration (INSDC).
- GenBank is one of the fastest-growing repositories of known nucleotide sequences.
- It was created in 1979 at Los Alamos National Laboratory, where it was called the Los Alamos Sequence Database.
- It was renamed GenBank in 1982 and became a public database.
- NCBI began accepting direct submissions to GenBank in 1993.
- It is a comprehensive, public database of nucleotide sequences.
- It supports bibliographic and biological annotation.
- It is an open-access, annotated collection of nucleotide sequences.
- It is maintained by NCBI at the National Institutes of Health (NIH).
- It can be accessed and searched through the Entrez gateway at NCBI.
- URL: http://www.ncbi.nlm.nih.gov/genbank/
- GenBank has a flat file structure. Each record contains a header, a feature table, and the sequence.
- A GenBank release occurs every two months and is made available on the FTP site.
- At the time of writing, the latest GenBank release notes, for release 253 (December 2022), are available on the NCBI FTP site.
- As of 12 December 2022, GenBank release 253 has 21.38 trillion bases and 3.25 billion records.
- Between releases 252.0 and 253.0, an average of 11,763 records were added or updated per day.
- The ASN.1 and flat file formats are the ones most commonly used in this database.
- Sequences can be submitted through the BankIt tool available on the NCBI website.
- There are several ways to search and retrieve data from GenBank:
  - Search GenBank for sequence identifiers and annotations with Entrez Nucleotide, which is divided into three divisions: CoreNucleotide (the main collection), dbEST (Expressed Sequence Tags), and dbGSS (Genome Survey Sequences).
  - Search and align GenBank sequences to a query sequence using BLAST.
  - Search, link, and download sequences programmatically using NCBI E-utilities.
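For the programmatic route, the sketch below builds an E-utilities `efetch` request URL for a nucleotide record. It only constructs the URL, so it runs offline; the accession shown is a placeholder, and the helper name `efetch_url` is an assumption made for this example.

```python
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def efetch_url(accession: str, rettype: str = "gb", retmode: str = "text") -> str:
    """Build an efetch URL for the Nucleotide database (db=nuccore).

    rettype "gb" asks for the GenBank flat file, "fasta" for FASTA.
    """
    params = {"db": "nuccore", "id": accession, "rettype": rettype, "retmode": retmode}
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


if __name__ == "__main__":
    # Placeholder accession, used only to show the shape of the URL.
    print(efetch_url("AB000001.1", rettype="fasta"))
```

Passing the resulting URL to `urllib.request.urlopen` would download the record; that step is left out here so the example does not depend on network access.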
GenBank file format
- The GenBank format allows information to be stored alongside a DNA or protein sequence.
- GenBank uses a flat file format: an ASCII text file that can be read and downloaded by both humans and computers.
- The GenBank file format contains several fields that provide information about the sequence, such as its name, accession number, source organism, features, references, and annotations.
- The sequence itself is introduced by the word ORIGIN and ends with two slashes (//).
- It holds much more information than the FASTA format. Formats similar to GenBank have been developed by ENA (EMBL format) and by DDBJ (DDBJ format).
LOCUS – The LOCUS field contains a number of different data elements, including locus name, sequence length, molecule type, GenBank division, and modification date.
- Locus name – Short name of the sequence.
- Sequence Length – Number of nucleotide base pairs (or amino acid residues) in the sequence record.
- Molecule Type – The type of molecule that was sequenced.
- GenBank Division – The GenBank division to which a record belongs, indicated with a three-letter abbreviation.
- Modification Date – The date in the LOCUS field is the date of last modification.
DEFINITION – Brief description of sequence; includes information such as source organism, gene name/protein name, or some description of the sequence’s function (if the sequence is non-coding).
ACCESSION – The unique identifier for a sequence record.
VERSION – A nucleotide sequence identification number that represents a single, specific sequence in the GenBank database.
KEYWORDS – Word or phrase describing the sequence. If no keywords are included in the entry, the field contains only a period.
SOURCE – Free-format information including an abbreviated form of the organism name, sometimes followed by a molecule type.
ORGANISM – The formal scientific name for the source organism (genus and species, where appropriate) and its lineage, based on the phylogenetic classification scheme used in the NCBI Taxonomy Database.
REFERENCE – Contains literature citations for the sequence, including authors, title, journal, and publication details.
AUTHORS – List of authors in the order in which they appear in the cited article.
TITLE – Title of the published work or tentative title of an unpublished work.
FEATURES – Information about genes and gene products, as well as regions of biological significance reported in the sequence. These can include regions of the sequence that code for proteins and RNA molecules, as well as a number of other features.
ORIGIN – Includes the actual sequence data, presented as a series of nucleotides or amino acids.
// (Double Slash) – Marks the end of the entry, signaling the conclusion of the sequence record.
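The field layout described above can be illustrated with a minimal flat file reader. This is a sketch under simplifying assumptions: keywords occupy the first 12 columns, only the first line of each field is kept, and the feature table is skipped. The record is a toy invented for this example, and `parse_genbank` is a hypothetical helper, not an NCBI tool.

```python
WANTED = {"DEFINITION", "ACCESSION", "VERSION"}

# Toy record invented for this example; it is not a real GenBank entry.
RECORD = """\
LOCUS       TOY0001                   24 bp    DNA     linear   SYN 01-JAN-2000
DEFINITION  Toy record for illustration only.
ACCESSION   TOY0001
VERSION     TOY0001.1
KEYWORDS    .
SOURCE      synthetic construct
  ORGANISM  synthetic construct
FEATURES             Location/Qualifiers
     source          1..24
ORIGIN
        1 atggccattg taatgggccg ctga
//
"""


def parse_locus(value: str) -> dict:
    """Split the LOCUS line into the data elements described in the text."""
    fields = value.split()
    return {
        "name": fields[0],
        "length": int(fields[1]),
        "molecule_type": fields[3],
        "division": fields[-2],
        "modification_date": fields[-1],
    }


def parse_genbank(text: str) -> dict:
    record = {"sequence": ""}
    in_origin = False
    for line in text.splitlines():
        if line.startswith("//"):
            break  # end of the entry
        if in_origin:
            # Sequence lines hold a position number and blocks of bases.
            record["sequence"] += "".join(ch for ch in line if ch.isalpha())
            continue
        if line.startswith(" "):
            continue  # indented sub-keywords and feature lines are skipped
        keyword, value = line[:12].strip(), line[12:].strip()
        if keyword == "ORIGIN":
            in_origin = True
        elif keyword == "LOCUS":
            record["locus"] = parse_locus(value)
        elif keyword in WANTED:
            record[keyword.lower()] = value
    return record


if __name__ == "__main__":
    print(parse_genbank(RECORD))
```

For real records, an established parser from a bioinformatics library is a better choice, since actual entries contain multi-line fields and a rich feature table that this sketch ignores.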
Steps to submit nucleotide sequence data to GenBank
Prepare Your Data
- Ensure your nucleotide sequence data is complete and accurate.
- Annotate your sequences with relevant biological information, such as gene names, features, and source organism.
Format Your Data
- Use the correct file format, such as FASTA for sequences and a feature table prepared for tools such as Sequin or tbl2asn for more detailed annotations.
Register for a Submission Account
- If you do not already have an account, register for one at the NCBI Submission Portal.
Use Submission Tools
BankIt – A web-based submission tool for smaller submissions.
Sequin – A standalone software tool suitable for larger or more complex submissions.
tbl2asn – A command-line program that automates the creation of sequence records for submission to GenBank.
Fill Out the Submission Form
- Follow the prompts to provide information about your sequences, including organism name, source, and any relevant publication information.
- Upload your sequence data and any additional files, such as annotation files or sequence alignments.
Review and Submit
- Carefully review your submission for accuracy.
- Submit your data through the chosen tool.
Receive Accession Number
- After submission, you will receive a unique accession number for your sequence data. This number can be used to track your submission and reference your data in publications.
Respond to Feedback
- If there are any issues with your submission, you may receive feedback from GenBank staff. Address any comments or requested revisions promptly.
GenBank data submission tools
SEQUIN
- It is a stand-alone software tool developed by NCBI.
- It is used for submitting and updating sequences in the GenBank, EMBL, and DDBJ databases.
- It can handle long sequences and sets of sequences, such as:
  - Segmented entries
  - Multiple annotations
  - Population, phylogenetic, and mutation studies
- It also allows sequence editing and updating, and provides complex annotation capabilities.
- It is more sophisticated software with advanced features such as:
  - Graphical viewing
  - Automatic annotation of complex sequences
  - Built-in validation functions for enhanced quality assurance, and editing features
- Key features are:
  - It is a stand-alone sequence submission tool.
  - It is designed by NCBI.
  - It is used as a submission tool for complex biological sequences.
WEBIN
- This submission tool supports submission of single sequences as well as complex sequences in bulk.
- It is used when rapid submission of sequences is required.
- It is also advanced software and can be used for multiple annotations and phylogenetic studies.
- The Webin tool provides validation checks to ensure that submitted data are accurate and complete.
- It supports multiple file formats, including FASTA, GenBank, and EMBL, among others.
- SAR Webin is used by small-scale submitters.
- The following types of submission are supported by Webin:
  - Genome assemblies
  - Transcriptome assemblies
  - Annotated sequences
  - Read data submissions (FASTQ, BAM, CRAM)
  - Taxonomy reference sets
- Key features are:
  - Web-based sequence submission tool
  - Tool for submission to EMBL
  - Submission tool for complex biological data
BANKIT
- It can only be used to submit simple types of biological data.
- It can only be used when the data do not involve complicated annotations.
- It cannot be used when advanced sequence analysis tools are required.
- The categories of information necessary for sequence submission are:
  - Reference information: author names, publication
  - Source information: organism genus and species, taxonomic lineage, cultured/uncultured
  - Source category: original sequence or third-party sequence
  - Features: exons, introns, CDS
  - Sequence in FASTA format, at least 200 nucleotides long
- Key features are:
  - It is a web-based sequence submission tool.
  - It is used for submission to NCBI GenBank.
  - It is a step-by-step submission tool for simple data.
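Before starting a submission, a quick local check of each sequence can catch obvious problems. The sketch below tests the 200-nucleotide minimum stated in the list above; the check against the IUPAC nucleotide alphabet is an added assumption about what a sensible pre-check might include, and `check_sequence` is a hypothetical helper, not part of BankIt.

```python
# IUPAC nucleotide codes, including ambiguity symbols (assumed acceptable input).
IUPAC_NUCLEOTIDES = set("ACGTURYSWKMBDHVN")
MIN_LENGTH = 200  # minimum sequence length stated in the list above


def check_sequence(seq: str) -> list:
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    seq = seq.upper()
    if len(seq) < MIN_LENGTH:
        problems.append(f"too short: {len(seq)} < {MIN_LENGTH} nucleotides")
    unexpected = sorted(set(seq) - IUPAC_NUCLEOTIDES)
    if unexpected:
        problems.append("unexpected characters: " + "".join(unexpected))
    return problems


if __name__ == "__main__":
    print(check_sequence("ACGT" * 50))  # 200 valid bases
    print(check_sequence("ACGT"))       # far too short
```

A check like this does not replace the validation performed during submission; it only avoids a round trip for the simplest mistakes.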
NCBI AND ITS RESOURCES
- NCBI stands for National Center for Biotechnology Information.
- It is a multidisciplinary research group that serves as a resource for molecular biology information.
- It is located in Bethesda, Maryland, USA.
- It was established in 1988.
- It is managed by the National Library of Medicine (NLM), which is part of the National Institutes of Health (NIH).
- NCBI creates public databases and allows the general public to submit data.
- The main goals of NCBI are:
  - To create and maintain public databases.
  - To develop software for analysing genomic data.
  - To conduct research in computational biology.
PubMed
- PubMed is a free resource supporting the search and retrieval of biomedical and life sciences literature with the aim of improving health, both globally and personally.
- The PubMed database contains more than 36 million citations and abstracts of biomedical literature.
- It does not include full text journal articles; however, links to the full text are often present when available from other sources, such as the publisher’s website or PubMed Central (PMC).
- Available to the public online since 1996, PubMed was developed and is maintained by the National Center for Biotechnology Information (NCBI), at the U.S. National Library of Medicine (NLM), located at the National Institutes of Health (NIH).
- Citations in PubMed primarily stem from the biomedicine and health fields, and related disciplines such as life sciences, behavioural sciences, chemical sciences, and bioengineering.
- PubMed facilitates searching across several NLM literature resources :
| MEDLINE | PubMed Central (PMC) | Bookshelf |
|---|---|---|
| MEDLINE is the largest component of PubMed and consists primarily of citations from journals selected for MEDLINE; articles indexed with MeSH (Medical Subject Headings) and curated with funding, genetic, chemical and other metadata. | Citations for PubMed Central (PMC) articles make up the second largest component of PubMed. PMC is a full text archive that includes articles from journals reviewed and selected by NLM for archiving (current and historical), as well as individual articles collected for archiving in compliance with funder policies. Available to the public online since 2000, PMC was developed and is maintained by the National Center for Biotechnology Information (NCBI) at NLM. | The final component of PubMed is citations for books and some individual chapters available on Bookshelf. Bookshelf provides free online access to books and documents in life science and healthcare. Search, read, and discover. Bookshelf is a full text archive of books, reports, databases, and other documents related to biomedical, health, and life sciences. |
BLAST
- BLAST stands for Basic Local Alignment Search Tool.
- It is a sequence similarity search program that compares a query sequence against a sequence database and finds the similarity between them.
- BLAST is a heuristic method: rather than running a full dynamic programming alignment against every database sequence, it uses shortcuts that make it much faster, at the cost of being somewhat less sensitive.
- It finds regions of local similarity between sequences.
- The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches.
- BLAST can be used to infer functional and evolutionary relationships between sequences as well as help identify members of gene families.
- BLAST includes several programs for aligning nucleotide and protein sequences. The basic ones are as follows:
- blastn – The query sequence is a nucleotide sequence and the target is also a nucleotide sequence, i.e., nucleotide against nucleotide.
- blastp – A protein-to-protein search in which the query sequence is a protein and the target sequence is also a protein.
- blastx – The query is a nucleotide sequence and the target is a protein sequence or database. The nucleotide query is first translated into protein in all six reading frames, and the translations are then searched against the protein database.
- tblastn – The query is a protein and the target is a nucleotide sequence or database. The protein query is searched against the nucleotide database, which is translated into protein in all six reading frames.
- tblastx – The nucleotide query is searched against a nucleotide database, but at the protein level. In other words, the nucleotide query sequence and the target sequences are both translated into their corresponding protein sequences and then aligned. Both the query and the target are translated in all 6 reading frames.
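The translated searches rest on reading a nucleotide sequence in all six reading frames: three offsets on the given strand and three on its reverse complement, as described for tblastx. The sketch below shows that translation step only, assuming the standard genetic code; it is not how BLAST itself is implemented, and the function names are chosen for this example.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate complete codons from the start of dna; '*' marks a stop codon."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)  # X = unknown


def six_frame_translation(dna: str) -> dict:
    """Return translations for frames +1, +2, +3 and -1, -2, -3."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


if __name__ == "__main__":
    for frame, protein in six_frame_translation("ATGGCCTAA").items():
        print(frame, protein)
```

Each of the six protein strings can then be compared at the protein level, which is why translated searches can detect similarity that is hard to see in the nucleotide sequences themselves.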
Nucleotide
- The Nucleotide database is a collection of sequences from several sources, including GenBank, RefSeq, TPA and PDB.
- Genome, gene and transcript sequence data provide the foundation for biomedical research and discovery.
PubChem
- PubChem is a database of chemical molecules.
- PubChem is the world's largest collection of freely accessible chemical information.
- It is maintained by NCBI.
- PubChem is an open chemistry database at the National Institutes of Health (NIH). "Open" means that you can put your scientific data in PubChem and that others may use it.
- PubChem is freely accessible through a web user interface.
- Chemicals can be searched by name, molecular formula, structure, and other identifiers.
- PubChem mostly contains small molecules, but also larger molecules such as nucleotides, carbohydrates, lipids, peptides, and chemically-modified macromolecules.
- It provides chemical and physical properties, biological activities, safety and toxicity information, patents, literature citations and more.
- Since the launch in 2004, PubChem has become a key chemical information resource for scientists, students, and the general public.
Genome
- This resource organizes information on genomes including sequences, maps, chromosomes, assemblies, and annotations.
dbSNP
- dbSNP contains human single nucleotide variations, microsatellites, and small-scale insertions and deletions along with publication, population frequency, molecular consequence, and genomic and RefSeq mapping information for both common variations and clinical mutations.
Protein
- The Protein database is a collection of sequences from several sources, including translations from annotated coding regions in GenBank, RefSeq and TPA, as well as records from SwissProt, PIR, PRF, and PDB.
- Protein sequences are the fundamental determinants of biological structure and function.
CDD
- The Conserved Domain Database is a resource for the annotation of functional units in proteins.
- Its collection of domain models includes a set curated by NCBI, which utilizes 3D structure to provide insights into sequence/structure/function relationships.
RefSeqGene
- A collection of human gene-specific reference genomic sequences.
- RefSeqGene is a subset of NCBI's RefSeq database; its records are defined based on review by curators of locus-specific databases and the genetic testing community.
- They form a stable foundation for reporting mutations, for establishing consistent intron and exon numbering conventions, and for defining the coordinates of other biologically significant variation.
- RefSeqGene is a part of the Locus Reference Genomic (LRG) Collaboration
Biological database retrieval system
- Data retrieval means obtaining data from a database management system.
- It is assumed that the data are represented in a structured way and that there is no ambiguity in the data.
- To retrieve the desired data, the user presents a set of criteria in a query.
- There are three important data retrieval systems:
  - Entrez (at NCBI)
  - Sequence Retrieval System, SRS (at EBI)
  - DBGET/LinkDB (in Japan)
- Entrez and SRS are the most popular retrieval systems for biological databases; they provide access to multiple databases and return integrated search results.
- These retrieval systems not only return matches to a query but also provide additional important information from related databases.
- Complex queries often require the use of Boolean operators. The keywords for Boolean searches during data retrieval are AND, OR, and NOT:
  - AND means that a search result must contain both words
  - OR means that a result may contain either word or both
  - NOT excludes results that contain the word following it
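The three operators map directly onto set operations over the records that match each term. The sketch below uses a toy index of record IDs invented for this example; NOT is modelled as "matches the left term but not the right term", and `boolean_search` is a hypothetical helper.

```python
def boolean_search(index: dict, left: str, op: str, right: str) -> set:
    """Combine the record sets for two terms with AND, OR, or NOT."""
    a = index.get(left, set())
    b = index.get(right, set())
    if op == "AND":
        return a & b   # records containing both terms
    if op == "OR":
        return a | b   # records containing either term, or both
    if op == "NOT":
        return a - b   # records containing the left term but not the right
    raise ValueError(f"unknown operator: {op!r}")


# Toy index: search term -> IDs of the records that mention it.
INDEX = {"kinase": {1, 2, 3}, "human": {2, 3, 4}}

if __name__ == "__main__":
    for op in ("AND", "OR", "NOT"):
        print(op, sorted(boolean_search(INDEX, "kinase", op, "human")))
```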
ENTREZ
- Entrez is a molecular biology database and retrieval system.
- It is developed and maintained by NCBI.
- It allows text-based searches across a wide variety of data, including annotated genetic sequence information, structural information, citations and abstracts, full papers, and taxonomic data.
- Boolean operators are used for text searching of sequence or bibliographic records.
- It is also known as an integrated database, and its records are cross-referenced.
- It is the entry point for exploring distinct but integrated databases.
- The key feature of Entrez is its ability to integrate information, which comes from cross-referencing between NCBI databases.
- Entrez is easy to use and convenient: users do not have to visit multiple databases, because links to related information are displayed on a single page. For example, a protein sequence page may link to abstracts, the BLAST tool, and so on.
- Entrez results can be viewed in various formats, such as flat file, FASTA, and XML.
- Entrez can be accessed through http://www.ncbi.nlm.nih.gov/Entrez/.
- The Entrez system provides access to:
  - Nucleotide sequence databases: GenBank/DDBJ/EBI
  - Protein sequence databases: Swiss-Prot, PIR, PRF, PDB, and translated protein sequences from DNA sequence databases
  - Genome and chromosome mapping data
  - The Molecular Modeling Database of 3D structures
  - Literature database: PubMed, which provides access to MEDLINE and pre-MEDLINE articles
  - Taxonomy database: allows retrieval of DNA and protein sequences for different taxonomic groups
  - Specialized databases: OMIM, dbSNP, UniSTS, etc.
- The most valuable feature of Entrez is that, by exploiting the concept of "neighbouring", it provides access to related articles in linked databases.
- For example, a nucleotide sequence page may offer cross-referencing links to the translated protein sequence, genome mapping data, the related PubMed literature, and protein structures if available (Xiong, 2006).
- Another useful feature of Entrez is that large sets of data can be retrieved on the basis of some criterion and downloaded to a local computer.
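Retrieving a set of records by criterion can also be done programmatically through the E-utilities `esearch` endpoint, which takes an Entrez query (including Boolean operators and field tags such as `[Organism]`) and returns matching record IDs. The sketch below only builds the request URL so that it runs offline; the query is an arbitrary illustration and `esearch_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """Build an esearch URL; retmax caps the number of IDs returned."""
    params = {"db": db, "term": term, "retmax": retmax}
    return f"{ESEARCH}?{urlencode(params)}"


if __name__ == "__main__":
    # Arbitrary example query combining a keyword and an organism filter.
    print(esearch_url("protein", "kinase AND human[Organism]", retmax=5))
```

The IDs returned by such a search can then be used to download the records themselves, which is the "retrieve by criterion, then save locally" workflow described above.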
GENOME MAPPING
- A genome is the complete set of genetic information in an organism. It provides all of the information the organism requires to function.
- In living organisms, the genome is stored in long molecules of DNA called chromosomes. Small sections of DNA, called genes, code for the RNA and protein molecules required by the organism.
- Genome mapping is the process of finding the locations of genes on each chromosome. The maps created by genome mapping are comparable to the maps that we use to navigate streets.
- A genetic map is an illustration that lists genes and their locations on a chromosome.
Objectives of Genome Mapping
In a genomic map, we aim to find out the following two things:
- The linear order in which genetic units are arranged with respect to one another (gene order). For example, if there are three genes A, B and C, a genetic map determines their order of arrangement, i.e., whether they are arranged as ABC, ACB, BCA, BAC, CAB or CBA.
- The relative distance between genetic units (gene distance).
Importance
Genome mapping is an important genetic tool that has various important uses such as : –
- It can identify genes that are near to each other and therefore understand their recombination frequency and mode of transmission.
- It can be used to study the genes associated with a specific trait like eye colour in Drosophila.
- It is used to study genes associated with diseases.
- It is used to create whole-genome maps to understand the position and location of all genes in an organism.
Types of mapping
- Genetic mapping / linkage mapping
- Physical mapping
Genetic Mapping and linkage analysis
- Genetic mapping is often referred to as linkage mapping.
- A genetic map shows the relative positions of genes and other markers along a chromosome based on the frequency of recombination between them. Genes that are closer together on a chromosome are more likely to be inherited together than genes that are further apart.
- Distances on a genetic map are expressed in centimorgans (cM).
- A genetic map does not show the physical distance between genes.
- It is grounded in the principle that genes located in close proximity on a chromosome tend to be inherited together and are less likely to be separated during recombination.
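The link between recombination frequency and map distance can be made concrete with a small calculation. The sketch below assumes the usual convention that 1% recombination corresponds to roughly 1 cM, which holds only for closely linked genes because double crossovers go uncounted at larger distances. The function names and offspring counts are illustrative assumptions.

```python
def recombination_frequency(parental, recombinant):
    """Fraction of offspring that carry a recombinant combination of alleles."""
    total = parental + recombinant
    if total <= 0:
        raise ValueError("at least one offspring must be scored")
    return recombinant / total


def map_distance_cm(parental, recombinant):
    """Approximate map distance in centimorgans (1% recombination ~ 1 cM).

    Reasonable only for closely linked genes; larger distances are
    underestimated because double crossovers are not detected.
    """
    return 100.0 * recombination_frequency(parental, recombinant)


# Illustrative cross: 830 parental-type and 170 recombinant offspring.
print(map_distance_cm(parental=830, recombinant=170))
```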
Linkage analysis – Linkage analysis is a technique for investigating the inheritance patterns of genes and genetic markers within and among families. It checks for co-inheritance of genetic markers and traits of interest in order to identify chromosomal regions likely to contain genes related to those traits. Linkage analysis is based on the idea that genes lying next to each other on the same chromosome tend to be inherited together during meiosis. This condition is referred to as genetic linkage. Genetic mapping uses linkage analysis to locate genes by analysing their transmission pattern among individuals, which shows whether genes are close (strongly linked) or far apart (not linked).
Genetic analysis – This technique is used to identify the gene associated with a
Limitations of Genetic Mapping
- Genetic mapping does not provide information about the physical distance between genes; it provides only an outline of their arrangement.
- Recombination levels can differ across various parts of the genome. Some parts exhibit high rates of recombination, known as recombination hotspots, which can impact the accuracy of genetic mapping.
- In organisms where obtaining a large number of progeny is difficult, the resolution of a genetic map is limited because it depends on the number of observed crossovers.
Physical Mapping
- A physical map shows the actual physical location of genes and other DNA sequences along a chromosome.
- It is based on the number of base pairs between genes and other landmarks on the chromosome.
- It provides a more detailed and precise representation of the genomic landscape.
- Physical maps are more precise than genetic maps.
- Various techniques are available for physical mapping, such as restriction mapping and cytogenetic mapping.
  - Restriction mapping – Restriction mapping uses restriction enzymes to cut DNA at specific sequences, followed by analysis of the resulting fragments. By comparing the sizes of the fragments produced by different restriction enzymes, the order of and distance between restriction sites can be determined, which helps to create the physical map of the genome.
  - Cytogenetic mapping – A cytogenetic map is a type of genome map that focuses on the physical location of genes and other genetic elements on a chromosome. It provides a visual representation of the chromosome indicating the position of genes, structural variations and other features. It involves staining chromosomes with specific dyes to reveal characteristic banding patterns, which help in identifying individual chromosomes and their regions.
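The fragment-size reasoning behind restriction mapping can be sketched in a few lines. The example below assumes a linear DNA molecule and a single enzyme, and it only predicts fragment lengths from a known sequence; it does not reconstruct a map from several digests. EcoRI recognises GAATTC and cuts between G and A. The function names and the toy sequence are made up for illustration.

```python
def cut_positions(sequence, site, cut_offset):
    """Return the 0-based indices at which a linear sequence is cut.

    `site` is the recognition sequence and `cut_offset` is the number of
    bases into the site after which the enzyme cuts the top strand.
    """
    sequence = sequence.upper()
    site = site.upper()
    positions = []
    start = sequence.find(site)
    while start != -1:
        positions.append(start + cut_offset)
        start = sequence.find(site, start + 1)
    return positions


def fragment_lengths(sequence, site, cut_offset):
    """Lengths of the fragments produced by a complete single-enzyme digest."""
    cuts = cut_positions(sequence, site, cut_offset)
    boundaries = [0] + cuts + [len(sequence)]
    return [end - begin for begin, end in zip(boundaries, boundaries[1:])]


# Toy sequence with two EcoRI sites (G^AATTC, so cut_offset is 1).
toy = "AAA" + "GAATTC" + "CCCC" + "GAATTC" + "TT"
print(cut_positions(toy, "GAATTC", 1))     # [4, 14]
print(fragment_lengths(toy, "GAATTC", 1))  # [4, 10, 7]
```

In the laboratory the direction is reversed: fragment sizes are measured first, and the positions of the sites are inferred by comparing digests made with different enzymes.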
Limitations of Physical Mapping
- DNA fragments can be incorrectly mapped due to fragment breakage, deletion during replication, or contamination with host genetic material.
- There may be missing or incomplete coverage of DNA fragments in the mapping process, resulting in gaps.
- The restriction mapping method cannot readily be applied to large genomes.
- Physical mapping using fluorescence in situ hybridization (FISH) is difficult to carry out, and the process of accumulating data is slow. Only a limited number of map positions can be obtained in a single experiment.
Applications
1) Disease Gene Mapping – Gene mapping can identify genetic variants associated with various diseases, including cancer, cardiovascular disorders, and neurodegenerative conditions.
2) Pharmacogenomics – Knowing the position of a defective or disease-causing gene supports the development of targeted, personalized medications, an approach known as pharmacogenomics.
3) Crop Improvement – In agriculture, gene mapping is utilized to identify genes responsible for desirable crop traits, such as disease resistance, yield potential, and nutritional quality. This knowledge aids in developing genetically modified (GM) crops with enhanced characteristics and yields.
4) Forensic Genetics – Gene mapping techniques are employed in forensic genetics to analyze DNA evidence and identify individuals in criminal investigations, paternity testing, and mass disaster victim identification.
Mapping databases
The Genome Database
- The Genome Database (GDB) is the official central repository for genomic mapping data created by the Human Genome Project.
- GDB's central node is located at the Hospital for Sick Children (Toronto, Ontario, Canada).
- Members of the scientific community as well as GDB staff curate data submitted to the GDB.
- Currently, GDB comprises descriptions of three types of objects from humans:
  - Genomic segments – these include genes, clones, amplimers, breakpoints, cytogenetic markers, fragile sites, ESTs, syndromic regions, contigs, and repeats.
  - Maps – these include cytogenetic, genetic linkage, radiation hybrid (RH), STS-content, and integrated maps.
  - Variations – these represent genetic polymorphisms and polymorphic sites in the human genome.
- GDB provides a full-featured, user-friendly query interface to its database with extensive online support.
- It provides focused query interfaces and predefined reports, such as the Maps within a Region search and the Lists of Genes by Chromosome report.
- GDB's Mapview program provides a graphical interface to the genetic and physical maps available at GDB.
- The GDB website provides a Simple Search interface on its home page. This query is used when searching for information on a specific genomic segment, such as a gene or STS (Sequence-Tagged Site), and is run by entering the segment name or GDB accession number.
MGI/MGD
- Mouse Genome Informatics (MGI) is the primary public mouse genomic catalogue resource.
- Located at The Jackson Laboratory, the MGI currently encompasses three cross-linked topic-specific databases:
  - The Mouse Genome Database (MGD),
  - The mouse Gene Expression Database (GXD), and
  - The Mouse Genome Sequence project (MGS).
- The MGD has evolved from a mapping and genetics resource to include sequence and genome information and details on the functions and roles of genes and alleles.
- MGD includes information on mouse genetic markers and nomenclature, molecular segments (probes, primers, YACs and MIT primers), phenotypes, comparative mapping data, and graphical displays of linkage, cytogenetic, and physical maps.
- As of August 2018, there were over 40,500 genetic markers and 11,600 genes in MGD, with 85% and 70% of these placed onto the mouse genetic map, respectively.
- Over 4,800 genes have been matched with their human orthologue and over 1,800 with their rat orthologue.
Genetic mapping vs Physical mapping (Comparison Table)
| No. | Genetic mapping | Physical mapping |
|---|---|---|
| 1 | It is the process of determining the order of and relative distance between genetic markers on a chromosome. | It is a method of determining the order of and physical distance between genes or other DNA landmarks, measured in base pairs. |
| 2 | It depends on recombination and crossing over. | It depends on the DNA sequence of the genome. |
| 3 | Unit is centimorgans (cM) | Unit is base pairs (bp) |
| 4 | It uses genetic markers to map the distance between two genes. | It uses restriction enzymes to cut DNA at specific sequences. |
| 5 | It depicts the regions of polymorphism (regions where the DNA sequence differs) in different individuals. | It depicts the actual distance in base pairs along a stretch of DNA. |
| 6 | Less accurate due to variations in recombination rates | More accurate as it directly measures physical distances |
| 7 | Provides evidence on a genetic disorder | Helps to identify the origin of a disease |
| 8 | Techniques: linkage analysis, genetic markers, and recombination mapping | Techniques: fluorescence in situ hybridization (FISH), cytogenetic mapping, and DNA sequencing |
| 9 | Applications: understanding gene linkage, genetic mapping, and inheritance patterns | Applications: genome sequencing, identifying structural variations, and studying physical characteristics of chromosomes |
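Rows 3 and 6 of the table can be tied together with a small conversion. The sketch below assumes that a local recombination rate in cM per megabase is known for the region of interest; this rate is an input, not a constant, and it differs between species and between regions of the same genome. The function name `cm_to_bp` and the rates used are illustrative assumptions.

```python
def cm_to_bp(distance_cm, cm_per_mb):
    """Convert a genetic distance to an approximate physical distance.

    `cm_per_mb` is the assumed local recombination rate (cM per megabase).
    """
    if cm_per_mb <= 0:
        raise ValueError("recombination rate must be positive")
    return distance_cm / cm_per_mb * 1_000_000


# The same 2 cM interval under two assumed local rates.
print(cm_to_bp(2.0, cm_per_mb=1.0))  # region with an average-like rate
print(cm_to_bp(2.0, cm_per_mb=4.0))  # recombination hotspot: fewer base pairs
```

The same genetic distance spans fewer base pairs where recombination is frequent, which is why genetic maps are less accurate as a guide to physical distance.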
Cross-Database Global Query & Discovery Engine



