The databases that underpin modern cancer genomics have a geography problem. The gnomAD database the gold standard for allele frequency data used by geneticists worldwide draws roughly 64% of its reference population from people of European ancestry. South Asians, who make up nearly a quarter of humanity, account for around 12%. The 1000 Genomes Project, another widely used reference, is similarly skewed. When a clinical geneticist in Mumbai or Chennai opens a variant interpretation tool and runs a patient’s sequencing data through it, the tool’s sense of “rare” and “common” was calibrated almost entirely on someone else’s genome.
The consequences are not theoretical. A variant that appears rare in European populations — and therefore gets flagged as potentially pathogenic might be perfectly ordinary in South Asian ones. The patient gets a false alarm. Conversely, a genuinely damaging variant common enough in European databases to be filtered out as “benign” might actually be rare and dangerous in the Indian context. It gets missed entirely.
This isn’t a new observation. Researchers have been raising it for years. What has been rarer is someone building a concrete computational solution and putting it to work on real data.
Why “Rare Variant” Means Something Different Depending on Whose DNA You Measured
To understand why this matters, it helps to understand how variant interpretation pipelines work.
When a patient undergoes tumour sequencing, the result is a list of thousands of genetic variants — single nucleotide changes, insertions, deletions — scattered across the genome. Most are irrelevant to cancer. The job of a bioinformatics pipeline is to filter that noise and surface the variants most likely to be driving disease.
One of the most important filters is allele frequency: how common is this variant in the general population? If a variant appears in 20% of healthy Europeans, it’s almost certainly not causing cancer. If it’s never been seen before, or appears in fewer than 1 in 100 people, it’s worth investigating.
The problem is that “how common is this variant in the general population” is only as good as whose population you measured. A variant with a gnomAD global frequency of 0.5% might have a South Asian-specific frequency of 3% — common enough in India to be almost certainly benign, but rare enough globally to get flagged as suspicious.
MutExpress-India specifically uses gnomAD_SAS_AF — the South Asian allele frequency field — as its primary rarity filter, rather than the global or European frequencies that most pipelines default to. Variants must be rare in South Asians specifically (below 1% frequency) to pass through.
That single change in filter logic is the conceptual foundation of the whole project.
How MutExpress-India Combines Genomic and RNA-Seq Data to Rank Breast Cancer Gene Candidates
Filtering on South Asian allele frequency alone would still leave tens of thousands of variants. To narrow the list further and to do it in a way that’s biologically meaningful — the pipeline adds a second layer: RNA sequencing data.
The logic is straightforward. A variant in a gene is more likely to matter if that gene is also behaving abnormally in tumour tissue. A gene that carries a rare, predicted-damaging variant AND shows significant differential expression in breast cancer tumours has two independent lines of evidence pointing at it. Genes with only one line of evidence, or neither, get ranked lower.
For the expression layer, the pipeline uses GSE62944 a curated dataset of TCGA RNA-Seq read counts covering 1,119 BRCA tumour samples and 113 matched normals. DESeq2, the standard negative binomial model for differential expression analysis, identified 3,306 genes showing statistically significant expression changes between tumour and normal tissue (adjusted p-value below 0.05, fold change above 1.5 in either direction).
For the variant layer, the GDC API was queried to download 504 TCGA-BRCA Masked Somatic Mutation MAF files — one per patient — and merged into a single dataset of 47,844 somatic mutations. These were filtered using the South Asian allele frequency threshold combined with pathogenicity predictors: SIFT, PolyPhen, and VEP IMPACT scores.
The two filtered lists — 36,106 rare-in-SAS damaging variant candidates and 3,306 significant DEGs — were merged using an outer join. Every gene received one of four priority tiers:
HIGH: Rare damaging SAS variant and significant DEG. 1,702 genes. MEDIUM-V: Rare SAS variant only, no matched expression change. 11,034 genes. MEDIUM-E: Significant DEG only, no matching rare variant. 1,604 genes. LOW: Neither criterion met.
The 1,702 HIGH-tier genes are the core output — the candidates that warrant the closest clinical and experimental attention in the Indian breast cancer context.
Tested Against COSMIC: Does the Pipeline Actually Find Real Cancer Drivers?
Numbers mean nothing without a reality check. Did the pipeline actually surface genes that are known to matter in breast cancer?
To test this, an ablation study was run against the COSMIC Cancer Gene Census — a curated list of genes with strong evidence for roles as cancer drivers, assembled by the Sanger Institute. Twenty known BRCA driver genes from this list served as a gold standard to measure how many each version of the pipeline recovered.
The results were telling.
The variant-only approach flagged 12,736 genes and found 19 of the 20 known drivers — excellent recall at 95%. But at that scale, precision was only 0.15%. Most of those 12,736 genes are noise.
The expression-only approach flagged 3,306 genes and recovered 4 of the 20 drivers — recall drops, but the list is smaller.
The dual-layer MutExpress approach flagged 1,702 genes. It also found 4 of the 20 known drivers — the same as expression-only. But it did so from a list 87% shorter than the VCF-only result. Precision rose to 0.24%, a 4.4-fold improvement over using variants alone.
The interpretation matters. MutExpress-India doesn’t claim to find more cancer drivers than existing tools. Its shorter, more precise list is simply more useful. When you have 1,702 candidates instead of 12,736, the proportion that are genuinely relevant is higher. Experimental follow-up — expensive, slow, and still done by human researchers — becomes more tractable.
The precision improvement also validates the dual-layer logic. The expression filter is doing real biological work, not just adding noise on top of the variant filter.
The Pathway Analysis: What the 1,702 HIGH-Priority Genes Are Actually Doing in Tumour Biology
A useful sanity check for any gene prioritization pipeline is pathway enrichment — asking whether the genes it surfaces cluster in biological processes known to be relevant to the cancer type.
The 1,702 HIGH-priority genes were analysed using clusterProfiler (a Bioconductor R package) for Gene Ontology and KEGG pathway enrichment. The top enriched GO Biological Process terms included nuclear division, mitotic sister chromatid segregation, extracellular matrix organisation, and hormone metabolic process. The KEGG results highlighted cadherin signalling, integrin signalling, ECM-receptor interaction, and hormone signalling among the most significantly enriched pathways.
These are not random hits. Cell division and chromosome segregation failures are defining features of cancer. Cadherin and integrin signalling govern cell adhesion and are directly implicated in tumour invasion and metastasis — the processes responsible for most breast cancer deaths. ECM dysregulation underpins the altered tumour microenvironment that makes breast cancers particularly aggressive. Hormone signalling is fundamental to oestrogen-receptor-positive BRCA, the most common subtype.
The pathway results don’t prove that the 1,702 genes are all genuine drivers. But they strongly suggest that the list isn’t random — it’s enriched for genes operating in exactly the processes you’d expect to be disrupted in breast cancer.
The enrichment analysis also passes what researchers call the non-circularity test: the GO and KEGG databases were never consulted during the filtering pipeline itself. Pathway enrichment was done after the fact, as an independent validation step. Finding biologically coherent results is a genuine positive signal, not a tautology built into the method.
BRCA1, BRCA2, TP53, PIK3CA, ERBB2: The Known Cancer Genes That Made the Cut
Among the 1,702 HIGH-priority genes, several well-established cancer names appear prominently. BRCA1 and BRCA2 — the most studied breast cancer susceptibility genes — are both present. So are TP53 (mutated in roughly 30% of all cancers), PIK3CA (one of the most frequently mutated genes in breast cancer specifically), and ERBB2 (also known as HER2, the target of trastuzumab therapy).
Their presence is strong internal validation: a pipeline calibrated on South Asian frequency data still recovers the genes every breast cancer researcher knows about. It hasn’t thrown the baby out with the bathwater.
What the pipeline additionally surfaces is a large set of less-characterised genes that pass both the SAS rarity and expression criteria but haven’t been studied as thoroughly in the breast cancer context. These are the scientifically interesting ones candidates that might matter specifically in Indian patients, or that have been overlooked in primarily European datasets. Investigating them is future work, and exactly the kind of work this pipeline was built to enable.
The Live Tool: Running MutExpress-India on Your Own Cancer Genomics Data
MutExpress-India in Action
Every section of the live Streamlit dashboard and companion data converter — annotated for researchers and clinicians evaluating the pipeline.

Priority Tier Distribution — All 14,340 Genes Classified
Hero dashboard with four tier metric cards, donut chart of priority distribution, and bar chart of the top 15 most-mutated genes in the HIGH tier. MEDIUM-V dominates at 76.9% (rare SAS variant, no matched DEG), while HIGH at 11.9% represents the dual-evidence core output. All HIGH-tier genes carry at least one rare South Asian variant AND are significantly differentially expressed in TCGA-BRCA tumours.

Expression Change vs. Variant Severity — 1,702 HIGH Priority Genes
X-axis: log₂ fold change from DESeq2 tumour vs. normal (GSE62944, 1,119 BRCA tumours vs. 113 normals). Y-axis: damage score (1–3), sum of SIFT, PolyPhen, and VEP IMPACT predictions.
Red dots = upregulated. Blue dots = downregulated. Top-right quadrant = strongest dual-evidence candidates.

Gene Ontology Enrichment — Top 15 Biological Processes
GO Biological Process enrichment via clusterProfiler on the 1,702 HIGH-priority genes. Dot size = gene count per term; dot colour = adjusted p-value. Top enriched terms: regulation of membrane potential, nuclear division, ECM organisation, hormone metabolic process.

KEGG Pathway Enrichment — Top 15 Disease Pathways
KEGG enrichment via enrichKEGG. Most significant: Neuroactive ligand–receptor interaction. Also enriched: Cadherin, Integrin, ECM-receptor interaction, Hormone signalling, and ABC transporters (drug resistance).

Ablation Testing vs. COSMIC CGC — 4.4× Precision Improvement
Benchmarked against 20 known BRCA driver genes from COSMIC Cancer Gene Census — an independent gold standard never consulted during pipeline construction. VCF-Only: 12,736 genes at 0.15% precision. Dual-Layer MutExpress-India: 1,702 genes at 0.24% precision — a 4.4× improvement with an 87% shorter candidate list.

Run the Pipeline on Your Own Data — Any Cancer Type
Upload a variant file (TSV) and a DEG file (CSV). The dashboard runs the full dual-layer integration, assigns priority tiers, and — if HIGH tier has 5+ genes — runs live GO and KEGG enrichment via gseapy. Output downloadable as CSV.
Fully disease-agnostic: works for type 2 diabetes, cardiovascular disease, tuberculosis susceptibility, or any condition with South Asian variant and expression data.

Convert Variant File — MAF, VCF, CSV, XLSX
Upload any mutation file from cBioPortal, GDC, or TCGA. Auto-detects columns. Produces priority_variants.tsv.

Convert Expression File — DESeq2, edgeR, GEO, XLSX
Upload any DEG results file. Converts Ensembl IDs to HGNC via API. Produces significant_degs.csv.
priority_variants.tsv → Converter Tab 02 → significant_degs.csv → upload both to main dashboard → priority tiers + GO/KEGG enrichment for any cancer type.The pipeline isn’t just a one-time analysis — it’s a fully deployed, publicly accessible tool.
The main dashboard runs as a Streamlit web application at mutexpress-india.streamlit.app. It shows the full results interactively: priority tier distribution, a scatter plot of all 1,702 HIGH-priority genes positioned by log₂ fold change and damage score, GO and KEGG enrichment dot plots, and the ablation validation table. Every chart is interactive and explorable in the browser with no installation required.
A companion tool — the MutExpress Converter — handles the practical problem that real genomic data arrives in many formats. Researchers working with cBioPortal downloads, GDC MAF files, custom VCFs, GEO expression tables, or XLSX spreadsheets can upload their data and have it automatically converted into the format MutExpress expects. The converter auto-detects column names, strips formatting artefacts that R sometimes introduces to CSV outputs, and converts Ensembl IDs to HGNC gene symbols via the mygene.info API where needed.
The upload tab in the main dashboard then lets any researcher run the dual-layer integration on their own data — variant files from any cancer type, expression files from any cohort — and receive priority tier assignments, a distribution chart, and live GO/KEGG enrichment on the resulting HIGH-tier genes.
The whole pipeline is open source under the MIT licence on GitHub.
Where MutExpress-India Fits in the Broader Push Toward South Asian Precision Medicine
MutExpress-India sits within a growing recognition that precision medicine cannot fulfil its promise while its foundational databases are so demographically narrow.
Several large-scale efforts are underway to fix this at the database level. The IndiGen programme, run by CSIR’s Institute of Genomics and Integrative Biology, is building a reference genome dataset from across India’s diverse ethnic groups. The GenomeAsia 100K project has assembled whole-genome sequences from thousands of individuals across Asian populations. The Indian Genome Variation Consortium has been cataloguing genetic diversity across Indian subpopulations for over two decades.
What MutExpress-India does is different in character. It’s a tool that uses what already exists — gnomAD’s SAS frequency data, TCGA’s open-access mutation data, publicly available RNA-Seq counts — and applies it in a way that most clinical variant interpretation pipelines currently don’t. It doesn’t require waiting for a complete South Asian reference genome. It works now, with existing data, and produces results demonstrably better calibrated for Indian patients than the standard approach.
That’s not a small thing. India diagnoses roughly 200,000 new breast cancer cases per year. Most of those patients’ genetic data, if sequenced at all, will be interpreted using tools that weren’t built with them in mind. The gap between what precision medicine promises and what it actually delivers for non-European populations is real, measurable, and — as this project shows — not technically intractable.
Limitations, Next Steps, and the Path from Research Tool to Clinical Use
As a Master’s project, MutExpress-India is a proof of concept. It demonstrates that the dual-layer approach works, that South Asian allele frequency filtering improves precision, and that the pipeline recovers biologically coherent results. It is not yet a clinical tool.
Several things would need to happen before it gets there. The pipeline needs validation against Indian-specific cohort data — ideally sequencing data from Indian breast cancer patients, not TCGA-BRCA which is predominantly American. The 1,702 HIGH-priority genes need experimental follow-up to determine which are genuine drivers and which are passengers. And the tool needs stress-testing by clinical geneticists who would actually use it, not only bioinformaticians.
The pipeline is also, by design, disease-agnostic. The same two-layer approach could be applied to type 2 diabetes, cardiac disease, tuberculosis susceptibility, or any condition where Indian population-specific variant prioritization is relevant. Breast cancer was the test case because the data are publicly available and the known driver genes are well-characterised enough to validate against. The architecture doesn’t change for other diseases only the input data does.
Whether the tool finds its way into clinical use will depend on factors well outside the scope of a dissertation: regulatory pathways, institutional adoption, funding for validation studies, and the perennial challenge of getting computational tools into the hands of the clinicians who could benefit from them.
But the analysis is sound. The code is public. The tool works. And the problem it addresses that genomic medicine systematically underserves South Asian patients because its databases were built on other people’s DNA isn’t going away on its own.
MutExpress-India dashboard: mutexpress-india.streamlit.app
Data converter: mutexpress-india-converter.streamlit.app
Source code: github.com/Shibasis-Rath/mutexpress-india
Researcher: Shibasis Rath, MSc Bioinformatics, RathBiotaClan (rathbiotaclan.com)
Data sources: TCGA-BRCA Masked Somatic Mutation MAF files (GDC Portal, open access); GSE62944 TCGA RNA-Seq normalized counts (NCBI GEO); gnomAD v3.1.2 South Asian allele frequencies (Broad Institute); COSMIC Cancer Gene Census (Sanger Institute). All code and results are publicly available under MIT licence.
















