Documentation
Complete project documentation generated from DOCUMENTATION.md.
Cancer Transcriptomics ML: Complete Documentation
Website: cancertranscriptomics.space
Author: Polat Bakır
Purpose: Research & educational platform integrating machine-learning tumour classification with evolutionary analysis of gene signatures across 5 TCGA cancer types (BRCA, BLCA, PRAD, LUAD, UCEC)
Table of Contents
- Project Overview
- Core Hypothesis
- Biological Background
- Data Sources
- Page-by-Page Guide
- 5.1 Overview Page
- 5.2 Models Page
- 5.3 Signatures Page
- 5.4 Evolution Page
- 5.5 Results Page
- 5.6 How It Works Page
- 5.7 Methods Page
- Key Metrics & Statistics Explained
- Evolutionary Analysis In Depth
- Candidate Gene Profiles
- Validation Framework [NEW]
- Statistical Methods Reference
- Limitations & Caveats
- Glossary
1. Project Overview
Cancer Transcriptomics ML is a computational biology platform that asks a deceptively simple question: Can the genes a machine-learning model uses to tell tumour tissue from normal tissue also teach us something about the evolutionary forces shaping cancer?
The platform performs two complementary analyses:
| Analysis Layer | What It Does | Timescale |
|---|---|---|
| Machine Learning Classification | Trains three different ML models to distinguish tumour from normal tissue using RNA-seq gene expression data | Present-day snapshot |
| Evolutionary Constraint Analysis | Measures how strongly natural selection has acted on the ML-identified genes, both across species (germline) and within tumours (somatic) | Millions of years (germline) to years/decades (somatic) |
The key insight is that genes identified purely by their expression patterns (ML) turn out to be under unusually strong evolutionary constraint: they encode proteins so important for cellular function that evolution has kept them nearly unchanged for 90 million years. Yet paradoxically, these same genes accumulate protein-altering mutations in cancer at rates far above neutral expectation. This "selection paradox" is the central finding of the project.
Technical Stack
- Backend: Python FastAPI with Jinja2 templates
- Frontend: Vanilla JavaScript + Plotly.js for interactive charts, D3.js for the mind map
- Data: TCGA (The Cancer Genome Atlas) RNA-seq expression, somatic mutations, and Ensembl ortholog sequences
- ML Models: Logistic Regression, Random Forest, Multi-Layer Perceptron neural network
2. Core Hypothesis
Cancer-Maintaining Dependencies Hypothesis: Genes that are (a) identified by ML as predictive of tumour state, (b) deeply conserved across mammalian evolution (low germline dN/dS), and (c) under positive selection in tumours (high somatic dN/dS) represent core cellular functions that cancer depends on for survival.
This hypothesis rests on three pillars:
- ML Prediction: The gene's expression level reliably differs between tumour and normal tissue, meaning cancer consistently alters this gene's activity.
- Germline Conservation: The gene's protein has been kept nearly identical for ~90 million years of mammalian evolution, meaning the protein does something so essential that most mutations to it are lethal and removed by natural selection.
- Somatic Positive Selection: Within breast tumours, the gene accumulates more protein-changing mutations than expected by chance, meaning tumour cells that mutate this gene gain a growth advantage.
The intersection of all three suggests these genes are not merely bystanders in cancer; they are load-bearing pillars that cancer cannot lose. Such genes are prime candidates for therapeutic targeting because cancer cells depend on their function.
The Three Hypotheses Tested
| Hypothesis | Statement | Assessment |
|---|---|---|
| H1: ML signatures are conserved | Genes predictive of tumour state encode highly conserved proteins (germline dN/dS < genome average) | Supported: 96.4% under purifying selection, 80% under strong purifying selection |
| H2: Signatures are somatically selected | ML-predictive genes show elevated somatic dN/dS in TCGA cancer types | Supported: known drivers (TP53, PIK3CA, GATA3) detected; 163 total candidates across 5 cancer types (BRCA=6, BLCA=12, PRAD=1, LUAD=28, UCEC=116); 15 cross-cancer validated [UPDATED] |
| H3: Dual-pressure genes are dependencies | Genes under both germline constraint and somatic positive selection represent cancer dependencies | Hypothesis generated: requires experimental validation (e.g., CRISPR screens, DepMap integration) |
3. Biological Background
This section provides the foundational biology needed to understand every metric, visualisation, and interpretation on the website. It is written so that someone with basic science literacy, but no prior bioinformatics or cancer biology training, can follow the entire platform.
3.0 The Central Dogma: From DNA to Protein
All life depends on a simple information flow:
DNA →(transcription)→ mRNA →(translation)→ Protein
- DNA (deoxyribonucleic acid) is the permanent instruction manual stored in every cell's nucleus. The human genome contains ~20,000 protein-coding genes spread across 23 pairs of chromosomes (~3.2 billion nucleotide base pairs: A, T, G, C).
- mRNA (messenger RNA) is a temporary copy of a gene. When a cell needs a particular protein, it "transcribes" the gene's DNA into mRNA. The amount of mRNA for a gene in a cell reflects how actively that gene is being used, i.e. its expression level.
- Proteins are the molecular machines that do the actual work: enzymes catalyse reactions, transcription factors turn genes on/off, structural proteins build cell scaffolds, receptors receive signals, and antibodies defend against pathogens.
Why this matters for the project: RNA-seq measures mRNA levels for every gene. Because mRNA is the intermediate between the genetic blueprint (DNA) and the functional machinery (protein), measuring it tells us which parts of the genome each cell is actively using. Cancer cells use a very different set of genes compared to normal cells, and that difference is what the ML models detect.
3.0.1 What is a Gene?
A gene is a segment of DNA that contains the instructions for building one (or sometimes more) proteins. Key gene anatomy:
- Exons: The portions of a gene that encode protein sequence. When exons are stitched together, they form the coding DNA sequence (CDS).
- Introns: Non-coding stretches between exons; removed during mRNA processing ("splicing").
- Promoter: A regulatory region upstream of the gene that controls when and how much the gene is transcribed.
- Codon: A three-nucleotide unit within the CDS that specifies one amino acid. There are 64 possible codons encoding 20 amino acids plus 3 stop signals. This redundancy (multiple codons → same amino acid) is the basis for distinguishing synonymous from nonsynonymous mutations.
3.0.2 What is a Protein?
A protein is a chain of amino acids (typically 100–3,000 residues long) that folds into a specific three-dimensional structure. The shape determines function:
- Enzymes (e.g., kinases like PIK3CA): Catalyse biochemical reactions. A single amino acid change in the active site can destroy enzymatic activity or, in cancer, lock it permanently "on."
- Transcription factors (e.g., GATA3, FOXA1, TP53): Bind specific DNA sequences to activate or repress target genes. Mutations in their DNA-binding domains can alter which genes they regulate.
- Structural proteins (e.g., CDH1/E-cadherin): Maintain tissue architecture. Loss of CDH1 disrupts cell-cell adhesion and drives invasive lobular breast carcinoma.
- Receptors (e.g., ESR1/oestrogen receptor, ERBB2/HER2): Detect extracellular signals and relay them inside the cell. Amplification or mutation of receptors can make cancer cells grow without external signals.
Why protein function matters here: When we measure dN/dS, we are asking how tolerant a protein is to amino acid changes. Proteins with critical, tightly optimised structures (like transcription factor DNA-binding domains) show very low dN/dS because almost any amino acid change breaks them.
3.0.3 What is a Mutation?
A mutation is any change in the DNA sequence. Mutations can be classified by:
By origin:
- Germline mutations: Present in egg or sperm cells; inherited by offspring; present in every cell of the body. These are the mutations measured by comparing human and mouse DNA (germline dN/dS). They accumulate over millions of years.
- Somatic mutations: Arise in a single body cell during a person's lifetime (due to DNA replication errors, carcinogen exposure, UV radiation, etc.). NOT inherited. These are the mutations measured in TCGA tumour samples (somatic dN/dS). They accumulate over years to decades.
By effect on protein:
- Synonymous (silent): Changes the DNA codon but NOT the amino acid (e.g., GCC→GCT, both = Alanine). The protein is unaffected. These serve as the neutral baseline in dN/dS analysis.
- Nonsynonymous: Changes both the codon AND the amino acid:
  - Missense: One amino acid → different amino acid (e.g., Val→Glu in BRAF V600E). May alter protein function.
  - Nonsense: Creates a premature stop codon → truncated, usually non-functional protein.
  - Frameshift: Insertion or deletion that shifts the reading frame → completely garbled protein downstream.
  - Splice-site: Disrupts mRNA splicing → abnormal exon usage → altered or absent protein.
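The synonymous / missense / nonsense distinction can be made concrete with a small lookup against the standard genetic code. The following is a minimal sketch, not the platform's own annotation code: the function name `classify_substitution` is hypothetical, and it only handles single-codon substitutions (frameshift and splice-site variants need transcript-level context and are out of scope here).

```python
# Standard genetic code, enumerated with bases in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_substitution(ref_codon: str, alt_codon: str) -> str:
    """Label a single-codon substitution by its effect on the protein.

    Hypothetical helper for illustration; '*' denotes a stop codon.
    """
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "nonsense"   # premature stop codon
    if ref_aa == "*":
        return "stop_lost"  # original stop codon now encodes an amino acid
    return "missense"


# Examples taken from the text: GCC->GCT (Ala->Ala), GCC->GAC (Ala->Asp),
# GTG->GAG (Val->Glu, the BRAF V600E codon change), plus TGG->TGA (Trp->stop).
for ref, alt in [("GCC", "GCT"), ("GCC", "GAC"), ("GTG", "GAG"), ("TGG", "TGA")]:
    print(ref, "->", alt, ":", classify_substitution(ref, alt))
```

Only the first case leaves the protein unchanged, which is why synonymous changes can serve as the neutral baseline in dN/dS analysis.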
3.0.4 What is Natural Selection?
Natural selection is the process by which organisms with traits that enhance survival and reproduction become more common in a population over time. At the molecular level:
- Purifying (negative) selection: Harmful mutations are removed from the population because organisms carrying them are less fit. Proteins under purifying selection are conserved; they change very slowly over evolutionary time. A gene with dN/dS = 0.05 has had roughly 95% of its amino acid-changing mutations eliminated by purifying selection.
- Positive (Darwinian) selection: Beneficial mutations spread through the population because they confer an advantage. Proteins under positive selection accumulate amino acid changes faster than silent changes (dN/dS > 1). In cancer, "positive selection" means mutations that help tumour cells grow and survive.
- Neutral drift: Mutations with no effect on fitness accumulate randomly (dN/dS ≈ 1). These are neither helpful nor harmful.
The key insight of this project: The same gene can be under purifying selection in the germline (the protein is essential, so don't change it) and under positive selection in somatic tumour cells (cancer benefits from specific modifications). This paradox reveals cancer's strategy: hijacking the cell's most critical machinery.
3.0.5 What is an Ortholog?
An ortholog is a gene in a different species that descended from the same ancestral gene through speciation (not gene duplication). Orthologs typically retain the same function across species:
- Human TP53 and Mouse Trp53 are orthologs: both encode the p53 tumour suppressor protein.
- They share ~80% amino acid identity after ~90 million years of independent evolution.
- The fact that they are so similar after 90 million years means natural selection has been strongly conserving this protein in both lineages.
Evolutionary distances used in this project:
| Species Pair | Divergence Time | What It Tells Us |
|---|---|---|
| Human – Mouse | ~90 MYA | Primary comparison; sufficient divergence to measure selection |
| Human – Rat | ~90 MYA | Independent replicate of mouse comparison (rodent lineage) |
| Human – Dog | ~96 MYA | Non-rodent comparison; validates patterns |
| Human – Zebrafish | ~435 MYA | Extremely distant; only the most ancient, universally essential proteins remain conserved |
If a protein is >90% identical between human and zebrafish (435 million years apart), it is one of the most constrained proteins in the vertebrate genome: it performs a function so fundamental that it has been essentially unchanged since before the age of dinosaurs.
3.0.6 What is Cancer? A Molecular Perspective
Cancer is fundamentally a disease of uncontrolled cell growth driven by accumulated genetic alterations. A normal cell becomes cancerous through a multi-step process:
- Initiation: A cell acquires a mutation in a key growth-control gene (an "initiating" driver mutation). For example, a BRCA1 mutation compromises DNA repair.
- Promotion: Additional mutations accumulate over years/decades. Each mutation may provide a slight growth advantage: the cell divides a bit faster, survives a bit longer, or evades immune detection slightly better.
- Progression: The tumour becomes increasingly aggressive, invading surrounding tissues and eventually metastasising to distant organs.
A typical breast tumour carries 30–80 coding mutations, but only 3–6 are drivers; the rest are neutral passengers. This project's somatic dN/dS analysis aims to identify which genes carry driver mutations across the TCGA-BRCA cohort.
Cancer as evolution: A tumour is a population of cells undergoing Darwinian evolution. Cells with growth-promoting mutations outcompete neighbouring cells. This is why somatic dN/dS > 1 for driver genes: the nonsynonymous mutations are being positively selected because they help the tumour cell lineage expand. This within-patient evolution occurs on a timescale of years to decades, whereas germline evolution between species occurs over millions of years.
3.0.7 What is Gene Expression and Why Does Cancer Change It?
Gene expression refers to how "active" a gene is, quantified by the amount of mRNA it produces. In a normal cell, gene expression is tightly regulated: growth genes are activated when the cell needs to divide, then silenced when division is complete. In cancer:
- Oncogenes (growth-promoting genes like MYC, PIK3CA, ERBB2) become overexpressed: stuck in the "on" position, driving constant proliferation.
- Tumour suppressors (growth-inhibiting genes like TP53, RB1, BRCA1) become silenced or mutated: the brakes are removed.
- Metabolic genes are reprogrammed: cancer cells switch to glycolysis even in the presence of oxygen (the Warburg effect).
- Immune-evasion genes are activated: cancer hides from the immune system by expressing checkpoint ligands (PD-L1) or reducing antigen presentation.
These expression changes are massive, consistent, and detectable by ML. A gene like MMP11 (matrix metalloproteinase 11) may be expressed at 50× higher levels in breast tumours versus normal tissue, an easy signal for any classifier to learn. The fact that ML models achieve >99% accuracy tells us that cancer's transcriptomic rewiring is profound and reproducible across patients.
3.1 What is RNA-seq Transcriptomics?
Every cell in your body contains the same DNA, but different cell types express (activate) different sets of genes. RNA-seq (RNA sequencing) measures the activity level of every gene in a tissue sample by counting how many messenger RNA (mRNA) copies each gene has produced.
- A gene with high expression produces many mRNA transcripts, meaning the cell is actively using that gene's protein product.
- A gene with low or zero expression is effectively "turned off" in that tissue.
In cancer, gene expression is dramatically altered. Tumour suppressor genes may be silenced while oncogenes are hyperactivated. By comparing expression profiles of tumour vs normal tissue, we can identify which genes are consistently dysregulated, and that is exactly what the ML models learn to do.
Key numbers in this project:
- ~20,000 genes measured per sample
- ~13,660 genes retained after filtering low-variance genes
- Expression values are RSEM-normalised expected counts from TCGA
3.2 What is TCGA?
The Cancer Genome Atlas (TCGA) is a landmark NIH-funded project that molecularly characterised over 20,000 primary cancer and matched normal samples spanning 33 cancer types. It provides:
- RNA-seq expression for every sample (what this project uses for ML)
- Whole-exome somatic mutation data (what this project uses for somatic dN/dS)
- Clinical metadata: tumour stage, molecular subtype, survival data
For breast cancer (BRCA), TCGA provides ~1,218 samples: approximately 1,104 primary tumour samples and 114 solid tissue normal samples from adjacent tissue.
3.3 Breast Cancer Molecular Subtypes (PAM50)
Breast cancer is not one disease; it comprises molecularly distinct subtypes identified by the PAM50 gene panel:
| Subtype | Frequency | Key Features | Prognosis |
|---|---|---|---|
| Luminal A | ~40% | ER+, PR+, HER2−, low proliferation | Best |
| Luminal B | ~20% | ER+, PR±, HER2±, high proliferation | Intermediate |
| HER2-enriched | ~15% | HER2 amplified, ER−, PR− | Poor (without targeted therapy) |
| Basal-like | ~15% | Triple-negative (ER−, PR−, HER2−), high proliferation | Worst |
| Normal-like | ~10% | Resembles normal breast tissue | Variable |
This project's Subtype task trains models to distinguish Luminal A from Basal-like, the two most molecularly distinct subtypes. The models achieve perfect classification on the held-out set (AUC of 1.0), which illustrates how profoundly different the two transcriptomic landscapes are.
3.4 The dN/dS Ratio: The Gold Standard for Measuring Selection
The dN/dS ratio (also called ω, omega) is the single most important metric on this website. Understanding it deeply is essential to interpreting every chart and table.
What Are Synonymous and Nonsynonymous Mutations?
DNA is read in codons, triplets of nucleotides that each specify an amino acid. Due to the redundancy of the genetic code, some DNA changes alter the resulting amino acid (and therefore the protein) while others do not:
- Synonymous (silent) substitution: A DNA change that does NOT alter the amino acid. Example: GCC → GCT both encode Alanine. The protein is unchanged.
- Nonsynonymous substitution: A DNA change that DOES alter the amino acid. Example: GCC (Ala) → GAC (Asp). The protein's structure and function may be affected.
Why Does This Matter?
Synonymous changes are largely invisible to natural selection because the protein is the same regardless. They accumulate at a roughly constant rate over evolutionary time and serve as a molecular clock, i.e. a baseline mutation rate.
Nonsynonymous changes, however, ARE visible to selection because they alter the protein. If a nonsynonymous change is harmful, natural selection will eliminate the organisms carrying it (purifying selection). If beneficial, selection will spread it through the population (positive selection).
The Ratio
dN/dS = (rate of nonsynonymous substitution) / (rate of synonymous substitution)
| dN/dS Value | Interpretation | Biological Meaning |
|---|---|---|
| dN/dS ≪ 1 (e.g., 0.05) | Strong purifying selection | The protein is under intense functional constraint. Almost every amino acid change is harmful and removed by selection. The gene encodes something essential. |
| dN/dS < 1 (e.g., 0.3–0.9) | Purifying selection | The protein is functionally important but tolerates some variation. |
| dN/dS ≈ 1 | Neutral evolution | Nonsynonymous changes accumulate at the same rate as synonymous ones, so the protein is under little or no selective constraint. |
| dN/dS > 1 (e.g., 2.0+) | Positive selection | Nonsynonymous changes accumulate faster than synonymous. The protein is being actively modified by selection because amino acid changes confer an advantage. |
| dN/dS = ∞ | Infinite (n_syn = 0) | All observed mutations are protein-altering, with zero synonymous mutations. Strong positive selection signal, though with high statistical uncertainty due to lack of synonymous baseline. |
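The interpretation table can be read as a simple decision rule. The sketch below is illustrative only: the function name `interpret_dnds` is hypothetical, the 0.3 cut-off for "strong" purifying selection is the one quoted in the Key Findings section, and the width of the "approximately neutral" band (`neutral_band`) is an assumption, since the documentation does not define one.

```python
import math


def interpret_dnds(dnds: float, neutral_band: float = 0.05) -> str:
    """Map a dN/dS value to the qualitative categories in the table.

    neutral_band is an assumed tolerance around 1.0, not a value
    taken from the platform.
    """
    if math.isinf(dnds):
        return "infinite (no synonymous changes observed)"
    if dnds < 0.3:
        return "strong purifying selection"
    if dnds < 1.0 - neutral_band:
        return "purifying selection"
    if dnds <= 1.0 + neutral_band:
        return "approximately neutral"
    return "positive selection"


for value in (0.05, 0.5, 1.0, 2.0, math.inf):
    print(value, "->", interpret_dnds(value))
```

A point estimate alone does not establish selection; in practice the category should be read together with a significance test or confidence interval.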
3.5 Germline vs Somatic dN/dS: Two Timescales of Selection
This project applies dN/dS analysis at two fundamentally different timescales:
Germline dN/dS (Deep Evolutionary Conservation)
- What it measures: How much the protein has changed between human and mouse since their common ancestor ~90 million years ago.
- Timescale: Millions of years of evolution.
- What it tells us: If a protein has remained nearly identical for 90 million years of mammalian evolution, it performs a function so critical that almost any change to it is lethal. These are the cell's most essential genes.
- Method: Nei-Gojobori with Jukes-Cantor correction (see Section 7).
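The final step of a Nei-Gojobori style estimate can be sketched as follows. This assumes the counting of synonymous and nonsynonymous sites and differences has already been done on a codon alignment, and it shows only the Jukes-Cantor correction, d = -(3/4) ln(1 - 4p/3), applied to each proportion. The function names and the example counts are hypothetical, not values from the platform.

```python
import math


def jukes_cantor(p: float) -> float:
    """Correct an observed proportion of differences for multiple hits."""
    if p >= 0.75:
        raise ValueError("proportion >= 0.75: Jukes-Cantor distance is undefined (saturation)")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def dn_ds(nonsyn_diffs: float, syn_diffs: float,
          nonsyn_sites: float, syn_sites: float):
    """Return (dN, dS, dN/dS) from pre-computed Nei-Gojobori counts."""
    p_n = nonsyn_diffs / nonsyn_sites  # proportion of nonsynonymous differences
    p_s = syn_diffs / syn_sites        # proportion of synonymous differences
    d_n = jukes_cantor(p_n)
    d_s = jukes_cantor(p_s)
    if d_s == 0:
        ratio = math.nan if d_n == 0 else math.inf
    else:
        ratio = d_n / d_s
    return d_n, d_s, ratio


# Hypothetical counts for a strongly conserved gene.
d_n, d_s, ratio = dn_ds(nonsyn_diffs=12, syn_diffs=60, nonsyn_sites=900, syn_sites=300)
print(f"dN={d_n:.4f} dS={d_s:.4f} dN/dS={ratio:.3f}")
```

The correction matters most for the synonymous rate: over ~90 million years many synonymous sites have changed more than once, so the raw proportion underestimates dS and would inflate the ratio.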
Somatic dN/dS (Within-Tumour Selection)
- What it measures: Whether protein-altering mutations in a gene accumulate more frequently than expected by chance across TCGA-BRCA tumour samples.
- Timescale: Years to decades (the lifetime of a tumour).
- What it tells us: If a gene has somatic dN/dS > 1, tumour cells that acquire mutations in this gene have a growth advantage; the mutations are being positively selected during tumour evolution.
- Method: Binomial test comparing observed nonsynonymous/synonymous ratio to genome-wide expectation (ns_ratio ≈ 2.5).
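A minimal sketch of this kind of test, using only the standard library, is shown below. It makes two assumptions that are not confirmed by the documentation: that the neutral probability of a mutation being nonsynonymous is derived as ns_ratio / (1 + ns_ratio), and that the test is one-sided (excess of nonsynonymous mutations). The function names are hypothetical. The GATA3 counts (99 nonsynonymous, 2 synonymous) are the ones quoted elsewhere in this documentation.

```python
from math import comb


def somatic_dnds(n_nonsyn: int, n_syn: int, ns_ratio: float = 2.5) -> float:
    """Observed nonsyn/syn ratio divided by the genome-wide expectation."""
    if n_syn == 0:
        return float("inf")
    return (n_nonsyn / n_syn) / ns_ratio


def binomial_p_upper(n_nonsyn: int, n_syn: int, ns_ratio: float = 2.5) -> float:
    """Exact one-sided binomial p-value: P(X >= n_nonsyn) under neutrality."""
    n = n_nonsyn + n_syn
    p0 = ns_ratio / (1.0 + ns_ratio)  # assumed neutral P(nonsynonymous)
    return sum(
        comb(n, k) * p0**k * (1.0 - p0) ** (n - k)
        for k in range(n_nonsyn, n + 1)
    )


# GATA3 counts as quoted in this documentation.
print("dN/dS:", round(somatic_dnds(99, 2), 1))
print("p-value:", binomial_p_upper(99, 2))
```

With these counts the ratio works out to (99 / 2) / 2.5 = 19.8, matching the GATA3 value reported on the platform. Per-gene p-values from such a test still need multiple-testing correction before candidates are called.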
Critical caveat: Germline and somatic dN/dS values are NOT directly comparable in absolute magnitude. They operate on different timescales, use different methods, and have different baselines. The website uses them for relative ranking within each domain and for identifying genes that are extreme in BOTH domains simultaneously.
3.6 The Cancer Hallmarks
Cancer is characterised by a set of acquired capabilities known as the Hallmarks of Cancer (Hanahan & Weinberg, 2000; updated 2011 and 2022). These are the fundamental biological programmes that all cancers must activate:
| # | Hallmark | What It Means | Example Gene on This Site |
|---|---|---|---|
| 1 | Sustaining proliferative signalling | Cancer cells produce their own growth signals or amplify receptors so they don't need external permission to divide | PIK3CA (constitutively activates PI3K growth pathway) |
| 2 | Evading growth suppressors | Normal cells have "brakes" (tumour suppressors) that stop division when something is wrong. Cancer disables these brakes. | TP53 (the "guardian of the genome"; mutated in >50% of all cancers) |
| 3 | Resisting cell death (apoptosis) | Damaged cells normally self-destruct via programmed cell death. Cancer cells disable the self-destruct mechanism. | BCL2 family (anti-apoptotic proteins often overexpressed in cancer) |
| 4 | Enabling replicative immortality | Normal cells can only divide ~50–70 times (Hayflick limit) before their telomeres shorten critically. Cancer cells activate telomerase to maintain telomeres indefinitely. | TERT (telomerase reverse transcriptase) |
| 5 | Inducing angiogenesis | Tumours beyond ~1 mm³ need their own blood supply. Cancer cells secrete signals that recruit new blood vessels. | VEGF signalling pathway |
| 6 | Activating invasion & metastasis | Cancer cells break free from their tissue of origin, invade surrounding structures, and colonise distant organs. | CDH1/E-cadherin (loss enables cells to detach); MMP11 (degrades extracellular matrix) |
| 7 | Deregulating cellular energetics | Cancer cells reprogram their metabolism to fuel rapid growth, even using less efficient energy pathways (Warburg effect: aerobic glycolysis). | Metabolic genes in ML signatures |
| 8 | Avoiding immune destruction | The immune system normally detects and kills abnormal cells. Cancer learns to evade or suppress immune responses. | PD-L1 expression, MHC class I downregulation |
Emerging hallmarks (Hanahan, 2022): unlocking phenotypic plasticity, non-mutational epigenetic reprogramming, polymorphic microbiomes, and senescent cells.
Connection to this project: The ML models learn to classify tumour vs normal tissue by detecting expression changes across all these hallmarks simultaneously. The gene signatures are enriched for genes involved in proliferation (hallmark 1), growth suppression evasion (hallmark 2), and invasion (hallmark 6), precisely because these programmes are most dramatically altered in cancer.
3.7 The Tumour Microenvironment
A tumour is not just cancer cells. The tumour microenvironment (TME) is a complex ecosystem containing:
- Cancer cells: The malignant cells carrying driver mutations.
- Cancer-associated fibroblasts (CAFs): Stromal cells recruited by the tumour that produce extracellular matrix and growth factors. Genes like MMP11 and MFAP5 (both in the candidate list) are expressed by CAFs.
- Immune cells: T cells, macrophages, natural killer cells; some attack the tumour, others are co-opted to support it.
- Endothelial cells: Form blood vessels feeding the tumour.
- Extracellular matrix (ECM): The structural scaffold surrounding cells; cancer remodels the ECM to facilitate invasion.
Why this matters for RNA-seq analysis: Bulk RNA-seq (as used in TCGA) measures the average gene expression across ALL cell types in a tissue sample: cancer cells, fibroblasts, immune cells, and stroma mixed together. This means:
- Some ML-predictive genes may be expressed by cancer cells themselves (intrinsic cancer biology)
- Others may reflect the tumour microenvironment's response (e.g., immune infiltration markers, CAF genes)
- The distinction matters for therapeutic targeting but cannot be resolved by bulk RNA-seq alone (single-cell RNA-seq is needed)
3.8 Driver vs Passenger Mutations
Not all somatic mutations in a tumour contribute to cancer:
- Driver mutations: Confer a selective growth advantage to the tumour cell. They are positively selected and recur across independent tumours. Detectable by somatic dN/dS > 1.
- Passenger mutations: Neutral "hitchhikers" that happened to be present in the cell when a driver mutation occurred. They accumulate passively and show dN/dS ≈ 1.
3.9 Breast Cancer Biology in Depth
Breast cancer is the most common cancer in women worldwide (~2.3 million new cases/year). Understanding its biology is essential for interpreting this platform's results.
Anatomy and Cell Types
The breast contains mammary glands (lobules) connected by ducts, embedded in fatty and connective tissue. Two key cell types line the ducts:
- Luminal epithelial cells: Line the inner surface of ducts. Express oestrogen receptor (ER/ESR1) and progesterone receptor (PR/PGR). Most breast cancers arise from these cells (Luminal A/B subtypes).
- Basal/myoepithelial cells: Form the outer layer of ducts. Express keratins (KRT5, KRT14) and contractile proteins. Basal-like breast cancers resemble these cells.
The molecular subtype of a breast cancer reflects which cell type it most resembles and which signalling programmes are active:
| Subtype | Resembles | Key Receptors | Key Pathways | Treatment |
|---|---|---|---|---|
| Luminal A | Luminal cells | ER+, PR+, HER2− | Oestrogen signalling, low proliferation | Endocrine therapy (tamoxifen, aromatase inhibitors) |
| Luminal B | Luminal cells | ER+, PR±, HER2± | Oestrogen signalling + high proliferation | Endocrine therapy + chemotherapy |
| HER2-enriched | Variable | ER−, PR−, HER2+ | ERBB2/HER2 amplification → MAPK/PI3K | Anti-HER2 therapy (trastuzumab) |
| Basal-like | Basal cells | ER−, PR−, HER2− | High proliferation, DNA damage response | Chemotherapy (no targeted therapy available) |
Key transcription factors in this project's candidates:
- GATA3: Master regulator of luminal differentiation. Directly activates ER and luminal keratins. GATA3 mutations are present in ~10% of breast cancers; they cluster in the zinc finger domains and include frameshifts in the C-terminus. Its somatic dN/dS of 19.8 (99 nonsynonymous mutations, only 2 synonymous) makes it one of the most strongly selected genes on this platform.
- FOXA1: Pioneer factor that opens compacted chromatin specifically at ER binding sites. Without FOXA1, ER cannot access its target genes. Mutations reprogram which genes ER activates, potentially driving therapy resistance.
- FOXC1: Marker of basal-like subtype. Promotes epithelial-to-mesenchymal transition (EMT), increasing invasiveness.
Key Signalling Pathways in Breast Cancer
| Pathway | Key Genes | Role in Cancer | Connection to This Project |
|---|---|---|---|
| PI3K/AKT/mTOR | PIK3CA, AKT1, PTEN, mTOR | Cell growth, survival, metabolism. PIK3CA is mutated in ~36% of breast cancers. | PIK3CA has somatic dN/dS ≈ 17.6; strong positive selection |
| ER signalling | ESR1, FOXA1, GATA3 | Drives luminal gene expression, proliferation in ER+ tumours | FOXA1 and GATA3 are both candidates with extreme somatic dN/dS |
| p53 pathway | TP53, MDM2, CDKN1A | DNA damage response, apoptosis. TP53 mutated in ~37% of BRCA cases (>80% of basal-like) | TP53 has somatic dN/dS ≈ 35.9; the most selected gene |
| Cell adhesion | CDH1, CTNNA1, DSC2 | Cell-cell junctions. CDH1 loss = lobular carcinoma. DSC2 (desmosomal cadherin) is a candidate gene | DSC2 is in the candidate list (germline dN/dS = 0.197, somatic = ∞) |
| WNT signalling | FZD9, DKK4 | Embryonic development, stem cell maintenance. Aberrant activation drives cancer stem cells | FZD9 (Frizzled-9) and DKK4 are both candidates |
3.10 The Selection Paradox: Why This Project Matters
The central finding of this project can be framed as a paradox:
The genes that evolution has tried hardest to protect (low germline dN/dS) are the same genes that cancer most aggressively modifies (high somatic dN/dS).
This is not a contradiction; it reveals cancer's strategy:
- Essential genes encode essential proteins. The cell depends on them for fundamental processes: transcription regulation, signal transduction, cell adhesion, DNA repair.
- Cancer cannot simply delete these genes. If the cell loses TP53 entirely, it may die from accumulated DNA damage. If it loses CDH1, the tissue may fall apart in ways that don't benefit the tumour.
- Instead, cancer acquires specific modifications: gain-of-function mutations in TP53, activating mutations in PIK3CA, truncating mutations in GATA3 that alter (but don't destroy) its transcription factor activity.
- These are dependency genes: Cancer cells depend on the modified function of these proteins. Restoring normal function (or selectively targeting the mutant form) could specifically kill cancer cells while sparing normal tissue.
This is why the candidate gene list is not just an academic exercise: it points to potential therapeutic vulnerabilities. If a gene is both essential (conserved) and modified by cancer (somatically selected), it is a strong candidate for drug targeting.
4. Data Sources
| Data Type | Source | Details |
|---|---|---|
| RNA-seq Expression | TCGA via UCSC Xena Browser | RSEM-normalised expected counts, ~20,000 genes, 33 cancer types |
| BRCA Samples | TCGA-BRCA | 1,104 tumours + 114 solid tissue normals = 1,218 samples |
| Somatic Mutations | TCGA-BRCA WES MAF files | ~77,000 coding mutations from tumour-normal pairs |
| Ortholog Sequences | Ensembl BioMart (release 110+) | Human-mouse/rat/dog/zebrafish one-to-one orthologs |
| Protein Identity | Ensembl BioMart | Percent amino acid identity for each ortholog pair |
| CDS Alignments | Ensembl BioMart | Codon-aligned coding sequences for dN/dS calculation |
Data Processing Pipeline
Raw TCGA RNA-seq → Log₂(x+1) transform → Variance filtering (remove bottom 20%)
→ Z-score standardisation (per-gene, training set) → Stratified 80/20 split (seed=42)
→ ML training & evaluation → Gene signature extraction → Evolutionary analysis
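The preprocessing steps of this pipeline can be sketched in plain Python on a toy matrix (rows are samples, columns are genes). This is an illustration under stated assumptions, not the project's code: all function names are hypothetical, the toy counts are random, and the split is performed before the z-score statistics are computed so that the "per-gene, training set" standardisation does not see the test samples. A real implementation would more likely use NumPy and scikit-learn.

```python
import math
import random
import statistics


def log2_transform(matrix):
    """Apply log2(x + 1) to every expression value."""
    return [[math.log2(x + 1.0) for x in row] for row in matrix]


def variance_filter(matrix, drop_fraction=0.20):
    """Drop the lowest-variance genes; return (filtered matrix, kept columns)."""
    n_genes = len(matrix[0])
    variances = [statistics.pvariance([row[j] for row in matrix]) for j in range(n_genes)]
    order = sorted(range(n_genes), key=lambda j: variances[j])
    keep = sorted(order[int(n_genes * drop_fraction):])
    return [[row[j] for j in keep] for row in matrix], keep


def stratified_split(labels, test_fraction=0.20, seed=42):
    """Split sample indices so each class keeps the same train/test proportion."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = max(1, round(len(idx) * test_fraction))
        test_idx += idx[:n_test]
        train_idx += idx[n_test:]
    return sorted(train_idx), sorted(test_idx)


def zscore_fit(train_rows):
    """Per-gene mean and standard deviation, estimated on training rows only."""
    n_genes = len(train_rows[0])
    means = [statistics.fmean([r[j] for r in train_rows]) for j in range(n_genes)]
    sds = [statistics.pstdev([r[j] for r in train_rows]) or 1.0 for j in range(n_genes)]
    return means, sds


def zscore_apply(rows, means, sds):
    """Standardise rows with previously fitted per-gene statistics."""
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in rows]


# Toy data: 10 samples (5 tumour = 1, 5 normal = 0) x 5 genes of random counts.
rng = random.Random(0)
labels = [1] * 5 + [0] * 5
counts = [[rng.uniform(0, 1000) for _ in range(5)] for _ in labels]

filtered, kept = variance_filter(log2_transform(counts))
train_idx, test_idx = stratified_split(labels)
means, sds = zscore_fit([filtered[i] for i in train_idx])
x_train = zscore_apply([filtered[i] for i in train_idx], means, sds)
x_test = zscore_apply([filtered[i] for i in test_idx], means, sds)
print(len(kept), "genes kept;", len(x_train), "train /", len(x_test), "test samples")
```

Fitting the standardisation on the training rows and reusing those statistics for the test rows keeps information from the held-out samples out of model training.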
5. Page-by-Page Guide
5.1 Overview Page
URL: cancertranscriptomics.space/ (home page)
The Overview page serves as the entry point and executive summary for the entire project. It presents the core hypothesis, key results, and navigation to detailed analyses.
Stat Cards (Top Row)
Six summary statistics are displayed as coloured cards:
| Card | Value | Colour | What It Means |
|---|---|---|---|
| BRCA ROC AUC | 0.999 | Blue | The best model achieves near-perfect tumour/normal discrimination in breast cancer. An AUC of 0.999 means if you randomly pick one tumour and one normal sample, the model correctly ranks the tumour higher 99.9% of the time. |
| Multi-cancer Accuracy | 99.8% | Blue | When classifying tumour vs normal across all 33 TCGA cancer types simultaneously, the model achieves 99.8% accuracy, demonstrating that transcriptomic tumour signatures are robust across cancer types. |
| Cross-cancer AUC | 0.998 | Blue | A model trained only on breast cancer data can classify lung adenocarcinoma (LUAD) samples with AUC 0.998, which suggests that the learned expression signatures capture cancer biology shared across tissues rather than tissue-specific artefacts. |
| Signature Genes | 132 | Indigo | The union of top-50 genes from each model across all tasks yields 132 unique ML-predictive genes. These form the "gene signature": a compact set capturing most of the discriminative information. |
| Under Purifying Selection | 96.4% | Green | Of the 110 signature genes with valid germline dN/dS data, 96.4% have dN/dS < 1, i.e. they are under evolutionary constraint. Their proteins have been conserved for ~90 million years. |
| Positively Selected (Somatic) | 4,615 | Red | Across the entire genome, 4,615 genes show statistically significant positive selection in TCGA-BRCA tumours (somatic dN/dS > 1, FDR q < 0.05). |
Classification Tasks Table¶
The project trains ML models on four distinct classification tasks to test robustness and generalisability:
| Task | Training Data | Test Data | What It Tests |
|---|---|---|---|
| Single-cancer (BRCA) | BRCA tumour + normal | Held-out BRCA | Can expression distinguish breast tumour from normal? |
| Multi-cancer | All 33 TCGA tumour types + normals | Held-out mix | Is the tumour signature universal across cancer types? |
| Subtype | BRCA Luminal A + Basal-like | Held-out subtypes | Can expression distinguish molecular subtypes? |
| Cross-cancer (BRCA→LUAD) | BRCA only | Lung adenocarcinoma (LUAD) | Do breast cancer signatures transfer to other organs? |
Key Findings Section¶
Three major findings are presented:
- Germline Conservation: ML-predictive genes have significantly lower germline dN/dS (mean 0.234) compared to random background genes, indicating strong evolutionary constraint. 80% are under strong purifying selection (dN/dS < 0.3).
- Somatic Positive Selection: Known breast cancer drivers (TP53, PIK3CA, GATA3, CDH1, FOXA1) appear among the top somatically selected genes, validating the method. 163 total candidates across 5 cancer types pass all three filters using the updated thresholds (somatic dN/dS ≥ 1.5, FDR q < 0.05, CI lower bound > 1.0). [UPDATED]
- The Selection Paradox: Genes conserved for 90 million years (essential, don't-touch-these-proteins) are the same genes accumulating protein-altering mutations in cancer. Cancer selectively breaks the cell's most critical machinery.
Performance Overview Chart¶
An interactive Plotly bar chart displays ROC AUC for each model × task combination. This visualisation allows direct comparison of model architectures across classification challenges.
Navigation Cards¶
Clickable cards link to each analysis page with a brief description and distinctive icon, guiding users through the logical flow: Models → Signatures → Evolution → Results.
5.2 Models Page¶
URL: cancertranscriptomics.space/models
This page presents the three ML model architectures and their performance across all classification tasks.
Model Architecture Cards¶
Three cards describe each model in detail:
Logistic Regression (L2-regularised)¶
P(tumour) = σ(β₀ + β₁·gene₁ + β₂·gene₂ + ... + βₙ·geneₙ)
- Architecture: Linear classifier with sigmoid activation
- Regularisation: L2 penalty (C=1.0), which shrinks coefficients toward zero, preventing any single gene from dominating
- Feature importance: |βᵢ|, the absolute value of each gene's coefficient. Larger |β| = more discriminative gene
- Coefficient sign: Positive β = gene is upregulated in tumours; negative β = downregulated in tumours
- Strengths: Most interpretable model. Each gene gets exactly one number (its coefficient) telling you how much it contributes to classification and in which direction.
- Biological value: The sign of the coefficient directly tells you whether the gene is over- or under-expressed in cancer, which is immediately biologically interpretable.
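The logistic model and its coefficient-based importance ranking can be sketched in a few lines. This is a minimal illustration, not the project's training code: the gene names, coefficients and sample values are hypothetical, and the expression values are assumed to be z-scored already.

```python
import math

def tumour_probability(intercept, coefficients, expression):
    """P(tumour) = sigmoid(b0 + sum_i b_i * x_i) for one z-scored sample."""
    z = intercept + sum(coefficients[g] * expression[g] for g in coefficients)
    return 1.0 / (1.0 + math.exp(-z))

def rank_by_importance(coefficients):
    """Rank genes by |coefficient| and report the direction implied by the sign."""
    ranked = sorted(coefficients.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return [
        (gene, abs(beta), "up in tumour" if beta > 0 else "down in tumour")
        for gene, beta in ranked
    ]

# Hypothetical coefficients and one hypothetical z-scored sample.
coefs = {"GENE_A": 1.2, "GENE_B": -2.0, "GENE_C": 0.3}
sample = {"GENE_A": 0.5, "GENE_B": -1.0, "GENE_C": 0.0}

p = tumour_probability(-0.1, coefs, sample)
ranking = rank_by_importance(coefs)
```

Here GENE_B ranks first because it has the largest absolute coefficient, and its negative sign marks it as lower in tumours.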
Random Forest Classifier¶
- Architecture: Ensemble of 100 to 500 decision trees; each tree trained on a random subset of samples and genes
- Decision logic: Each tree asks binary questions ("Is gene X expression > threshold?") to partition samples. Final prediction = majority vote across all trees.
- Feature importance (Gini): Mean decrease in Gini impurity when a gene is used for splitting across all trees. Higher = gene is more useful for separating tumour from normal.
- Strengths: Captures non-linear relationships and gene-gene interactions. Robust to noise and outliers. Does not assume linear separability.
- Biological value: Can detect cases where a gene is only discriminative in combination with another gene (epistatic interactions in expression space).
Neural Network (Multi-Layer Perceptron)¶
Input(13,660) → Dense(512, ReLU, Dropout 0.3) → Dense(256, ReLU, Dropout 0.3)
→ Dense(128, ReLU, Dropout 0.3) → Dense(1, Sigmoid)
- Architecture: Deep feedforward network with three hidden layers
- Regularisation: Dropout (p=0.3) at each layer + early stopping on validation loss
- Training: Adam optimiser, binary cross-entropy loss, batch size 32
- Feature importance: Mean |W₁ᵢ|, the average absolute weight connecting each input gene to the first hidden layer. Genes with large first-layer weights receive more "neural attention."
- Strengths: Can learn complex, hierarchical representations. Captures subtle combinatorial patterns across thousands of genes simultaneously.
- Biological value: May identify complex regulatory networks where the importance of a gene depends on the expression context of many other genes.
Performance Metrics Visualisations¶
Four grouped bar charts display model performance for each metric:
ROC AUC (Receiver Operating Characteristic β Area Under Curve)¶
What it measures: The model's ability to rank tumour samples higher than normal samples across all possible classification thresholds.
- AUC = 1.0: Perfect separation; every tumour sample is scored higher than every normal sample
- AUC = 0.5: Random chance; the model is guessing
- AUC > 0.99: Exceptional discrimination; the model almost never confuses tumour and normal
Why it matters for cancer: ROC AUC is threshold-independent, meaning it evaluates the model's overall discriminative ability regardless of where you set the "call it tumour" cutoff. This is crucial in clinical settings where the optimal threshold depends on the cost of false positives vs false negatives.
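The ranking interpretation of ROC AUC can be computed directly by comparing every tumour score with every normal score. The sketch below uses hypothetical scores and labels (1 = tumour, 0 = normal); it is quadratic in sample count and meant only to illustrate the definition, not to replace a library implementation.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of (tumour, normal) pairs ranked correctly.

    Ties between a tumour score and a normal score count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one tumour and one normal sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical model scores: one tumour (0.40) is ranked below one normal (0.70).
scores = [0.95, 0.80, 0.70, 0.40, 0.30, 0.10]
labels = [1, 1, 0, 1, 0, 0]
auc = roc_auc(scores, labels)  # 8 of the 9 tumour-normal pairs are ordered correctly
```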
Results on this site:
| Task | Best Model | ROC AUC |
|---|---|---|
| Single-cancer (BRCA) | Logistic Regression | 0.999 |
| Multi-cancer | Logistic Regression | 0.9998 |
| Subtype (LumA vs Basal) | All models | 1.000 |
| Cross-cancer (BRCA→LUAD) | Logistic Regression | 0.998 |
| External Validation (BRCA) | RF & MLP | 1.000 |
| Pan-cancer (14 types) | Logistic Regression | 0.974 |
Accuracy¶
What it measures: The proportion of all predictions (tumour + normal) that are correct.
Accuracy = (True Positives + True Negatives) / Total Samples
Cancer context: High accuracy alone can be misleading if classes are imbalanced (e.g., with 90% tumour samples, a model that always says "tumour" gets 90% accuracy). That's why AUC, precision, and recall are reported alongside accuracy.
Precision¶
What it measures: Of all samples the model calls tumour, what fraction actually are tumour?
Precision = True Positives / (True Positives + False Positives)
Cancer context: High precision means the model rarely calls a normal sample "tumour" (low false positive rate). In a diagnostic context, this means fewer unnecessary biopsies or treatments.
Recall (Sensitivity)¶
What it measures: Of all actual tumour samples, what fraction does the model correctly identify?
Recall = True Positives / (True Positives + False Negatives)
Cancer context: High recall means the model rarely misses a real tumour (low false negative rate). In a screening context, this is critical: a missed cancer is far more dangerous than a false alarm.
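The three formulas above follow directly from confusion-matrix counts. The sketch below uses a hypothetical label set (1 = tumour, 0 = normal) to reproduce the imbalance example: a classifier that always answers "tumour" on a 90/10 split scores 90% accuracy without learning anything.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision and recall from binary labels (1 = tumour, 0 = normal)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        # Guard against division by zero when nothing is predicted or present.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }

# Hypothetical imbalanced cohort: 90 tumours, 10 normals.
y_true = [1] * 90 + [0] * 10
always_tumour = [1] * 100
metrics = classification_metrics(y_true, always_tumour)
```

Recall is perfect and accuracy looks respectable, yet every normal sample is misclassified, which is why the threshold-independent AUC is reported alongside these metrics.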
Complete Results Table¶
A sortable table shows all metrics for every model × task combination (18 rows total). Users can click column headers to sort by any metric.
Notes on Perfect Scores¶
The Subtype classification task achieves 100% across all metrics for all three models. This is biologically expected: Luminal A and Basal-like breast cancers have profoundly different transcriptomic profiles driven by different molecular programmes (hormone signalling vs proliferation), making them easy to separate for any competent classifier. Perfect scores here are therefore more plausibly a reflection of clean data and preprocessing than of model overfitting.
Similarly, External Validation reaching an AUC of 1.000 for RF and MLP is consistent with the learned signatures capturing cancer biology rather than training-set noise.
5.3 Signatures Page¶
URL: cancertranscriptomics.space/signatures
This page reveals which specific genes each model considers most important for classification, and how genes overlap across models and tasks.
What is a Gene Signature?¶
A gene signature is the set of genes whose expression levels most strongly contribute to a model's classification decisions. For each model and task:
- All genes are ranked by importance (model-specific metric)
- The top 50 genes form that model's "signature" for that task
- The union across all models and tasks gives the complete signature set
The project identifies 132 unique signature genes across all models and tasks.
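The top-50-then-union procedure described above can be sketched as follows. The importance tables are hypothetical and tiny (with k=2 instead of 50) so the result is easy to check by hand; the `(task, model)` key layout is an assumption made for illustration.

```python
def top_k_genes(importances, k=50):
    """Return the k genes with the highest importance score."""
    ranked = sorted(importances, key=lambda gene: importances[gene], reverse=True)
    return set(ranked[:k])

def signature_union(importance_tables, k=50):
    """Union of per-model, per-task top-k gene sets.

    importance_tables maps (task, model) -> {gene: importance}.
    """
    union = set()
    for table in importance_tables.values():
        union |= top_k_genes(table, k)
    return union

# Hypothetical importance scores for two models on one task.
tables = {
    ("BRCA", "LR"): {"G1": 0.9, "G2": 0.5, "G3": 0.1},
    ("BRCA", "RF"): {"G2": 0.8, "G4": 0.7, "G1": 0.2},
}
signature = signature_union(tables, k=2)
```

Because the sets overlap (G2 is in both top-2 lists), the union is smaller than the sum of the individual signatures, which is how three models across several tasks can yield 132 genes rather than several hundred.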
Task Selector¶
A dropdown allows switching between classification tasks:
- BRCA: Single-cancer tumour vs normal
- Unified: Combined importance across tasks
- Multi-cancer (Pan-cancer): Pan-cancer discrimination
- Subtype (Luminal A vs Basal-like): Luminal A vs Basal-like
- Cross-cancer (BRCA → LUAD): BRCA-trained, LUAD-tested
Model Selector¶
A toggle group switches between feature importance methods:
- RF: Random Forest Gini importance
- MLP: Neural network first-layer weights
- LR: Logistic Regression absolute coefficients
Feature Importance Bar Chart¶
An interactive horizontal bar chart shows the top 30 genes ranked by importance for the selected task and model. Each bar's length represents the gene's importance score.
How to read it:
- Longer bars = more important for classification
- Genes at the top are the most discriminative
- Hover over any bar to see the exact importance value
- Compare across models (using the model selector) to see which genes are consistently important

Notable genes that frequently appear:
- MMP11 (Matrix Metalloproteinase 11): Extracellular matrix degradation enzyme upregulated in invasive cancers. Important across multiple tasks.
- FAM13A: GTPase-activating protein involved in metabolic regulation; frequently altered in cancers.
- PPP1R12B: Protein phosphatase regulatory subunit involved in smooth muscle contraction and cytoskeletal regulation.
- TMEM220: Transmembrane protein implicated as a tumour suppressor in digestive tract cancers.
Gene Ranking Table¶
A complete sortable table listing all genes in the selected signature with columns:
- Rank: Position in the importance ranking
- Gene: Official gene symbol (HGNC)
- Importance: The model-specific importance score
Gene Categories Pie Chart¶
An interactive pie chart shows how signature genes distribute across functional categories:
| Category | Count | Definition | Biological Meaning |
|---|---|---|---|
| Shared Multi-cancer | 18 | Important in both BRCA-specific and pan-cancer tasks | These genes participate in universal cancer biology: processes dysregulated across many cancer types (proliferation, apoptosis evasion) |
| Subtype-specific | 50 | Important primarily for distinguishing Luminal A vs Basal-like | These genes reflect hormone receptor signalling (ESR1, FOXA1) and proliferation programmes that differ between subtypes |
| Cross-cancer Stable | 18 | Genes whose importance transfers from BRCA to LUAD | Tissue-independent cancer markers; these genes reflect shared tumourigenic mechanisms |
| Multi-cancer Only | 32 | Important only in multi-cancer classification | Pan-cancer specific signals |
| Subtype Only | 50 | Important only in subtype classification | Subtype-specific biology (overlaps heavily with subtype_specific) |
| Cross-cancer Only | 32 | Important only in cross-cancer task | Cross-tissue transferable signals |
5.4 Evolution Page¶
URL: cancertranscriptomics.space/evolution
This is the most scientifically rich page on the website. It presents the evolutionary analysis of ML-identified signature genes across two complementary dimensions: germline conservation and somatic selection.
The page is divided into two columns reflecting the two evolutionary timescales.
Left Column: Germline Conservation (Green Theme)¶
Stat Cards¶
| Stat | Value | Meaning |
|---|---|---|
| Predictive Genes Analysed | 110 of 132 (those with valid dN/dS) | Number of ML-signature genes for which human-mouse ortholog dN/dS could be computed |
| Mean Germline dN/dS (Predictive) | 0.234 | Average evolutionary rate, well below 1.0, indicating pervasive purifying selection |
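| Mean Germline dN/dS (Background) | 0.203 | Random genome-wide genes, also under purifying selection. Note that this mean is slightly below the predictive-gene mean (0.234), so the two means alone do not show stronger constraint in predictive genes |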
| % Under Purifying Selection | 96.4% | Fraction of signature genes with dN/dS < 1.0 |
| % Under Strong Purifying Selection | 80.0% | Fraction with dN/dS < 0.3, encoding highly constrained, essential proteins |
dN/dS Distribution Violin Plot¶
This visualisation shows the distribution of germline dN/dS values for ML-predictive genes vs random background genes.
How to read it:
- The x-axis shows the dN/dS value (lower = more conserved)
- Each "violin" shows the density of genes at each dN/dS value
- The wider the violin at a particular dN/dS value, the more genes have that value
- The median line shows the central tendency
Key observation: The predictive-gene violin is shifted left (toward lower dN/dS) compared to background, meaning ML-identified genes tend to be more conserved than random genes. The bulk of predictive genes cluster below dN/dS = 0.3, indicating strong evolutionary constraint.
Multi-Species Protein Identity Bar Chart¶
This chart compares average protein sequence identity (%) between ML-predictive genes and background genes across four species at different evolutionary distances:
| Species | Divergence Time | Predictive Mean %ID | Background Mean %ID |
|---|---|---|---|
| Mouse | ~90 MYA | 81.8% | 82.4% |
| Rat | ~90 MYA | 81.4% | 81.8% |
| Dog | ~96 MYA | 83.9% | 84.4% |
| Zebrafish | ~435 MYA | 53.0% | 57.4% |
Biological interpretation: Higher protein identity means the protein sequence has been more conserved across the species split. The comparison between predictive and background genes at each evolutionary distance reveals whether ML-identified genes are under stronger constraint.
Subtype-specific genes show the highest protein identity across mouse (86.1%), rat (83.9%), and dog (86.6%), suggesting that the transcription factors and signalling molecules distinguishing breast cancer subtypes are among the most ancient and conserved proteins in the mammalian genome.
Germline Gene Table¶
A sortable table listing every ML-predictive gene with its germline evolutionary data:
| Column | Description |
|---|---|
| Gene | Official gene symbol |
| Category | ML category (shared_multicancer, subtype_specific, etc.) |
| Mouse %ID | Protein sequence identity with mouse ortholog |
| Rat %ID | Protein sequence identity with rat ortholog |
| Dog %ID | Protein sequence identity with dog ortholog |
| Zebrafish %ID | Protein sequence identity with zebrafish ortholog |
| dN/dS | Germline dN/dS ratio (human vs mouse) |
| Selection | Classification badge: "Strongly Purifying" (< 0.3), "Purifying" (0.3 to 1.0), "Positive/Neutral" (> 1.0) |
Right Column: Somatic Selection (Red Theme)¶
Stat Cards¶
| Stat | Value | Meaning |
|---|---|---|
| Genes Tested | 13,208 | Total genes with at least 2 coding somatic mutations in TCGA-BRCA |
| Somatic dN/dS > 1 | 8,207 | Genes showing more nonsynonymous mutations than expected (potential positive selection) |
| Significant (FDR < 0.05) | 4,615 | Genes passing multiple-testing correction at 5% false discovery rate |
| Significant (FDR < 0.10) | 4,697 | A slightly more permissive threshold |
Germline vs Somatic Scatter Plot (KEY VISUALISATION)¶
This is arguably the most important chart on the entire website. It plots every gene with both germline and somatic dN/dS data in a 2D space:
- X-axis: Germline dN/dS (human-mouse evolutionary conservation)
- Y-axis: Somatic dN/dS (positive selection in TCGA-BRCA tumours), capped at 25 for visualisation (genes with infinite somatic dN/dS are plotted at y=25)
- Colour: Orange = ML-predictive gene, Blue = background gene
How to read the quadrants:
| | Low germline dN/dS (left) | High germline dN/dS (right) |
|---|---|---|
| High somatic dN/dS (top, y > 1) | CONSERVED AND SELECTED IN CANCER (key quadrant) | Not conserved, selected in cancer (rare) |
| Low somatic dN/dS (bottom, y ≈ 1) | Conserved, not selected in cancer (housekeeping) | Not conserved, not selected (neutral genes) |
- Bottom-left (low germline, low somatic): Genes that are conserved and not positively selected in cancer; essential housekeeping genes that cancer leaves alone.
- Top-left (low germline, high somatic): THE MOST INTERESTING QUADRANT. Genes conserved for 90 million years but positively selected in tumours. These are the cancer-dependency candidates.
- Bottom-right (high germline, low somatic): Genes under little evolutionary constraint and not selected in cancer; likely neutral or tissue-specific.
- Top-right (high germline, high somatic): Genes under little germline constraint that nevertheless show elevated somatic dN/dS; rapidly evolving genes, a rare combination.
What to look for: ML-predictive genes (orange) that appear in the upper-left region, i.e. conserved AND somatically selected. These are the strongest candidate cancer dependencies.
Somatic dN/dS Distribution Histogram¶
A histogram showing the distribution of somatic dN/dS values across all tested genes.
Key features:
- A large peak near dN/dS ≈ 1 (neutral; most genes are passengers)
- A long right tail of genes with dN/dS >> 1 (drivers)
- Known drivers like TP53 (dN/dS ≈ 35.9) and PIK3CA (dN/dS ≈ 17.6) appear in the extreme right tail
Top Somatically Selected Genes Table¶
A sortable table showing genes with the highest somatic dN/dS, with columns:
| Column | Description |
|---|---|
| Gene | Gene symbol |
| Nonsynonymous (N) | Count of protein-altering somatic mutations across all TCGA-BRCA samples |
| Synonymous (S) | Count of silent somatic mutations |
| Somatic dN/dS | The ratio (displayed as ∞ when S=0) |
| 95% CI | Confidence interval for the dN/dS estimate |
| FDR q-value | Benjamini-Hochberg corrected p-value |
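The FDR q-value column relies on the Benjamini-Hochberg procedure. A minimal sketch of the standard step-up adjustment is shown below; the p-values are hypothetical, and the site's own pipeline may use a library routine rather than this hand-written version.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values (q-values), in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so q-values stay monotone in p.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        q_values[idx] = running_min
    return q_values

# Hypothetical per-gene p-values.
p = [0.01, 0.04, 0.03, 0.20]
q = benjamini_hochberg(p)
significant = [i for i, value in enumerate(q) if value < 0.05]
```

In this toy example only the first gene stays below q = 0.05 after correction, even though three raw p-values are below 0.05.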
Top known drivers in TCGA-BRCA:
| Gene | N | S | Somatic dN/dS | Role |
|---|---|---|---|---|
| TP53 | ~500+ | ~14 | ~35.9 | Tumour suppressor; disables apoptosis and DNA damage checkpoints. Most mutated gene in human cancer. |
| PIK3CA | ~350+ | ~8 | ~17.6 | Oncogene; activating mutations in the PI3K signalling pathway drive cell growth and survival. |
| GATA3 | 99 | 2 | 19.8 | Transcription factor for luminal breast differentiation; mutations alter luminal gene programmes. |
| CDH1 | ~80+ | ~3 | ~14.0 | E-cadherin; loss drives invasive lobular carcinoma through disrupted cell-cell adhesion. |
| FOXA1 | 34 | 1 | 13.6 | Pioneer transcription factor; opens chromatin for oestrogen receptor binding. Mutations alter ER-driven transcription. |
Hypothesis Assessment¶
Three coloured boxes present the formal hypothesis tests:
- H1 (Germline Conservation): Tests whether ML-predictive genes have lower germline dN/dS than background. Assessed via permutation test (10,000 permutations) and Mann-Whitney U test.
- H2 (Somatic Selection): Tests whether signature genes are enriched for somatic dN/dS > 1. Assessed via binomial enrichment.
- H3 (Dual Pressure): Tests whether genes under both germline constraint AND somatic positive selection are more likely to be ML-predictive. This is the integrative hypothesis.
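For H1, a one-sided permutation test on the difference in mean dN/dS can be sketched as below. The dN/dS values, the choice of the mean as test statistic, and the seed are assumptions made for illustration; the text above specifies only that 10,000 permutations are used.

```python
import random

def permutation_test_lower_mean(group, background, n_perm=10_000, seed=0):
    """One-sided permutation p-value for mean(group) < mean(background)."""
    rng = random.Random(seed)
    observed = sum(group) / len(group) - sum(background) / len(background)
    pooled = list(group) + list(background)
    k = len(group)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random relabelling of predictive vs background
        diff = sum(pooled[:k]) / k - sum(pooled[k:]) / (len(pooled) - k)
        if diff <= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_perm + 1)

# Hypothetical dN/dS values: predictive genes clearly more conserved.
predictive = [0.05, 0.08, 0.10, 0.12, 0.15]
background = [0.30, 0.35, 0.40, 0.45, 0.50, 0.55]
p_value = permutation_test_lower_mean(predictive, background, n_perm=2000)
```

A small p-value means that random relabelling rarely produces a gap as large as the observed one.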
dN/dS Educational Box¶
An expandable accordion explains dN/dS for non-specialists with examples, analogies, and interpretation guidelines.
5.5 Results Page¶
URL: cancertranscriptomics.space/results
The Results page integrates all analyses into a final prioritised list of candidate cancer-dependency genes. It represents the culmination of the entire pipeline.
Three-Step Pipeline Visualisation¶
Three connected cards illustrate the filtering funnel:
| Step | Description |
|---|---|
| Step 1: ML Signature | 132 genes identified by ML as predictive of tumour state |
| Step 2: Germline Filter | Genes with dN/dS < 0.3 (strong purifying selection) = deeply conserved proteins |
| Step 3: Somatic Filter | Genes with somatic dN/dS > 1 AND FDR q < 0.05 = positively selected in tumours |
Summary Stat Cards¶
| Card | Value | Colour | Interpretation |
|---|---|---|---|
| Genes Tested | ~13,000+ | Grey | Total genes in the master annotated table |
| ML Signature | 132 | Blue | Genes identified by ML as discriminative |
| Conserved (dN/dS < 0.3) | 88 | Green | ML genes under strong germline purifying selection |
| Somatically Selected | Variable | Red | ML genes with somatic dN/dS > 1 AND FDR < 0.05 |
| Final Candidates | 25 | Purple | Genes passing ALL three filters simultaneously |
| % Purifying | 96.4% | Green | Fraction of testable signature genes under purifying selection |
Filtering Funnel¶
A visual funnel diagram shows how the gene count reduces at each filtering step:
- Start: ~20,000 genes in genome
- After ML: 132 signature genes
- After germline filter: 88 with dN/dS < 0.3
- After somatic filter: 163 total candidates (across 5 cancer types; 15 cross-cancer validated) [UPDATED]
Candidates in Germline vs Somatic Space (Scatter Plot)¶
This scatter plot shows the final candidates for each cancer type. It highlights where each candidate falls in the germline-conservation × somatic-selection space. Updated threshold (April 2026): somatic dN/dS ≥ 1.5 with 95% CI lower bound > 1.0 and FDR q < 0.05. [UPDATED]
- X-axis: Germline dN/dS (all candidates have values < 0.3 by definition)
- Y-axis: Somatic dN/dS (all candidates have values ≥ 1.5 by definition; ∞ values plotted at 25) [UPDATED]
- Hover: Gene name, exact values, mutation counts, ML category
Key observation: Candidates cluster in the extreme upper-left corner: very low germline dN/dS (highly conserved) combined with very high or infinite somatic dN/dS (strongly positively selected). The stricter threshold (dN/dS ≥ 1.5 with CI lower > 1.0) is intended to let only high-confidence candidates pass through.
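The updated candidate filter can be expressed as a single predicate. In the sketch below the thresholds are the ones quoted above, while the record field names (`germline_dnds`, `somatic_dnds`, `fdr_q`, `ci_lower`) and the example genes are hypothetical.

```python
import math

def is_candidate(gene, germline_max=0.3, somatic_min=1.5,
                 fdr_max=0.05, ci_lower_min=1.0):
    """True if a gene passes the germline, somatic, FDR and CI filters."""
    return (
        gene["germline_dnds"] < germline_max
        and gene["somatic_dnds"] >= somatic_min
        and gene["fdr_q"] < fdr_max
        and gene["ci_lower"] > ci_lower_min
    )

# Hypothetical records; math.inf stands for a gene with zero synonymous mutations.
genes = [
    {"symbol": "GENE_A", "germline_dnds": 0.05, "somatic_dnds": 12.0,
     "fdr_q": 0.001, "ci_lower": 4.0},   # passes every filter
    {"symbol": "GENE_B", "germline_dnds": 0.05, "somatic_dnds": 1.2,
     "fdr_q": 0.001, "ci_lower": 1.1},   # somatic dN/dS below 1.5
    {"symbol": "GENE_C", "germline_dnds": 0.60, "somatic_dnds": 9.0,
     "fdr_q": 0.001, "ci_lower": 3.0},   # not strongly conserved
    {"symbol": "GENE_D", "germline_dnds": 0.10, "somatic_dnds": math.inf,
     "fdr_q": 0.20, "ci_lower": 0.8},    # fails FDR and CI lower bound
]
candidates = [g["symbol"] for g in genes if is_candidate(g)]
```

GENE_D shows why an infinite ratio alone is not enough: with too few mutations it fails the FDR and confidence-interval checks.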
Biological Interpretation Section¶
A text section explaining what the candidate genes mean biologically:
- These genes encode proteins essential for normal cellular function (evidenced by conservation)
- Cancer cannot simply delete these genes; it needs their function
- Instead, cancer modifies them through specific protein-altering mutations
- This makes them potential therapeutic targets: drugs that restore normal protein function could selectively harm cancer cells
Candidate Gene Table¶
A sortable multi-column table listing all candidates for the selected cancer type:
| Column | Description | How to Interpret |
|---|---|---|
| Gene | Official HGNC symbol | Clickable β opens detail panel |
| ML Category | Which classification task(s) identified this gene | Multi-category genes (e.g., "subtype_specific; Subtype") are more robust |
| Germline dN/dS | Human-mouse dN/dS ratio | Lower = more conserved. All candidates < 0.3 |
| Mouse %ID | Protein identity with mouse ortholog | Higher = more conserved protein structure |
| Somatic dN/dS | Somatic selection ratio (∞ if n_syn=0) | Higher = stronger positive selection in tumours |
| Reliability | Evidence strength badge | "Strong" (≥10 nonsyn + FDR<0.05), "Moderate" (5 to 9 nonsyn or intermediate FDR), "Weak" (<5 nonsyn or FDR≥0.05) |
| Nonsyn | Count of nonsynonymous somatic mutations in TCGA-BRCA | More mutations = more evidence (but commonly mutated genes also tend to accumulate more) |
| Syn | Count of synonymous somatic mutations | Zero synonymous → infinite dN/dS. Low counts increase uncertainty. |
| FDR q | Benjamini-Hochberg corrected p-value | < 0.05 = statistically significant after multiple-testing correction |
| Priority Score | Composite ranking score | Higher = stronger candidate across all evidence dimensions |
Reliability Categories Explained¶
The reliability badge reflects confidence in the somatic dN/dS estimate:
| Badge | Criteria | Meaning |
|---|---|---|
| 🟢 Strong | ≥10 nonsynonymous mutations AND FDR q < 0.05 | High confidence: enough mutations for a reliable ratio estimate, statistically significant |
| 🟡 Moderate | 5 to 9 nonsynonymous mutations, or borderline FDR | Reasonable evidence but wider confidence intervals |
| 🔴 Weak | <5 nonsynonymous mutations OR FDR q ≥ 0.05 | Low mutation count means the dN/dS estimate is highly uncertain. Infinite dN/dS with only 2 nonsynonymous mutations is suggestive but not conclusive. |
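The badge rules in the table can be sketched as a small function. One simplification is assumed here: the table does not define "borderline FDR", so this version assigns Moderate only through the 5 to 9 mutation range and treats any q ≥ 0.05 as Weak.

```python
def reliability_badge(n_nonsyn, fdr_q, fdr_threshold=0.05):
    """Assign a reliability badge from mutation count and FDR q-value.

    Assumption: "borderline FDR" is not modelled because the table gives
    no numeric definition for it.
    """
    if n_nonsyn < 5 or fdr_q >= fdr_threshold:
        return "Weak"
    if n_nonsyn >= 10:
        return "Strong"
    return "Moderate"

# Hypothetical inputs covering each branch.
badges = [
    reliability_badge(99, 1e-6),  # many mutations, significant
    reliability_badge(7, 0.01),   # 5 to 9 mutations, significant
    reliability_badge(2, 0.01),   # too few mutations
    reliability_badge(34, 0.20),  # enough mutations, not significant
]
```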
Priority Score¶
The priority score is a composite ranking that combines multiple evidence dimensions into a single number for prioritisation. It is computed by rank-normalising each component to [0, 1] and summing:
- Germline conservation: 1 / (germline dN/dS) (higher score for lower dN/dS = more conserved)
- Somatic selection: Somatic dN/dS value (higher score for higher selection)
- Expression change: |log₂ fold-change| tumour vs normal (if available)
- DepMap dependency: Negative mean CRISPR effect score (if available)
Higher priority score = stronger candidate across all evidence types.
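Rank-normalising and summing can be sketched as follows. Only the two always-available components (germline conservation and somatic selection) are included, ties are given their average rank, and the input values are hypothetical; the exact tie handling and weighting used on the site are not specified above, so the scores here will not reproduce the Priority column.

```python
import math

def rank_normalise(values):
    """Map values to [0, 1] by rank; tied values share their average rank."""
    n = len(values)
    if n == 1:
        return [1.0]
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2.0
        for pos in range(i, j + 1):
            ranks[order[pos]] = average_rank
        i = j + 1
    return [r / (n - 1) for r in ranks]

def priority_scores(germline_dnds, somatic_dnds):
    """Sum of rank-normalised conservation (1 / dN/dS) and somatic selection.

    Germline dN/dS values must be non-zero.
    """
    conservation = rank_normalise([1.0 / g for g in germline_dnds])
    selection = rank_normalise(somatic_dnds)
    return [c + s for c, s in zip(conservation, selection)]

# Hypothetical genes: two with infinite somatic dN/dS (no synonymous mutations).
scores = priority_scores([0.004, 0.032, 0.254], [math.inf, 19.8, math.inf])
```

Rank normalisation is what lets infinite somatic dN/dS values take part in the sum: they simply share the top rank instead of dominating the score.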
Gene Detail Panel¶
Clicking any gene name in the table opens a slide-over panel showing:
- Gene symbol and full name
- ML category membership
- All evolutionary metrics (germline dN/dS, protein identity, somatic dN/dS)
- Mutation counts and statistical significance
- A plain-English biological interpretation generated for that gene
- Links to external databases (NCBI Gene, UniProt, COSMIC)
⚠️ Archive: Previous Pipeline Results (Pre-April 2026)
The following table shows results from an earlier pipeline version using different thresholds (dN/dS > 1.0, raw p-value < 0.05). These have been superseded by the current results above. Retained for reference only.
The 25 Candidate Genes¶
Here is the complete candidate list with key metrics:
| Gene | Germline dN/dS | Somatic dN/dS | N | S | Priority | ML Category |
|---|---|---|---|---|---|---|
| POU3F3 | 0.004 | ∞ | 2 | 0 | 1.56 | Multi-cancer |
| SPP1 | 0.028 | ∞ | 4 | 0 | 1.52 | Multi-cancer |
| FZD9 | 0.032 | ∞ | 4 | 0 | 1.44 | subtype_specific |
| CCDC64 | 0.061 | ∞ | 2 | 0 | 1.36 | Multi-cancer |
| TFDP2 | 0.066 | ∞ | 2 | 0 | 1.32 | subtype_specific |
| MSX2 | 0.080 | ∞ | 3 | 0 | 1.28 | subtype_specific |
| SPDEF | 0.080 | ∞ | 4 | 0 | 1.24 | subtype_specific |
| PAMR1 | 0.080 | ∞ | 7 | 0 | 1.20 | Multi-cancer |
| FOXC1 | 0.085 | ∞ | 3 | 0 | 1.16 | subtype_specific |
| ZMYND10 | 0.096 | ∞ | 4 | 0 | 1.12 | subtype_specific |
| THSD4 | 0.113 | ∞ | 9 | 0 | 1.08 | subtype_specific |
| SERPINE2 | 0.134 | ∞ | 2 | 0 | 1.04 | Cross-cancer |
| GATA3 | 0.032 | 19.8 | 99 | 2 | 1.00 | subtype_specific |
| ILDR2 | 0.138 | ∞ | 3 | 0 | 1.00 | Multi-cancer |
| SCG2 | 0.151 | ∞ | 3 | 0 | 0.96 | Cross-cancer |
| CILP2 | 0.170 | ∞ | 2 | 0 | 0.92 | shared_multicancer |
| MFAP5 | 0.180 | ∞ | 4 | 0 | 0.88 | shared_multicancer |
| FOXA1 | 0.059 | 13.6 | 34 | 1 | 0.88 | subtype_specific |
| DKK4 | 0.184 | ∞ | 2 | 0 | 0.84 | Cross-cancer |
| MYOC | 0.187 | ∞ | 3 | 0 | 0.80 | Multi-cancer |
| DSC2 | 0.197 | ∞ | 7 | 0 | 0.76 | subtype_specific |
| AADACL2 | 0.204 | ∞ | 2 | 0 | 0.72 | shared_multicancer |
| B3GNT5 | 0.232 | ∞ | 3 | 0 | 0.68 | subtype_specific |
| F2RL3 | 0.244 | ∞ | 2 | 0 | 0.64 | Cross-cancer |
| MMP27 | 0.254 | ∞ | 2 | 0 | 0.60 | Multi-cancer |
Caveats and Limitations Section¶
The Results page includes important caveats:
- Infinite somatic dN/dS: Most candidates have ∞ somatic dN/dS (zero synonymous mutations). While this is consistent with positive selection, it could also reflect low mutation counts. The "Reliability" column flags this uncertainty.
- Correlation ≠ Causation: ML identifies genes whose expression correlates with tumour state. This does not prove they cause cancer. Evolutionary analysis adds evidence but experimental validation is needed.
- BRCA-specific: Somatic analysis is specific to TCGA-BRCA. Candidates may not generalise to other cancer types.
- Bulk RNA-seq limitations: Measures average expression across all cells in a sample. Cannot resolve tumour heterogeneity or cell-type-specific effects.
Next Steps Section¶
Suggests future validation approaches:
- CRISPR functional screens (DepMap integration)
- Single-cell RNA-seq to resolve cell-type specificity
- Pan-cancer somatic dN/dS to test cross-cancer generality
- Drug target assessment via protein structure analysis
5.6 How It Works Page¶
URL: cancertranscriptomics.space/how-it-works
This page provides an interactive mind map visualising the entire analysis pipeline using D3.js force-directed graph layout.
Mind Map Structure¶
The central node connects to seven colour-coded branches:
| Branch | Colour | Content |
|---|---|---|
| 📥 Data Acquisition | Grey (#64748b) | RNA-seq expression, somatic mutations, ortholog mappings, clinical metadata |
| Preprocessing | Blue (#3b82f6) | Log₂(x+1) transform, variance filtering, Z-score standardisation, train/test split |
| 🤖 ML Classification | Blue (#3b82f6) | Logistic Regression, Random Forest, Neural Network training and evaluation |
| Gene Signatures | Purple (#6366f1) | Feature importance extraction, top-50 selection, cross-model overlap, gene categories |
| 🌿 Germline Conservation | Green (#10b981) | dN/dS calculation, Nei-Gojobori method, multi-species protein identity |
| 🔴 Somatic Selection | Red (#ef4444) | Per-gene somatic dN/dS, binomial test, FDR correction, driver detection |
| 🔬 Joint Analysis | Purple (#6366f1) | Germline × somatic scatter, candidate prioritisation, hypothesis testing |
Interactive Features¶
- Click any node to see a detail panel with description and bullet points
- Drag nodes to reposition them
- Zoom and pan to explore the map
- Expand/collapse child nodes with +/- indicators
- Colour-coded links highlight connections when a node is selected
Each node's detail panel provides educational content about that pipeline step, including formulas, parameter choices, and biological rationale.
5.7 Methods Page¶
URL: cancertranscriptomics.space/methods
The Methods page provides a comprehensive technical reference for all computational methods, organised as expandable accordion sections.
Section 1: Biological Rationale¶
Explains why each analysis is performed and how the three pillars (ML, germline, somatic) complement each other.
Section 2: Data Sources¶
Details every dataset used: TCGA RNA-seq, somatic mutation MAF files, Ensembl ortholog data, UCSC Xena browser.
Section 3: Preprocessing¶
Technical details on each preprocessing step:
- Log₂(x+1) transform: Compresses dynamic range from 0 to 100,000+ down to roughly 0 to 17. The pseudocount (+1) avoids the undefined log(0). Stabilises variance so high-expression genes don't dominate.
- Variance filtering: Removes the bottom 20% of genes by variance. These genes show almost no variation across samples and carry little discriminative signal. Typically reduces features from ~20,000 to ~13,660.
- Z-score standardisation: Per-gene: z = (x − μ) / σ using training-set statistics only (prevents data leakage). After standardisation, every gene has mean 0 and σ = 1 in the training set, ensuring no gene dominates by absolute expression level. Critical for L2-regularised LR.
- Stratified train/test split: 80/20 split preserving the tumour:normal class ratio. Fixed random seed (42) for reproducibility.
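The first three preprocessing steps can be sketched with the standard library alone. This is an illustration on a hypothetical 4-sample, 5-gene count matrix (rows are samples, columns are genes), not the project's pipeline code; the train/test split is assumed to have been made by the caller, and the population variance is used for both filtering and scaling.

```python
import math
import statistics

def log2_transform(matrix):
    """Apply log2(x + 1) to every expression value."""
    return [[math.log2(x + 1.0) for x in row] for row in matrix]

def variance_filter(matrix, gene_names, drop_fraction=0.2):
    """Drop the lowest-variance fraction of genes (columns)."""
    n_genes = len(gene_names)
    variances = [statistics.pvariance([row[j] for row in matrix])
                 for j in range(n_genes)]
    n_drop = int(n_genes * drop_fraction)
    by_variance = sorted(range(n_genes), key=lambda j: variances[j])
    keep = sorted(by_variance[n_drop:])  # restore the original column order
    return [[row[j] for j in keep] for row in matrix], [gene_names[j] for j in keep]

def fit_zscore(train):
    """Per-gene mean and standard deviation from the training rows only."""
    n_genes = len(train[0])
    means = [statistics.fmean(row[j] for row in train) for j in range(n_genes)]
    # Fall back to 1.0 so a constant gene does not cause division by zero.
    stds = [statistics.pstdev([row[j] for row in train]) or 1.0
            for j in range(n_genes)]
    return means, stds

def apply_zscore(matrix, means, stds):
    """Standardise any matrix with previously fitted training statistics."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in matrix]

# Hypothetical counts; G2 is constant and should be removed by the filter.
counts = [
    [0, 10, 1000, 5, 200],
    [3, 10, 10, 7, 50],
    [15, 10, 100, 6, 400],
    [63, 10, 1, 9, 20],
]
gene_names = ["G1", "G2", "G3", "G4", "G5"]

logged = log2_transform(counts)
filtered, kept_genes = variance_filter(logged, gene_names, drop_fraction=0.2)
train, test = filtered[:3], filtered[3:]  # stand-in for the stratified split
means, stds = fit_zscore(train)
train_z = apply_zscore(train, means, stds)
test_z = apply_zscore(test, means, stds)  # test data reuses training statistics
```

Fitting the mean and standard deviation on `train` and reusing them for `test` is the step that prevents information from the held-out samples leaking into the model.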
Section 4: ML Models¶
Full hyperparameters and architecture for each model:
- LR: solver=lbfgs, C=1.0 (inverse of the L2 regularisation strength), max_iter=1000
- RF: n_estimators=500, max_features=sqrt, min_samples_leaf=5, class_weight=balanced
- MLP: layers=[512,256,128], activation=ReLU, dropout=0.3, optimiser=Adam(lr=0.001), epochs=100 with early stopping (patience=10)
Section 5: Evaluation Metrics¶
Formal definitions of accuracy, precision, recall, ROC AUC, and F1-score with their mathematical formulas.
Section 6: Gene Signature Extraction¶
How signatures are derived from each model:
- LR: Rank by |coefficient|, top 50
- RF: Rank by Gini importance (mean decrease in impurity), top 50
- MLP: Rank by mean |first-layer weights|, top 50
- Union across models: ~100–200 unique genes → final set of 132
Section 7: Germline Evolutionary Analysis¶
Full Nei-Gojobori method with Jukes-Cantor correction (see Section 7 below for complete mathematical description).
Section 8: Somatic dN/dS Analysis¶
Complete method for per-gene somatic dN/dS:
- Input: TCGA-BRCA MAF files (filtered to primary tumours, coding variants only)
- Classification: 9 nonsynonymous variant types + synonymous (Silent)
- Formula, binomial test, delta method CI, BH-FDR correction
- Minimum 2 mutations per gene required
Section 9: Joint Germline-Somatic Analysis¶
How germline and somatic data are merged, visualised, and interpreted together.
Section 10: Classification Tasks Table¶
Complete description of all tasks with training/test set sizes.
Section 11: Limitations & Future Directions¶
Comprehensive list of current limitations and planned improvements.
6. Key Metrics & Statistics Explained¶
This section provides a reference for every numerical metric displayed on the website.
6.1 Machine Learning Metrics¶
ROC AUC (Area Under the Receiver Operating Characteristic Curve)¶
The ROC curve plots True Positive Rate (sensitivity/recall) against False Positive Rate (1 − specificity) at every possible classification threshold. AUC summarises this curve as a single number.
- AUC = 1.0: Perfect classifier, with 100% sensitivity at 0% false positive rate for some threshold
- AUC = 0.5: Random classifier (coin flip)
- AUC interpretation: The probability that a randomly chosen tumour sample is scored higher than a randomly chosen normal sample
Why AUC is the primary metric here: Unlike accuracy, AUC is insensitive to class imbalance (there are ~10× more tumour than normal samples in BRCA). It evaluates the model's ability to rank samples, not just classify them at a fixed threshold.
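The ranking interpretation can be turned directly into code. The sketch below computes AUC by comparing every tumour/normal pair, assuming labels are coded 1 (tumour) and 0 (normal) and that tied scores count as half a win. The function name is illustrative, and the quadratic loop is meant for clarity rather than speed.

```python
def roc_auc(scores, labels):
    """AUC as P(random positive scores higher than random negative); ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one sample of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

Because only the ordering of scores matters, changing the number of normal samples does not change the value as long as their score distribution stays the same.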
Accuracy¶
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Overall fraction of correct predictions. Simple but can be misleading with imbalanced classes.
Precision (Positive Predictive Value)¶
Precision = TP / (TP + FP)
"When the model says tumour, how often is it right?" High precision = few false alarms.
Recall (Sensitivity, True Positive Rate)¶
Recall = TP / (TP + FN)
"Of all real tumours, what fraction does the model catch?" High recall = few missed cancers.
F1-Score¶
F1 = 2 × (Precision × Recall) / (Precision + Recall)
Harmonic mean of precision and recall. Balances the trade-off between the two.
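The four threshold-based metrics above follow directly from the confusion-matrix counts. A minimal sketch, with a hypothetical function name and the assumption that an undefined ratio (zero denominator) is reported as 0.0:

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```

For example, a model that labels every one of 100 tumour and 10 normal samples as tumour (tp=100, fp=10, tn=0, fn=0) reaches about 91% accuracy while never recognising a normal sample, which is why accuracy alone can mislead on imbalanced data.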
6.2 Feature Importance Metrics¶
| Model | Importance Metric | Formula | Interpretation |
|---|---|---|---|
| Logistic Regression | Absolute coefficient | \|βᵢ\| | Direct linear contribution to the log-odds of tumour. The sign of βᵢ indicates direction (up/down in tumour). |
| Random Forest | Gini importance | Mean decrease in Gini impurity across all trees when gene i is used for splitting | How much a gene reduces classification uncertainty when used in a decision. |
| MLP | First-layer weight magnitude | mean(\|W₁ᵢ\|) | How much "attention" the network's first layer pays to each gene. Genes with large weights have the most influence on learned representations. |
6.3 Evolutionary Metrics¶
Germline dN/dS¶
ω = dN / dS
where:
dN = nonsynonymous divergence (corrected by Jukes-Cantor)
dS = synonymous divergence (corrected by Jukes-Cantor)
See Section 7 for complete derivation.
Somatic dN/dS¶
ω_somatic = (N / S) / ns_ratio
where:
N = observed nonsynonymous mutations in TCGA-BRCA
S = observed synonymous mutations
ns_ratio ≈ 2.5 (genome-wide ratio of nonsynonymous to synonymous sites)
Protein Sequence Identity (%)¶
%ID = (number of identical amino acids / alignment length) × 100
Computed from Ensembl BioMart pairwise protein alignments between human and each ortholog species.
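As an illustration of the formula only (the platform takes these values from Ensembl rather than computing them), the sketch below evaluates %ID for two already-aligned protein strings. The function name and the gap convention are assumptions: gap columns count towards the alignment length and never count as identical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity of two aligned, equal-length protein sequences."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        raise ValueError("alignment is empty")
    identical = sum(1 for a, b in zip(aligned_a, aligned_b) if a == b and a != gap)
    return 100.0 * identical / len(aligned_a)
```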
FDR q-value (Benjamini-Hochberg)¶
When testing thousands of genes simultaneously, some will appear significant by chance. The False Discovery Rate (FDR) controls the expected proportion of false positives among all genes declared significant.
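q-value for the gene with the i-th smallest p-value (out of m total):
q_i = p_i × (m / i)
The resulting values are then made monotone in rank, as described under Benjamini-Hochberg FDR Correction in the Statistical Methods Reference.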
- q < 0.05: We expect fewer than 5% of genes called "significant" to be false positives
- q < 0.01: Fewer than 1% false positives expected
The BH procedure is less conservative than Bonferroni correction (which controls the probability of any false positive), making it more appropriate for genomic-scale analyses where we expect many true positives.
Cohen's d (Effect Size)¶
d = (mean₁ − mean₂) / pooled_standard_deviation
Measures the magnitude of difference between two groups (e.g., predictive vs background dN/dS) in standard deviation units:
| d Value | Interpretation |
|---|---|
| < 0.2 | Negligible effect |
| 0.2–0.5 | Small effect |
| 0.5–0.8 | Medium effect |
| > 0.8 | Large effect |
Cohen's d is reported on the Evolution page for the germline dN/dS comparison between ML-predictive and background genes.
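A minimal sketch of the calculation, assuming the usual pooled standard deviation built from the two sample variances weighted by their degrees of freedom. The source does not spell out which pooling formula is used, so treat that choice and the function name as assumptions.

```python
import math
import statistics

def cohens_d(group1, group2):
    """Cohen's d with a degrees-of-freedom-weighted pooled standard deviation."""
    n1, n2 = len(group1), len(group2)
    v1 = statistics.variance(group1)  # sample variance (n - 1 denominator)
    v2 = statistics.variance(group2)
    pooled_sd = math.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    return (statistics.fmean(group1) - statistics.fmean(group2)) / pooled_sd
```

With this sign convention, a negative d for predictive vs background dN/dS means the predictive genes have the lower mean dN/dS, that is, they are more conserved.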
Permutation Test P-value¶
To test whether ML-predictive genes are more conserved than expected by chance:
- Calculate the observed mean dN/dS for predictive genes
- Randomly shuffle gene labels (predictive/background) 10,000 times
- Calculate mean dN/dS for the random "predictive" set each time
- p-value = fraction of permutations where random mean ≤ observed mean
This is a non-parametric test that makes no assumptions about the distribution of dN/dS values.
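The four steps can be sketched as follows. This is an illustrative implementation, not the project's code: the function name and the fixed seed are assumptions, and the test is one-sided in the "more conserved" direction described above.

```python
import random
import statistics

def permutation_p_value(predictive, background, n_perm=10_000, seed=0):
    """One-sided permutation p-value: is the predictive mean unusually low?"""
    rng = random.Random(seed)
    observed = statistics.fmean(predictive)
    pooled = list(predictive) + list(background)
    k = len(predictive)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random relabelling that preserves group sizes
        if statistics.fmean(pooled[:k]) <= observed:
            hits += 1
    return hits / n_perm
```

With 10,000 permutations the smallest non-zero p-value this can report is 0.0001, so a result of 0 is better read as "p < 0.0001" than as exactly zero.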
7. Evolutionary Analysis In Depth¶
7.1 Germline dN/dS: The Nei-Gojobori Method¶
The Nei-Gojobori (1986) method is used to compute dN/dS between human and mouse orthologs.
Step 1: Classify Sites¶
For each codon in the aligned sequences, determine how many of its 9 possible point mutations (3 positions × 3 alternative nucleotides) are synonymous vs nonsynonymous. This gives the number of synonymous sites (S) and nonsynonymous sites (N) per codon. Sum across all codons for total S and N.
Example: The codon TTT (Phe):
- Position 1: Any change → different amino acid (3 nonsynonymous changes)
- Position 2: Any change → different amino acid (3 nonsynonymous changes)
- Position 3: TTC (Phe) = synonymous; TTA (Leu) = nonsynonymous; TTG (Leu) = nonsynonymous
- Total: 8 nonsynonymous changes and 1 synonymous change. Each change counts as 1/3 of a site, so N = 8/3 and S = 1/3 for this codon (N + S = 3)
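The site-counting rule can be checked with a short script. The sketch below uses the standard genetic code and exact fractions. It assumes that a change to a stop codon counts as nonsynonymous, which is a simplification the source does not specify, and the function name is illustrative.

```python
from fractions import Fraction

BASES = "TCAG"
# Standard genetic code, with codons ordered T, C, A, G at each position.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def count_sites(codon):
    """Return (S, N): synonymous and nonsynonymous sites for one codon."""
    amino_acid = CODON_TABLE[codon]
    synonymous_changes = 0
    for pos in range(3):
        for base in BASES:
            if base == codon[pos]:
                continue
            mutant = codon[:pos] + base + codon[pos + 1:]
            if CODON_TABLE[mutant] == amino_acid:
                synonymous_changes += 1
    s = Fraction(synonymous_changes, 3)  # each of the 9 changes is 1/3 of a site
    return s, 3 - s
```

Calling `count_sites("TTT")` gives S = 1/3 and N = 8/3, matching the worked example. Summing the two values over every codon of a gene gives the totals used in Step 3.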
Step 2: Count Differences¶
Compare human and mouse codons at each aligned position. Classify each difference as synonymous or nonsynonymous: sd (synonymous differences) and nd (nonsynonymous differences).
Step 3: Compute Proportions¶
pS = sd / S (proportion of synonymous sites that differ)
pN = nd / N (proportion of nonsynonymous sites that differ)
Step 4: Jukes-Cantor Correction¶
Over 90 million years, some sites have mutated multiple times. The Jukes-Cantor model corrects for these "multiple hits" that are invisible in pairwise comparison:
dS = -(3/4) × ln(1 - (4/3) × pS)
dN = -(3/4) × ln(1 - (4/3) × pN)
Why is this necessary? If two species have diverged for a very long time, a site that mutated A→G→C will appear as A→C (one change), hiding the intermediate step. The JC correction estimates the true number of substitutions from the observed proportion of differences. Without it, dN and dS would be systematically underestimated, especially for highly divergent sequences.
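Steps 3 and 4, together with the final ratio, fit in a few lines. In the sketch below the function names are hypothetical, and the handling of edge cases (a saturated proportion of 0.75 or more, or a zero dS) is an assumption rather than documented pipeline behaviour.

```python
import math

def jukes_cantor(p):
    """Jukes-Cantor distance for an observed proportion of differing sites."""
    if p < 0:
        raise ValueError("proportion must be non-negative")
    if p >= 0.75:
        return math.inf  # saturated: the correction is undefined at or above 3/4
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p)

def germline_omega(sd, nd, S, N):
    """dN/dS from difference counts (sd, nd) and site counts (S, N)."""
    dS = jukes_cantor(sd / S)
    dN = jukes_cantor(nd / N)
    if dS == 0 or math.isinf(dS):
        return math.nan  # ratio is not interpretable without usable synonymous divergence
    return dN / dS
```

Because the correction grows faster for larger proportions, it usually inflates dS more than dN in conserved genes, so the corrected ratio tends to be somewhat lower than the raw pN/pS.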
Step 5: Compute Ratio¶
ω = dN / dS
7.2 Somatic dN/dS Calculation¶
For each gene, using all coding somatic mutations observed across TCGA-BRCA samples:
Counts¶
N = number of nonsynonymous somatic mutations
S = number of synonymous somatic mutations
Nonsynonymous includes: Missense, Nonsense, Frame_Shift_Del, Frame_Shift_Ins, Splice_Site, In_Frame_Del, In_Frame_Ins, Nonstop, Translation_Start_Site.
Expected Ratio¶
Under neutral evolution, the ratio N/S should equal the ratio of nonsynonymous to synonymous sites in the genome:
ns_ratio ≈ 2.5
This comes from the genetic code structure: approximately 71.5% of all possible point mutations in coding sequences are nonsynonymous, giving a ratio of ~2.5:1 nonsynonymous-to-synonymous sites.
dN/dS Formula¶
ω_somatic = (N / S) / ns_ratio = (N / S) / 2.5
If S = 0 (no synonymous mutations observed), ω = ∞ (infinite).
Statistical Test¶
Null hypothesis: N/(N+S) = ns_ratio/(1+ns_ratio) = 2.5/3.5 ≈ 0.714
Test: Exact binomial test (two-sided): does the observed proportion of nonsynonymous mutations differ significantly from 0.714?
from scipy.stats import binomtest  # assumes SciPy >= 1.7; replaces the older binom_test
p_value = binomtest(N, N + S, p=ns_ratio / (1 + ns_ratio), alternative='two-sided').pvalue
Confidence Interval¶
Using the delta method on log(N/S):
SE = sqrt(1/N + 1/S)
95% CI = exp(log(N/S) ± 1.96 × SE) / ns_ratio
When S = 0, the CI is undefined (logged as NaN).
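The counts, ratio, test and interval above can be combined into one small function. The sketch below is dependency-free and illustrative only: the names are hypothetical, the exact two-sided binomial p-value is computed by summing all outcomes no more likely than the observed one (one common convention, assumed here), and the direct use of `math.comb` is only suitable for modest mutation counts.

```python
import math

NS_RATIO = 2.5  # genome-wide nonsynonymous:synonymous site ratio from the text

def binom_two_sided_p(k, n, p):
    """Exact two-sided binomial p-value (sum of outcomes as rare as the observed one)."""
    pmf = [math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    cutoff = pmf[k] * (1 + 1e-7)  # small tolerance for floating-point ties
    return min(1.0, sum(x for x in pmf if x <= cutoff))

def somatic_dnds(n_nonsyn, n_syn, ns_ratio=NS_RATIO):
    """Return (omega, (ci_low, ci_high), p_value) for one gene."""
    total = n_nonsyn + n_syn
    if total < 2:
        raise ValueError("at least 2 mutations per gene are required")
    p_null = ns_ratio / (1 + ns_ratio)
    p_value = binom_two_sided_p(n_nonsyn, total, p_null)
    if n_syn == 0:
        return math.inf, (math.nan, math.nan), p_value
    if n_nonsyn == 0:
        return 0.0, (math.nan, math.nan), p_value
    log_ratio = math.log(n_nonsyn / n_syn)
    se = math.sqrt(1 / n_nonsyn + 1 / n_syn)  # delta method on log(N/S)
    ci = (math.exp(log_ratio - 1.96 * se) / ns_ratio,
          math.exp(log_ratio + 1.96 * se) / ns_ratio)
    return (n_nonsyn / n_syn) / ns_ratio, ci, p_value
```

With the archived GATA3 counts (99 nonsynonymous, 2 synonymous) this returns ω = 19.8 with a lower confidence bound well above 1. With 2 nonsynonymous and 0 synonymous mutations it returns ω = ∞, an undefined interval, and a p-value of 1.0, which shows why an infinite ratio built on so few mutations is weak evidence.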
Multiple Testing Correction¶
q_values = BH_FDR_correction(p_values)
Applied across all ~13,208 genes tested. Controls FDR at α = 0.05.
7.3 Multi-Species Conservation Analysis¶
Protein identity is measured across four species at different evolutionary distances:
| Species | Divergence (MYA) | Biological Significance |
|---|---|---|
| Mouse | ~90 | Primary comparison: sufficient divergence to measure selection, close enough for reliable ortholog identification |
| Rat | ~90 | Independent replicate of the mouse comparison |
| Dog | ~96 | Slightly more distant; confirms patterns seen in rodents |
| Zebrafish | ~435 | Very distant: only the most ancient, universal functions show high conservation here |
Genes that maintain high protein identity even at the zebrafish comparison (~435 million years) are under the most extreme evolutionary constraint: they perform functions so fundamental that they predate the divergence of fish and mammals.
8. Candidate Gene Profiles¶
⚠️ Archive: Previous Pipeline Results (Pre-April 2026). The following profiles show results from an earlier pipeline version using different thresholds (dN/dS > 1.0, raw p-value < 0.05). These have been superseded by the current results above. Retained for reference only.
Detailed biological profiles of key candidate genes:
POU3F3 (Priority Score: 1.56, Highest)¶
- Full name: POU Class 3 Homeobox 3
- Germline dN/dS: 0.004 (among the most conserved genes in the genome)
- Mouse protein identity: 99.0%
- Somatic dN/dS: ∞ (2 nonsynonymous, 0 synonymous)
- Function: Transcription factor in neural development and cell differentiation
- Why it matters: A protein 99% identical between human and mouse after 90 million years of evolution is under extraordinarily strong constraint. Its appearance as an ML-predictive gene suggests it plays a previously unrecognised role in cancer transcriptomics. The somatic mutations, while few, are 100% protein-altering.
- Reliability: Weak (only 2 mutations; more data needed to confirm)
- Category: Multi-cancer classification gene
GATA3 (Priority Score: 1.00, Strong Evidence)¶
- Full name: GATA Binding Protein 3
- Germline dN/dS: 0.032 (very strongly conserved)
- Somatic dN/dS: 19.8 (99 nonsynonymous, 2 synonymous)
- FDR q: 9.4 × 10⁻¹² (extremely significant)
- Function: Master transcription factor for luminal breast epithelial differentiation. Directly regulates ESR1 (oestrogen receptor) expression. Mutations cluster in the zinc finger DNA-binding domain and C-terminal transactivation domain.
- Why it matters: GATA3 is the third most frequently mutated gene in breast cancer. Its mutations alter luminal differentiation programmes, and it is strongly conserved across 90 million years of evolution (dN/dS = 0.032 means only 3.2% as many amino acid changes as expected under neutrality). This gene perfectly exemplifies the cancer-dependency hypothesis: an essential transcription factor whose modification drives breast cancer biology.
- Reliability: Strong (99 nonsynonymous mutations, FDR < 10⁻¹¹)
- Category: Subtype-specific (Luminal A vs Basal-like discriminator)
FOXA1 (Priority Score: 0.88, Strong Evidence)¶
- Full name: Forkhead Box A1
- Germline dN/dS: 0.059 (very strongly conserved)
- Somatic dN/dS: 13.6 (34 nonsynonymous, 1 synonymous)
- FDR q: 6.3 × 10⁻⁴ (highly significant)
- Function: Pioneer transcription factor that opens chromatin to enable oestrogen receptor binding. Mutations in breast cancer cluster in the forkhead domain and alter ER-dependent gene programmes. FOXA1 mutations are mutually exclusive with GATA3 mutations, suggesting they affect the same pathway.
- Why it matters: FOXA1 is a "pioneer factor": it physically opens tightly packed chromatin to allow other transcription factors (especially ER) access to their target genes. It is conserved to dN/dS = 0.059 (94.1% of amino acid changes removed by selection). Cancer specifically mutates this chromatin gateway to reprogram gene expression.
- Reliability: Strong (34 nonsynonymous mutations, FDR < 0.001)
- Category: Subtype-specific
SPP1 (Priority Score: 1.52, Second Highest)¶
- Full name: Secreted Phosphoprotein 1 (Osteopontin)
- Germline dN/dS: 0.028 (strongly conserved)
- Somatic dN/dS: ∞ (4 nonsynonymous, 0 synonymous)
- Function: Extracellular matrix glycoprotein involved in cell adhesion, migration, and immune modulation. Overexpressed in many cancers and promotes metastasis.
- Why it matters: SPP1/Osteopontin is a well-known metastasis promoter. Its extreme conservation (dN/dS = 0.028) reflects its essential role in tissue remodelling and immune signalling. All four somatic mutations are protein-altering, suggesting selective pressure on its protein function in tumours.
- Reliability: Weak (only 4 mutations, but all nonsynonymous)
FOXC1 (Priority Score: 1.16)¶
- Full name: Forkhead Box C1
- Germline dN/dS: 0.085 (strongly conserved)
- Somatic dN/dS: ∞ (3 nonsynonymous, 0 synonymous)
- Function: Transcription factor critical for mesenchymal differentiation, neural crest development, and vascular formation. Overexpression is associated with basal-like breast cancer and poor prognosis.
- Why it matters: FOXC1 marks the basal-like subtype, the most aggressive form of breast cancer. Its conservation reflects essential developmental functions. Its role in epithelial-mesenchymal transition (EMT) connects it directly to cancer invasion and metastasis.
- Category: Subtype-specific
MSX2 (Priority Score: 1.28)¶
- Full name: Msh Homeobox 2
- Germline dN/dS: 0.080 (strongly conserved)
- Somatic dN/dS: ∞ (3 nonsynonymous, 0 synonymous)
- Function: Homeobox transcription factor involved in limb and craniofacial development, bone morphogenesis, and mammary gland development.
- Why it matters: MSX2 plays a role in mammary gland development and has been implicated in breast cancer cell proliferation and apoptosis resistance. Its strong conservation underscores its developmental importance.
- Category: Subtype-specific
9. Validation Framework [NEW]¶
The pipeline now incorporates multiple orthogonal validation layers to ensure candidate genes are biologically meaningful and not artefacts of data processing.
9.1 Cross-Cancer Validation¶
Genes identified as candidates in ≥2 independent cancer types are flagged as cross-cancer validated. This addresses the concern that single-cohort findings may be idiosyncratic.
| Metric | Value |
|---|---|
| Total candidates across 5 cancer types | 163 |
| Cross-cancer validated (≥2 types) | 15 genes |
| Cancer types analysed | BRCA, BLCA, PRAD, LUAD, UCEC |
9.2 Kaplan-Meier Survival Analysis¶
For each candidate gene, patients are stratified into high vs low expression groups (median split), and survival curves are compared using the log-rank test.
- Library: lifelines (Python)
- Output: logrank p-value, hazard ratio estimate, median survival times
- Validation threshold: p < 0.05 (with Bonferroni correction for multiple comparisons)
Genes where high expression correlates with worse survival provide additional evidence for clinical relevance.
9.3 GSEA Pathway Enrichment¶
Gene Set Enrichment Analysis tests whether candidate genes are enriched in known biological pathways.
- Databases: Reactome, KEGG, GO Biological Process
- Library: gseapy
- Output: Top 5 enriched pathways per cancer type, normalised enrichment scores, FDR q-values
Signatures where no coherent pathway emerges (all q > 0.1) are flagged as "biologically diffuse" for review.
9.4 Multi-Omics Convergence¶
Cross-validation against orthogonal TCGA data types strengthens confidence in expression-based findings.
| Evidence Layer | Source | Expected Pattern |
|---|---|---|
| Copy Number Variation | TCGA CNV | Amplification/deletion correlates with expression |
| DNA Methylation | TCGA 450K/EPIC array | Promoter hypermethylation inversely correlates with expression |
Convergence Score (0–3):
- 0 = Expression only
- 1 = Expression + CNV
- 2 = Expression + Methylation
- 3 = Expression + CNV + Methylation (most trusted)
Genes with convergence score ≥ 2 are prioritised in downstream analyses.
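The scoring rule is a simple lookup on two yes/no evidence flags. A minimal sketch, assuming each evidence layer has already been reduced to a boolean (how "supports the expression finding" is decided is not specified here, and the function names are hypothetical):

```python
def convergence_score(cnv_support, methylation_support):
    """Map the two orthogonal evidence flags to the 0-3 convergence score."""
    if cnv_support and methylation_support:
        return 3  # expression + CNV + methylation
    if methylation_support:
        return 2  # expression + methylation
    if cnv_support:
        return 1  # expression + CNV
    return 0      # expression only

def is_prioritised(cnv_support, methylation_support):
    """Genes with a score of 2 or more are prioritised downstream."""
    return convergence_score(cnv_support, methylation_support) >= 2
```

Note that under this scale a gene with methylation support alone outranks one with CNV support alone, so prioritisation effectively requires methylation evidence.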
9.5 External Database Cross-Reference¶
Candidates are automatically annotated against curated cancer gene databases:
| Database | Description | Evidence Value |
|---|---|---|
| COSMIC Cancer Gene Census | 700+ genes with mechanistic evidence for cancer driver roles | Known oncogene/TSG |
| OncoKB | Clinically annotated cancer genes with therapeutic implications | Actionable mutation data |
Output columns: known_oncogene (bool), evidence_source ('COSMIC', 'OncoKB', 'both', 'novel')
9.6 Updated Filtering Thresholds [April 2026]¶
| Parameter | Old Value | New Value | Rationale |
|---|---|---|---|
| Somatic dN/dS threshold | > 1.0 | ≥ 1.5 | Reduces false positives from neutral drift |
| CI requirement | None | Lower bound > 1.0 | Ensures statistical confidence |
| FDR threshold | < 0.05 | < 0.05 (unchanged) | Standard significance level |
| Germline dN/dS | < 0.3 | < 0.3 (unchanged) | Strong purifying selection |
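Read together, the four "New Value" rows amount to a single conjunction of conditions. The sketch below encodes them for one gene. The `GeneStats` container and its field names are hypothetical, and the sketch does not model how genes with zero synonymous mutations (undefined CI) are treated, since the table does not say.

```python
from dataclasses import dataclass

@dataclass
class GeneStats:
    somatic_dnds: float   # somatic dN/dS point estimate
    ci_lower: float       # lower bound of the 95% CI for somatic dN/dS
    fdr_q: float          # BH-adjusted q-value
    germline_dnds: float  # human-mouse germline dN/dS

def passes_current_thresholds(gene):
    """Apply the April 2026 thresholds from the table above."""
    return (
        gene.somatic_dnds >= 1.5
        and gene.ci_lower > 1.0
        and gene.fdr_q < 0.05
        and gene.germline_dnds < 0.3
    )
```

A gene with somatic dN/dS = 1.2 would have passed the old "> 1.0" rule but fails here, even if its confidence interval excludes 1.0.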
10. Statistical Methods Reference¶
10.1 Permutation Testing¶
Used to test whether ML-predictive genes have significantly different dN/dS from random genes.
Procedure:
1. Observe the true difference in mean dN/dS between predictive (n=110) and background (n=465) gene sets
2. Randomly reassign "predictive" and "background" labels 10,000 times (preserving group sizes)
3. Compute the difference in means for each permutation
4. p-value = (number of permutations with difference ≥ observed) / 10,000
Advantage: Makes no assumptions about the distribution of dN/dS values (non-parametric). Robust to outliers and skewed distributions.
10.2 Mann-Whitney U Test¶
A non-parametric test comparing the ranks (not values) of two groups. Used as a complement to the permutation test for comparing dN/dS distributions.
Null hypothesis: The probability that a randomly chosen predictive gene has lower dN/dS than a randomly chosen background gene equals 50%.
10.3 Binomial Exact Test (Somatic dN/dS)¶
Tests whether the proportion of nonsynonymous mutations for a gene deviates from the neutral expectation.
H₀: P(nonsynonymous) = ns_ratio / (1 + ns_ratio) ≈ 0.714
H₁: P(nonsynonymous) ≠ 0.714 (two-sided)
Test statistic: N successes in (N + S) trials
Distribution: Binomial(N+S, 0.714) under H₀
10.4 Benjamini-Hochberg FDR Correction¶
For m genes tested:
1. Sort p-values: p₍₁₎ ≤ p₍₂₎ ≤ ... ≤ p₍ₘ₎
2. For the gene ranked i, compute the raw value p₍ᵢ₎ × m/i
3. Starting from the largest rank (where q₍ₘ₎ = p₍ₘ₎), enforce monotonicity: q₍ᵢ₎ = min(p₍ᵢ₎ × m/i, q₍ᵢ₊₁₎)
Interpretation: At FDR = 0.05, we accept that ~5% of genes declared significant may be false positives. With 4,615 significant genes, we expect up to ~231 false positives and at least ~4,384 true positives.
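The procedure can be written as a single backward pass over the sorted p-values. A minimal sketch with a hypothetical function name, returning q-values in the same order as the input:

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted q-values in the original input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    q_values = [0.0] * m
    running_min = 1.0  # also caps q-values at 1
    # Walk from the largest p-value down so each q is the minimum over higher ranks.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        q_values[idx] = running_min
    return q_values
```

For the p-values 0.005, 0.01, 0.03 and 0.04, the raw values p × m/i are 0.02, 0.02, 0.04 and 0.04, which are already monotone, so the q-values equal the raw values.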
10.5 Delta Method (Confidence Intervals for dN/dS)¶
For somatic dN/dS = (N/S)/ns_ratio:
log(N/S) is approximately normal for large N, S
Var(log(N/S)) ≈ 1/N + 1/S
SE = sqrt(1/N + 1/S)
95% CI for dN/dS:
Lower = exp(log(N/S) - 1.96 × SE) / ns_ratio
Upper = exp(log(N/S) + 1.96 × SE) / ns_ratio
When S = 0, the CI is undefined because log(N/0) = ∞.
11. Limitations & Caveats¶
11.1 Data Limitations¶
- Bulk RNA-seq: Measures average gene expression across all cells in a tissue sample. Cannot distinguish tumour cell expression from stromal, immune, or vascular cell contributions. Single-cell RNA-seq would provide finer resolution.
- TCGA cohort biases: TCGA samples are predominantly from North American patients and may not represent global genetic diversity. The cohort contains treatment-naïve primary tumours only, so it does not capture metastatic or treated disease.
- Somatic mutation calling: Depends on the specific variant-calling pipeline used by TCGA. Different pipelines can produce different mutation lists, particularly for indels and low-frequency variants.
11.2 Methodological Limitations¶
- Germline dN/dS assumptions: The Nei-Gojobori method assumes a single substitution rate across the gene. In reality, different protein domains evolve at different rates. The Jukes-Cantor model assumes equal substitution rates among all nucleotides, which is a simplification.
- Somatic ns_ratio = 2.5: This genome-wide average may not be accurate for individual genes. Genes with unusual codon usage patterns may have different expected N/S ratios. More sophisticated methods (e.g., dNdScv) account for gene-specific mutational context.
- Infinite dN/dS: In the current pipeline, only 4 of 163 candidates have zero synonymous mutations (TP53-PRAD, PTEN-BRCA, KRAS-LUAD, HNRNPD-UCEC), all of which are established cancer drivers. The previous pipeline version (pre-April 2026) had 23/25 candidates with S=0, which was addressed by switching to FDR-based filtering.
- Multiple testing at gene level: While FDR correction is applied within the somatic analysis, the overall analysis pipeline involves many choices (which genes, which thresholds, which models) that collectively inflate the risk of finding patterns by chance.
11.3 Interpretation Limitations¶
- Correlation vs causation: ML identifies genes whose expression correlates with tumour state. Some may be consequences of cancer (reactive changes in surrounding tissue) rather than causes.
- Germline ≠ somatic function: A gene conserved in the germline is important for the organism, but its role in cancer may be completely different from its normal function. Conservation tells us the protein matters, not how cancer uses it.
- Cancer type specificity: Results are primarily driven by TCGA-BRCA data. Candidate genes may not be relevant to other cancer types. Cross-cancer validation is planned but not yet implemented for somatic analysis.
12. Glossary¶
| Term | Definition |
|---|---|
| AUC | Area Under the Curve: summary measure of ROC curve performance (0.5 = random, 1.0 = perfect) |
| Basal-like | Aggressive breast cancer subtype; triple-negative (ER−, PR−, HER2−), high proliferation |
| BH-FDR | Benjamini-Hochberg False Discovery Rate: multiple testing correction method |
| BRCA | Breast invasive carcinoma (TCGA project code) |
| Cancer dependency | A gene whose function cancer cells require for survival; potential drug target |
| CDS | Coding DNA Sequence: the portion of a gene that encodes protein |
| Cohen's d | Effect size measure; difference between group means in standard deviation units |
| Codon | Three-nucleotide unit of DNA/RNA that specifies one amino acid |
| dN | Rate of nonsynonymous substitutions per nonsynonymous site |
| dN/dS (ω) | Ratio of nonsynonymous to synonymous substitution rates; measures selection pressure |
| dS | Rate of synonymous substitutions per synonymous site |
| Driver mutation | Somatic mutation that confers growth advantage to cancer cells |
| Dropout | Neural network regularisation: randomly disables neurons during training to prevent overfitting |
| FDR | False Discovery Rate: expected proportion of false positives among declared significant results |
| Feature importance | How much a gene contributes to ML model predictions |
| Gini importance | Random Forest metric: mean decrease in Gini impurity when a gene is used for tree splitting |
| Germline | Inherited genetic material; germline dN/dS measures selection across species evolution |
| Hallmarks of cancer | Set of biological capabilities acquired by cancer cells (Hanahan & Weinberg) |
| HGNC | HUGO Gene Nomenclature Committee: assigns official gene symbols |
| Jukes-Cantor | Statistical model correcting for unobserved multiple substitutions at the same site |
| L2 regularisation | Penalises large model coefficients to prevent overfitting; shrinks weights toward zero |
| Log₂ fold-change | log₂(tumour expression / normal expression); measures magnitude and direction of expression change |
| LUAD | Lung adenocarcinoma (TCGA project code) |
| Luminal A | Breast cancer subtype; ER+, PR+, HER2−, low proliferation, best prognosis |
| MAF | Mutation Annotation Format: standard file format for somatic mutation data |
| MLP | Multi-Layer Perceptron: feedforward neural network with multiple hidden layers |
| MYA | Million Years Ago: unit for evolutionary divergence time |
| Nei-Gojobori | Method for computing dN/dS from pairwise sequence alignments |
| Nonsynonymous | Mutation that changes the encoded amino acid (protein-altering) |
| ns_ratio | Genome-wide ratio of nonsynonymous to synonymous sites (~2.5) |
| Ortholog | Genes in different species derived from a common ancestral gene |
| PAM50 | 50-gene panel used to classify breast cancer molecular subtypes |
| Passenger mutation | Somatic mutation with no effect on cancer fitness; neutral hitchhiker |
| Permutation test | Non-parametric significance test using random label shuffling |
| Positive selection | Evolutionary process favouring advantageous mutations (dN/dS > 1) |
| Priority score | Composite ranking combining germline conservation, somatic selection, and expression data |
| Purifying selection | Evolutionary process removing harmful mutations (dN/dS < 1) |
| q-value | FDR-adjusted p-value; the smallest FDR at which a result would be declared significant |
| RNA-seq | RNA sequencing: high-throughput measurement of gene expression levels |
| ROC | Receiver Operating Characteristic: curve plotting sensitivity vs false positive rate |
| RSEM | RNA-Seq by Expectation Maximization: method for quantifying gene expression |
| Somatic | Mutations arising in body cells (not inherited); somatic dN/dS measures selection in tumours |
| Stratified split | Train/test partition preserving class proportions |
| Synonymous | Mutation that does NOT change the encoded amino acid (silent/neutral) |
| TCGA | The Cancer Genome Atlas: NIH-funded multi-cancer molecular characterisation project |
| Z-score | Standardised value: (x − mean) / standard deviation; centres data at 0 with unit variance |
This documentation was generated for cancertranscriptomics.space. For questions or feedback, contact Polat Bakır.