Cancer Transcriptomics ML: Complete Documentation

Website: cancertranscriptomics.space
Author: Polat Bakır
Purpose: Research & educational platform integrating machine-learning tumour classification with evolutionary analysis of gene signatures across 5 TCGA cancer types (BRCA, BLCA, PRAD, LUAD, UCEC)


Table of Contents

  1. Project Overview
  2. Core Hypothesis
  3. Biological Background
  4. Data Sources
  5. Page-by-Page Guide
     5.1 Overview Page
     5.2 Models Page
     5.3 Signatures Page
     5.4 Evolution Page
     5.5 Results Page
     5.6 How It Works Page
     5.7 Methods Page
  6. Key Metrics & Statistics Explained
  7. Evolutionary Analysis In Depth
  8. Candidate Gene Profiles
  9. Validation Framework [NEW]
  10. Statistical Methods Reference
  11. Limitations & Caveats
  12. Glossary

1. Project Overview

Cancer Transcriptomics ML is a computational biology platform that asks a deceptively simple question: Can the genes a machine-learning model uses to tell tumour tissue from normal tissue also teach us something about the evolutionary forces shaping cancer?

The platform performs two complementary analyses:

| Analysis Layer | What It Does | Timescale |
| --- | --- | --- |
| Machine Learning Classification | Trains three different ML models to distinguish tumour from normal tissue using RNA-seq gene expression data | Present-day snapshot |
| Evolutionary Constraint Analysis | Measures how strongly natural selection has acted on the ML-identified genes, both across species (germline) and within tumours (somatic) | Millions of years (germline) to years/decades (somatic) |

The key insight is that genes identified purely by their expression patterns (ML) turn out to be under unusually strong evolutionary constraint: they encode proteins so important for cellular function that evolution has kept them nearly unchanged for 90 million years. Yet paradoxically, these same genes accumulate protein-altering mutations in cancer at rates far above neutral expectation. This "selection paradox" is the central finding of the project.

Technical Stack

  • Backend: Python FastAPI with Jinja2 templates
  • Frontend: Vanilla JavaScript + Plotly.js for interactive charts, D3.js for the mind map
  • Data: TCGA (The Cancer Genome Atlas) RNA-seq expression, somatic mutations, and Ensembl ortholog sequences
  • ML Models: Logistic Regression, Random Forest, Multi-Layer Perceptron neural network

2. Core Hypothesis

Cancer-Maintaining Dependencies Hypothesis: Genes that are (a) identified by ML as predictive of tumour state, (b) deeply conserved across mammalian evolution (low germline dN/dS), and (c) under positive selection in tumours (high somatic dN/dS) represent core cellular functions that cancer depends on for survival.

This hypothesis rests on three pillars:

  1. ML Prediction → The gene's expression level reliably differs between tumour and normal tissue, meaning cancer consistently alters this gene's activity.
  2. Germline Conservation → The gene's protein has been kept nearly identical for ~90 million years of mammalian evolution, meaning the protein does something so essential that most mutations to it are lethal and removed by natural selection.
  3. Somatic Positive Selection → Within breast tumours, the gene accumulates more protein-changing mutations than expected by chance, meaning tumour cells that mutate this gene gain a growth advantage.

The intersection of all three suggests these genes are not merely bystanders in cancer; they are load-bearing pillars that cancer cannot lose. Such genes are prime candidates for therapeutic targeting because cancer cells depend on their function.
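The three-pillar intersection can be sketched as a simple filter. This is an illustration only, not the platform's code: the `GeneEvidence` record, its field names, and the four placeholder genes with their values are hypothetical, and the default cut-offs (germline dN/dS < 1, somatic dN/dS > 1) are the generic definitions of purifying and positive selection rather than the site's tuned thresholds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneEvidence:
    """Hypothetical per-gene record combining the three lines of evidence."""
    symbol: str
    ml_predictive: bool    # pillar 1: gene appears in an ML signature
    germline_dnds: float   # pillar 2: cross-species dN/dS
    somatic_dnds: float    # pillar 3: within-tumour dN/dS


def is_dual_pressure_candidate(gene, germline_max=1.0, somatic_min=1.0):
    """True only when all three pillars hold for the gene."""
    return (
        gene.ml_predictive
        and gene.germline_dnds < germline_max
        and gene.somatic_dnds > somatic_min
    )


# Placeholder genes with made-up values, purely for illustration.
genes = [
    GeneEvidence("GENE_A", True, 0.10, 4.2),   # passes all three pillars
    GeneEvidence("GENE_B", True, 0.12, 0.9),   # no somatic positive selection
    GeneEvidence("GENE_C", False, 0.08, 6.0),  # not ML-predictive
    GeneEvidence("GENE_D", True, 1.30, 3.1),   # not conserved in the germline
]
candidates = [g.symbol for g in genes if is_dual_pressure_candidate(g)]
print(candidates)
```

Only the gene that satisfies every pillar survives the filter, which is why the candidate list is much shorter than the ML signature itself.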

The Three Hypotheses Tested

| Hypothesis | Statement | Assessment |
| --- | --- | --- |
| H1: ML signatures are conserved | Genes predictive of tumour state encode highly conserved proteins (germline dN/dS < genome average) | ✅ Supported: 96.4% under purifying selection, 80% under strong purifying selection |
| H2: Signatures are somatically selected | ML-predictive genes show elevated somatic dN/dS in TCGA cancer types | ✅ Supported: known drivers (TP53, PIK3CA, GATA3) detected; 163 total candidates across 5 cancer types (BRCA=6, BLCA=12, PRAD=1, LUAD=28, UCEC=116); 15 cross-cancer validated [UPDATED] |
| H3: Dual-pressure genes are dependencies | Genes under both germline constraint and somatic positive selection represent cancer dependencies | ⏳ Hypothesis generated: requires experimental validation (e.g., CRISPR screens, DepMap integration) |

3. Biological Background

This section provides the foundational biology needed to understand every metric, visualisation, and interpretation on the website. It is written so that someone with basic science literacy (but no prior bioinformatics or cancer biology training) can follow the entire platform.

3.0 The Central Dogma: From DNA to Protein

All life depends on a simple information flow:

DNA  →(transcription)→  mRNA  →(translation)→  Protein
  • DNA (deoxyribonucleic acid) is the permanent instruction manual stored in every cell's nucleus. The human genome contains ~20,000 protein-coding genes spread across 23 pairs of chromosomes (~3.2 billion nucleotide base pairs: A, T, G, C).
  • mRNA (messenger RNA) is a temporary copy of a gene. When a cell needs a particular protein, it "transcribes" the gene's DNA into mRNA. The amount of mRNA for a gene in a cell reflects how actively that gene is being used: its expression level.
  • Proteins are the molecular machines that do the actual work: enzymes catalyse reactions, transcription factors turn genes on/off, structural proteins build cell scaffolds, receptors receive signals, and antibodies defend against pathogens.

Why this matters for the project: RNA-seq measures mRNA levels for every gene. Because mRNA is the intermediate between the genetic blueprint (DNA) and the functional machinery (protein), measuring it tells us which parts of the genome each cell is actively using. Cancer cells use a very different set of genes compared to normal cells, and that difference is what the ML models detect.

3.0.1 What is a Gene?

A gene is a segment of DNA that contains the instructions for building one (or sometimes more) proteins. Key gene anatomy:

  • Exons: The portions of a gene that encode protein sequence. When exons are stitched together, they form the coding DNA sequence (CDS).
  • Introns: Non-coding stretches between exons; removed during mRNA processing ("splicing").
  • Promoter: A regulatory region upstream of the gene that controls when and how much the gene is transcribed.
  • Codon: A three-nucleotide unit within the CDS that specifies one amino acid. Of the 64 possible codons, 61 encode the 20 amino acids and 3 are stop signals. This redundancy (multiple codons → same amino acid) is the basis for distinguishing synonymous from nonsynonymous mutations.

3.0.2 What is a Protein?

A protein is a chain of amino acids (typically 100–3,000 residues long) that folds into a specific three-dimensional structure. The shape determines function:

  • Enzymes (e.g., kinases like PIK3CA): Catalyse biochemical reactions. A single amino acid change in the active site can destroy enzymatic activity or, in cancer, lock it permanently "on."
  • Transcription factors (e.g., GATA3, FOXA1, TP53): Bind specific DNA sequences to activate or repress target genes. Mutations in their DNA-binding domains can alter which genes they regulate.
  • Structural proteins (e.g., CDH1/E-cadherin): Maintain tissue architecture. Loss of CDH1 disrupts cell-cell adhesion and drives invasive lobular breast carcinoma.
  • Receptors (e.g., ESR1/oestrogen receptor, ERBB2/HER2): Detect extracellular signals and relay them inside the cell. Amplification or mutation of receptors can make cancer cells grow without external signals.

Why protein function matters here: When we measure dN/dS, we are asking how tolerant a protein is to amino acid changes. Proteins with critical, tightly optimised structures (like transcription factor DNA-binding domains) show very low dN/dS because almost any amino acid change breaks them.

3.0.3 What is a Mutation?

A mutation is any change in the DNA sequence. Mutations can be classified by:

By origin:

  • Germline mutations: Present in egg or sperm cells; inherited by offspring; present in every cell of the body. These are the mutations measured by comparing human and mouse DNA (germline dN/dS). They accumulate over millions of years.
  • Somatic mutations: Arise in a single body cell during a person's lifetime (due to DNA replication errors, carcinogen exposure, UV radiation, etc.). NOT inherited. These are the mutations measured in TCGA tumour samples (somatic dN/dS). They accumulate over years to decades.

By effect on protein:

  • Synonymous (silent): Changes the DNA codon but NOT the amino acid (e.g., GCC→GCT, both = Alanine). The protein is unaffected. These serve as the neutral baseline in dN/dS analysis.
  • Nonsynonymous: Changes both the codon AND the amino acid:
      • Missense: One amino acid → different amino acid (e.g., Val→Glu in BRAF V600E). May alter protein function.
      • Nonsense: Creates a premature stop codon → truncated, usually non-functional protein.
      • Frameshift: Insertion or deletion that shifts the reading frame → completely garbled protein downstream.
      • Splice-site: Disrupts mRNA splicing → abnormal exon usage → altered or absent protein.

3.0.4 What is Natural Selection?

Natural selection is the process by which organisms with traits that enhance survival and reproduction become more common in a population over time. At the molecular level:

  • Purifying (negative) selection: Harmful mutations are removed from the population because organisms carrying them are less fit. Proteins under purifying selection are conserved: they change very slowly over evolutionary time. A gene with dN/dS = 0.05 has had roughly 95% of its amino acid-changing mutations eliminated by purifying selection, relative to the neutral expectation.
  • Positive (Darwinian) selection: Beneficial mutations spread through the population because they confer an advantage. Proteins under positive selection accumulate amino acid changes faster than silent changes (dN/dS > 1). In cancer, "positive selection" means mutations that help tumour cells grow and survive.
  • Neutral drift: Mutations with no effect on fitness accumulate randomly (dN/dS ≈ 1). These are neither helpful nor harmful.

The key insight of this project: The same gene can be under purifying selection in the germline (the protein is essential, so don't change it) and under positive selection in somatic tumour cells (cancer benefits from specific modifications). This paradox reveals cancer's strategy: hijacking the cell's most critical machinery.

3.0.5 What is an Ortholog?

An ortholog is a gene in a different species that descended from the same ancestral gene through speciation (not gene duplication). Orthologs typically retain the same function across species:

  • Human TP53 and Mouse Trp53 are orthologs; both encode the p53 tumour suppressor protein.
  • They share ~80% amino acid identity after ~90 million years of independent evolution.
  • The fact that they are so similar after 90 million years means natural selection has been strongly conserving this protein in both lineages.
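Percent identity between two orthologous proteins can be computed from a pairwise alignment. The sketch below is a minimal illustration under stated assumptions: the function name is hypothetical, the two short sequences are made up (they are not TP53 or Trp53), and identity is defined here over ungapped alignment columns, which is only one of several conventions and may differ from the values Ensembl reports.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent of ungapped alignment columns where both residues match."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for res_a, res_b in zip(aligned_a, aligned_b):
        if res_a == gap or res_b == gap:
            continue  # skip columns containing a gap in either sequence
        compared += 1
        if res_a == res_b:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


# Made-up aligned fragments: one mismatch (Y vs H) and one gapped column.
toy_human = "MKTAYIAKQR"
toy_mouse = "MKTAHIAK-R"
identity = percent_identity(toy_human, toy_mouse)
print(f"{identity:.1f}% identity")
```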

Evolutionary distances used in this project:

| Species Pair | Divergence Time | What It Tells Us |
| --- | --- | --- |
| Human ↔ Mouse | ~90 MYA | Primary comparison; sufficient divergence to measure selection |
| Human ↔ Rat | ~90 MYA | Independent replicate of mouse comparison (rodent lineage) |
| Human ↔ Dog | ~96 MYA | Non-rodent comparison; validates patterns |
| Human ↔ Zebrafish | ~435 MYA | Extremely distant; only the most ancient, universally essential proteins remain conserved |

If a protein is >90% identical between human and zebrafish (435 million years apart), it is one of the most constrained proteins in the vertebrate genome: it performs a function so fundamental that it has been essentially unchanged since before the age of dinosaurs.

3.0.6 What is Cancer? A Molecular Perspective

Cancer is fundamentally a disease of uncontrolled cell growth driven by accumulated genetic alterations. A normal cell becomes cancerous through a multi-step process:

  1. Initiation: A cell acquires a mutation in a key growth-control gene (an "initiating" driver mutation). For example, a BRCA1 mutation compromises DNA repair.

  2. Promotion: Additional mutations accumulate over years/decades. Each mutation may provide a slight growth advantage: the cell divides a bit faster, survives a bit longer, or evades immune detection slightly better.

  3. Progression: The tumour becomes increasingly aggressive, invading surrounding tissues and eventually metastasising to distant organs.

A typical breast tumour carries 30–80 coding mutations, but only 3–6 are drivers; the rest are neutral passengers. This project's somatic dN/dS analysis aims to identify which genes carry driver mutations across the TCGA-BRCA cohort.

Cancer as evolution: A tumour is a population of cells undergoing Darwinian evolution. Cells with growth-promoting mutations outcompete neighbouring cells. This is why somatic dN/dS > 1 for driver genes: the nonsynonymous mutations are being positively selected because they help the tumour cell lineage expand. This within-patient evolution occurs on a timescale of years to decades, whereas germline evolution between species occurs over millions of years.

3.0.7 What is Gene Expression and Why Does Cancer Change It?

Gene expression refers to how "active" a gene is, quantified by the amount of mRNA it produces. In a normal cell, gene expression is tightly regulated: growth genes are activated when the cell needs to divide, then silenced when division is complete. In cancer:

  • Oncogenes (growth-promoting genes like MYC, PIK3CA, ERBB2) become overexpressed: stuck in the "on" position, driving constant proliferation.
  • Tumour suppressors (growth-inhibiting genes like TP53, RB1, BRCA1) become silenced or mutated: the brakes are removed.
  • Metabolic genes are reprogrammed: cancer cells switch to glycolysis even in the presence of oxygen (the Warburg effect).
  • Immune-evasion genes are activated: cancer hides from the immune system by expressing checkpoint ligands (PD-L1) or reducing antigen presentation.

These expression changes are massive, consistent, and detectable by ML. A gene like MMP11 (matrix metalloproteinase 11) may be expressed at 50× higher levels in breast tumours versus normal tissue, an easy signal for any classifier to learn. The fact that ML models achieve >99% accuracy tells us that cancer's transcriptomic rewiring is profound and reproducible across patients.

3.1 What is RNA-seq Transcriptomics?

Every cell in your body contains the same DNA, but different cell types express (activate) different sets of genes. RNA-seq (RNA sequencing) measures the activity level of every gene in a tissue sample by counting how many messenger RNA (mRNA) copies each gene has produced.

  • A gene with high expression produces many mRNA transcripts → the cell is actively using that gene's protein product.
  • A gene with low or zero expression is effectively "turned off" in that tissue.

In cancer, gene expression is dramatically altered. Tumour suppressor genes may be silenced while oncogenes are hyperactivated. By comparing expression profiles of tumour vs normal tissue, we can identify which genes are consistently dysregulated, and that is exactly what the ML models learn to do.

Key numbers in this project:

  • ~20,000 genes measured per sample
  • ~13,660 genes retained after filtering low-variance genes
  • Expression values are RSEM-normalised expected counts from TCGA

3.2 What is TCGA?

The Cancer Genome Atlas (TCGA) is a landmark NIH-funded project that molecularly characterised over 20,000 primary cancer and matched normal samples spanning 33 cancer types. It provides:

  • RNA-seq expression for every sample (what this project uses for ML)
  • Whole-exome somatic mutation data (what this project uses for somatic dN/dS)
  • Clinical metadata: tumour stage, molecular subtype, survival data

For breast cancer (BRCA), TCGA provides ~1,218 samples: approximately 1,104 primary tumour samples and 114 solid tissue normal samples from adjacent tissue.

3.3 Breast Cancer Molecular Subtypes (PAM50)

Breast cancer is not one disease; it comprises molecularly distinct subtypes identified by the PAM50 gene panel:

| Subtype | Frequency | Key Features | Prognosis |
| --- | --- | --- | --- |
| Luminal A | ~40% | ER+, PR+, HER2−, low proliferation | Best |
| Luminal B | ~20% | ER+, PR±, HER2±, high proliferation | Intermediate |
| HER2-enriched | ~15% | HER2 amplified, ER−, PR− | Poor (without targeted therapy) |
| Basal-like | ~15% | Triple-negative (ER−, PR−, HER2−), high proliferation | Worst |
| Normal-like | ~10% | Resembles normal breast tissue | Variable |

This project's Subtype task trains models to distinguish Luminal A from Basal-like, the two most molecularly distinct subtypes, and achieves perfect classification (AUC = 1.00), demonstrating how profoundly different their transcriptomic landscapes are.

3.4 The dN/dS Ratio: The Gold Standard for Measuring Selection

The dN/dS ratio (also called ω, omega) is the single most important metric on this website. Understanding it deeply is essential to interpreting every chart and table.

What Are Synonymous and Nonsynonymous Mutations?

DNA is read in codons, triplets of nucleotides that each specify an amino acid. Due to the redundancy of the genetic code, some DNA changes alter the resulting amino acid (and therefore the protein) while others do not:

  • Synonymous (silent) substitution: A DNA change that does NOT alter the amino acid. Example: GCC → GCT both encode Alanine. The protein is unchanged.
  • Nonsynonymous substitution: A DNA change that DOES alter the amino acid. Example: GCC (Ala) → GAC (Asp). The protein's structure and function may be affected.
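The synonymous/nonsynonymous distinction can be checked mechanically with the standard genetic code. The sketch below builds the codon table from the conventional TCAG ordering and classifies single-codon changes; the function name and the returned labels are illustrative choices, not part of the platform, and indels or splice-site changes are outside its scope.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (third base fastest).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_substitution(ref_codon: str, alt_codon: str) -> str:
    """Label a codon change by its effect on the encoded amino acid."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "nonsense"    # premature stop codon
    if ref_aa == "*":
        return "stop-lost"   # a stop codon became a sense codon
    return "missense"


print(classify_substitution("GCC", "GCT"))  # Ala -> Ala
print(classify_substitution("GCC", "GAC"))  # Ala -> Asp
print(classify_substitution("TGG", "TGA"))  # Trp -> stop
```

The same table confirms the codon arithmetic: 3 of the 64 entries are stop signals and the remaining 61 map onto 20 amino acids.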

Why Does This Matter?

Synonymous changes are largely invisible to natural selection because the protein is the same regardless. They accumulate at a roughly constant rate over evolutionary time and serve as a molecular clock, i.e. a baseline mutation rate.

Nonsynonymous changes, however, ARE visible to selection because they alter the protein. If a nonsynonymous change is harmful, natural selection will eliminate the organisms carrying it (purifying selection). If beneficial, selection will spread it through the population (positive selection).

The Ratio

dN/dS = (rate of nonsynonymous substitution) / (rate of synonymous substitution)

| dN/dS Value | Interpretation | Biological Meaning |
| --- | --- | --- |
| dN/dS ≪ 1 (e.g., 0.05) | Strong purifying selection | The protein is under intense functional constraint. Almost every amino acid change is harmful and removed by selection. The gene encodes something essential. |
| dN/dS < 1 (e.g., 0.3–0.9) | Purifying selection | The protein is functionally important but tolerates some variation. |
| dN/dS ≈ 1 | Neutral evolution | Nonsynonymous changes accumulate at the same rate as synonymous; the protein is under no selective pressure. |
| dN/dS > 1 (e.g., 2.0+) | Positive selection | Nonsynonymous changes accumulate faster than synonymous. The protein is being actively modified by selection; amino acid changes confer an advantage. |
| dN/dS = ∞ | Infinite (n_syn = 0) | All observed mutations are protein-altering, with zero synonymous mutations. Strong positive selection signal, though with high statistical uncertainty due to lack of synonymous baseline. |
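The interpretation bands can be written as a small lookup. This is a sketch under assumptions: the 0.3 cut-off for "strong" purifying selection is the one this document quotes for its key findings, while the ±0.1 tolerance around 1 used for "approximately neutral" is an illustrative choice with no statistical meaning. A point estimate alone does not establish selection; a confidence interval or test is still needed.

```python
import math


def interpret_dnds(omega: float, strong_cutoff: float = 0.3,
                   neutral_tol: float = 0.1) -> str:
    """Map a dN/dS point estimate onto the qualitative bands of the table."""
    if math.isnan(omega) or omega < 0:
        raise ValueError("dN/dS must be a non-negative number")
    if math.isinf(omega):
        return "infinite (no synonymous baseline)"
    if omega < strong_cutoff:
        return "strong purifying selection"
    if omega < 1.0 - neutral_tol:
        return "purifying selection"
    if omega <= 1.0 + neutral_tol:
        return "approximately neutral"
    return "positive selection"


for value in (0.05, 0.5, 1.0, 2.0, math.inf):
    print(value, "->", interpret_dnds(value))
```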

3.5 Germline vs Somatic dN/dS: Two Timescales of Selection

This project applies dN/dS analysis at two fundamentally different timescales:

Germline dN/dS (Deep Evolutionary Conservation)

  • What it measures: How much the protein has changed between human and mouse since their common ancestor ~90 million years ago.
  • Timescale: Millions of years of evolution.
  • What it tells us: If a protein has remained nearly identical for 90 million years of mammalian evolution, it performs a function so critical that almost any change to it is lethal. These are the cell's most essential genes.
  • Method: Nei-Gojobori with Jukes-Cantor correction (see Section 7).
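The Jukes-Cantor step of the germline method can be illustrated in isolation. The sketch assumes the Nei-Gojobori counting stage has already produced the numbers of synonymous and nonsynonymous sites and differences for an aligned ortholog pair; that counting stage is omitted here, and the function names and example counts are hypothetical. The correction converts an observed proportion of differences p into an estimated number of substitutions per site, d = -(3/4) ln(1 - 4p/3).

```python
import math


def jukes_cantor(p: float) -> float:
    """Jukes-Cantor distance for an observed proportion of differing sites."""
    if not 0.0 <= p < 0.75:
        raise ValueError("proportion must lie in [0, 0.75) for the correction")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def germline_dnds(nonsyn_diffs: float, nonsyn_sites: float,
                  syn_diffs: float, syn_sites: float) -> float:
    """dN/dS from site and difference counts (counting stage not shown)."""
    d_n = jukes_cantor(nonsyn_diffs / nonsyn_sites)
    d_s = jukes_cantor(syn_diffs / syn_sites)
    if d_s == 0.0:
        return math.inf if d_n > 0.0 else math.nan
    return d_n / d_s


# Hypothetical counts for a well-conserved gene: few nonsynonymous
# differences over many sites, many synonymous differences over fewer sites.
omega = germline_dnds(nonsyn_diffs=12, nonsyn_sites=900,
                      syn_diffs=90, syn_sites=300)
print(round(omega, 3))
```

The correction matters most for the synonymous rate, where the observed proportion is large and multiple hits at the same site are likely.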

Somatic dN/dS (Within-Tumour Selection)

  • What it measures: Whether protein-altering mutations in a gene accumulate more frequently than expected by chance across TCGA-BRCA tumour samples.
  • Timescale: Years to decades (the lifetime of a tumour).
  • What it tells us: If a gene has somatic dN/dS > 1, tumour cells that acquire mutations in this gene have a growth advantage: the mutations are being positively selected during tumour evolution.
  • Method: Binomial test comparing observed nonsynonymous/synonymous ratio to genome-wide expectation (ns_ratio ≈ 2.5).
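A minimal sketch of the somatic calculation follows, assuming the ratio is normalised as (n_nonsyn / n_syn) / ns_ratio and that the test is a one-sided exact binomial test with neutral probability ns_ratio / (1 + ns_ratio). Both are assumptions: the normalisation is consistent with the GATA3 figures quoted elsewhere in this document (99 nonsynonymous and 2 synonymous mutations giving 19.8), but the sidedness and exact form of the site's test are not stated. Function names are illustrative.

```python
import math

NS_RATIO = 2.5  # genome-wide expected nonsynonymous:synonymous ratio


def somatic_dnds(n_nonsyn: int, n_syn: int, ns_ratio: float = NS_RATIO) -> float:
    """Observed nonsyn/syn ratio divided by the neutral expectation."""
    if n_syn == 0:
        return math.inf if n_nonsyn > 0 else math.nan
    return (n_nonsyn / n_syn) / ns_ratio


def binomial_p_value(n_nonsyn: int, n_syn: int,
                     ns_ratio: float = NS_RATIO) -> float:
    """One-sided P(X >= n_nonsyn) if mutations were neutral."""
    n = n_nonsyn + n_syn
    p0 = ns_ratio / (1.0 + ns_ratio)  # neutral chance a mutation is nonsynonymous
    return sum(
        math.comb(n, k) * p0**k * (1.0 - p0) ** (n - k)
        for k in range(n_nonsyn, n + 1)
    )


# GATA3 counts as quoted in this document: 99 nonsynonymous, 2 synonymous.
gata3_dnds = somatic_dnds(99, 2)
gata3_p = binomial_p_value(99, 2)
print(round(gata3_dnds, 1), gata3_p)
```

With zero synonymous mutations the ratio is reported as infinite, which is why such genes need the p-value and confidence interval to be interpreted at all.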

Critical caveat: Germline and somatic dN/dS values are NOT directly comparable in absolute magnitude. They operate on different timescales, use different methods, and have different baselines. The website uses them for relative ranking within each domain and for identifying genes that are extreme in BOTH domains simultaneously.

3.6 The Cancer Hallmarks

Cancer is characterised by a set of acquired capabilities known as the Hallmarks of Cancer (Hanahan & Weinberg, 2000; updated 2011 and 2022). These are the fundamental biological programmes that all cancers must activate:

| # | Hallmark | What It Means | Example Gene on This Site |
| --- | --- | --- | --- |
| 1 | Sustaining proliferative signalling | Cancer cells produce their own growth signals or amplify receptors so they don't need external permission to divide | PIK3CA (constitutively activates PI3K growth pathway) |
| 2 | Evading growth suppressors | Normal cells have "brakes" (tumour suppressors) that stop division when something is wrong. Cancer disables these brakes. | TP53 (the "guardian of the genome"; mutated in >50% of all cancers) |
| 3 | Resisting cell death (apoptosis) | Damaged cells normally self-destruct via programmed cell death. Cancer cells disable the self-destruct mechanism. | BCL2 family (anti-apoptotic proteins often overexpressed in cancer) |
| 4 | Enabling replicative immortality | Normal cells can only divide ~50–70 times (Hayflick limit) before their telomeres shorten critically. Cancer cells activate telomerase to maintain telomeres indefinitely. | TERT (telomerase reverse transcriptase) |
| 5 | Inducing angiogenesis | Tumours beyond ~1 mm³ need their own blood supply. Cancer cells secrete signals that recruit new blood vessels. | VEGF signalling pathway |
| 6 | Activating invasion & metastasis | Cancer cells break free from their tissue of origin, invade surrounding structures, and colonise distant organs. | CDH1/E-cadherin (loss enables cells to detach); MMP11 (degrades extracellular matrix) |
| 7 | Deregulating cellular energetics | Cancer cells reprogram their metabolism to fuel rapid growth, even using less efficient energy pathways (Warburg effect: aerobic glycolysis). | Metabolic genes in ML signatures |
| 8 | Avoiding immune destruction | The immune system normally detects and kills abnormal cells. Cancer learns to evade or suppress immune responses. | PD-L1 expression, MHC class I downregulation |

Emerging hallmarks (Hanahan, 2022): unlocking phenotypic plasticity, non-mutational epigenetic reprogramming, polymorphic microbiomes, and senescent cells.

Connection to this project: The ML models learn to classify tumour vs normal tissue by detecting expression changes across all these hallmarks simultaneously. The gene signatures are enriched for genes involved in proliferation (hallmark 1), growth suppression evasion (hallmark 2), and invasion (hallmark 6), precisely because these programmes are most dramatically altered in cancer.

3.7 The Tumour Microenvironment

A tumour is not just cancer cells. The tumour microenvironment (TME) is a complex ecosystem containing:

  • Cancer cells: The malignant cells carrying driver mutations.
  • Cancer-associated fibroblasts (CAFs): Stromal cells recruited by the tumour that produce extracellular matrix and growth factors. Genes like MMP11 and MFAP5 (both in the candidate list) are expressed by CAFs.
  • Immune cells: T cells, macrophages, natural killer cells; some attack the tumour, others are co-opted to support it.
  • Endothelial cells: Form blood vessels feeding the tumour.
  • Extracellular matrix (ECM): The structural scaffold surrounding cells; cancer remodels the ECM to facilitate invasion.

Why this matters for RNA-seq analysis: Bulk RNA-seq (as used in TCGA) measures the average gene expression across ALL cell types in a tissue sample: cancer cells, fibroblasts, immune cells, and stroma mixed together. This means:

  • Some ML-predictive genes may be expressed by cancer cells themselves (intrinsic cancer biology)
  • Others may reflect the tumour microenvironment's response (e.g., immune infiltration markers, CAF genes)
  • The distinction matters for therapeutic targeting but cannot be resolved by bulk RNA-seq alone (single-cell RNA-seq is needed)

3.8 Driver vs Passenger Mutations

Not all somatic mutations in a tumour contribute to cancer:

  • Driver mutations: Confer a selective growth advantage to the tumour cell. They are positively selected and recur across independent tumours. Detectable by somatic dN/dS > 1.
  • Passenger mutations: Neutral "hitchhikers" that happened to be present in the cell when a driver mutation occurred. They accumulate passively and show dN/dS ≈ 1.

3.9 Breast Cancer Biology in Depth

Breast cancer is the most common cancer in women worldwide (~2.3 million new cases/year). Understanding its biology is essential for interpreting this platform's results.

Anatomy and Cell Types

The breast contains mammary glands (lobules) connected by ducts, embedded in fatty and connective tissue. Two key cell types line the ducts:

  • Luminal epithelial cells: Line the inner surface of ducts. Express oestrogen receptor (ER/ESR1) and progesterone receptor (PR/PGR). Most breast cancers arise from these cells (Luminal A/B subtypes).
  • Basal/myoepithelial cells: Form the outer layer of ducts. Express keratins (KRT5, KRT14) and contractile proteins. Basal-like breast cancers resemble these cells.

The molecular subtype of a breast cancer reflects which cell type it most resembles and which signalling programmes are active:

| Subtype | Resembles | Key Receptors | Key Pathways | Treatment |
| --- | --- | --- | --- | --- |
| Luminal A | Luminal cells | ER+, PR+, HER2− | Oestrogen signalling, low proliferation | Endocrine therapy (tamoxifen, aromatase inhibitors) |
| Luminal B | Luminal cells | ER+, PR±, HER2± | Oestrogen signalling + high proliferation | Endocrine therapy + chemotherapy |
| HER2-enriched | Variable | ER−, PR−, HER2+ | ERBB2/HER2 amplification → MAPK/PI3K | Anti-HER2 therapy (trastuzumab) |
| Basal-like | Basal cells | ER−, PR−, HER2− | High proliferation, DNA damage response | Chemotherapy (historically no targeted therapy available) |

Key transcription factors in this project's candidates:

  • GATA3: Master regulator of luminal differentiation. Directly activates ER and luminal keratins. Mutations cluster in zinc finger domains and as frameshifts in the C-terminus, and are present in ~10% of breast cancers. Its somatic dN/dS of 19.8 (99 nonsynonymous mutations, only 2 synonymous) makes it one of the most strongly selected genes on this platform.
  • FOXA1: Pioneer factor that opens compacted chromatin specifically at ER binding sites. Without FOXA1, ER cannot access its target genes. Mutations reprogram which genes ER activates, potentially driving therapy resistance.
  • FOXC1: Marker of basal-like subtype. Promotes epithelial-to-mesenchymal transition (EMT), increasing invasiveness.

Key Signalling Pathways in Breast Cancer

| Pathway | Key Genes | Role in Cancer | Connection to This Project |
| --- | --- | --- | --- |
| PI3K/AKT/mTOR | PIK3CA, AKT1, PTEN, mTOR | Cell growth, survival, metabolism. PIK3CA is mutated in ~36% of breast cancers. | PIK3CA has somatic dN/dS ≈ 17.6; strong positive selection |
| ER signalling | ESR1, FOXA1, GATA3 | Drives luminal gene expression, proliferation in ER+ tumours | FOXA1 and GATA3 are both candidates with extreme somatic dN/dS |
| p53 pathway | TP53, MDM2, CDKN1A | DNA damage response, apoptosis. TP53 mutated in ~37% of BRCA cases (>80% of basal-like) | TP53 has somatic dN/dS ≈ 35.9; the most selected gene |
| Cell adhesion | CDH1, CTNNA1, DSC2 | Cell-cell junctions. CDH1 loss = lobular carcinoma. DSC2 (desmosomal cadherin) is a candidate gene | DSC2 is in the candidate list (germline dN/dS = 0.197, somatic = ∞) |
| WNT signalling | FZD9, DKK4 | Embryonic development, stem cell maintenance. Aberrant activation drives cancer stem cells | FZD9 (Frizzled-9) and DKK4 are both candidates |

3.10 The Selection Paradox: Why This Project Matters

The central finding of this project can be framed as a paradox:

The genes that evolution has tried hardest to protect (low germline dN/dS) are the same genes that cancer most aggressively modifies (high somatic dN/dS).

This is not a contradiction; it reveals cancer's strategy:

  1. Essential genes encode essential proteins. The cell depends on them for fundamental processes: transcription regulation, signal transduction, cell adhesion, DNA repair.

  2. Cancer cannot simply delete these genes. If the cell loses TP53 entirely, it may die from accumulated DNA damage. If it loses CDH1, the tissue may fall apart in ways that don't benefit the tumour.

  3. Instead, cancer acquires specific modifications: gain-of-function mutations in TP53, activating mutations in PIK3CA, truncating mutations in GATA3 that alter (but don't destroy) its transcription factor activity.

  4. These are dependency genes: Cancer cells depend on the modified function of these proteins. Restoring normal function (or selectively targeting the mutant form) could specifically kill cancer cells while sparing normal tissue.

This is why the candidate gene list is not just an academic exercise; it points to potential therapeutic vulnerabilities. If a gene is both essential (conserved) and modified by cancer (somatically selected), it is a strong candidate for drug targeting.


4. Data Sources

| Data Type | Source | Details |
| --- | --- | --- |
| RNA-seq Expression | TCGA via UCSC Xena Browser | RSEM-normalised expected counts, ~20,000 genes, 33 cancer types |
| BRCA Samples | TCGA-BRCA | 1,104 tumours + 114 solid tissue normals = 1,218 samples |
| Somatic Mutations | TCGA-BRCA WES MAF files | ~77,000 coding mutations from tumour-normal pairs |
| Ortholog Sequences | Ensembl BioMart (release 110+) | Human-mouse/rat/dog/zebrafish one-to-one orthologs |
| Protein Identity | Ensembl BioMart | Percent amino acid identity for each ortholog pair |
| CDS Alignments | Ensembl BioMart | Codon-aligned coding sequences for dN/dS calculation |

Data Processing Pipeline

Raw TCGA RNA-seq → Log₂(x+1) transform → Variance filtering (remove bottom 20%)
→ Stratified 80/20 split (seed=42) → Z-score standardisation (per-gene, fitted on the training set)
→ ML training & evaluation → Gene signature extraction → Evolutionary analysis
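The preprocessing steps of this pipeline can be sketched with NumPy. This is an illustration, not the project's code: the `preprocess` function, its parameters, and the synthetic Poisson data are hypothetical, and the stratified split is hand-rolled here rather than taken from whatever library the platform uses. Because z-scoring relies on training-set statistics, the split is performed before standardisation so that no test-set information leaks into the scaling.

```python
import numpy as np


def preprocess(counts, labels, test_fraction=0.2, drop_fraction=0.2, seed=42):
    """Log-transform, variance-filter, stratified split, then z-score.

    counts: samples x genes array of expected counts; labels: 0/1 per sample.
    """
    rng = np.random.default_rng(seed)
    x = np.log2(counts + 1.0)

    # Drop the genes whose variance falls in the bottom `drop_fraction`.
    variances = x.var(axis=0)
    keep = variances > np.quantile(variances, drop_fraction)
    x = x[:, keep]

    # Stratified split: hold out the same fraction of every class.
    test_mask = np.zeros(len(labels), dtype=bool)
    for cls in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == cls))
        n_test = max(1, round(test_fraction * len(idx)))
        test_mask[idx[:n_test]] = True

    # Per-gene z-score using training-set mean and standard deviation only.
    x_train, x_test = x[~test_mask], x[test_mask]
    mean = x_train.mean(axis=0)
    std = x_train.std(axis=0)
    std[std == 0] = 1.0  # guard against constant genes
    return ((x_train - mean) / std, (x_test - mean) / std,
            labels[~test_mask], labels[test_mask], keep)


# Synthetic stand-in for an expression matrix: 100 samples x 50 genes.
demo_rng = np.random.default_rng(0)
demo_counts = demo_rng.poisson(50, size=(100, 50)).astype(float)
demo_labels = np.array([1] * 80 + [0] * 20)
x_tr, x_te, y_tr, y_te, kept = preprocess(demo_counts, demo_labels)
print(x_tr.shape, x_te.shape, int(kept.sum()))
```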

5. Page-by-Page Guide

5.1 Overview Page

URL: cancertranscriptomics.space/ (home page)

The Overview page serves as the entry point and executive summary for the entire project. It presents the core hypothesis, key results, and navigation to detailed analyses.

Stat Cards (Top Row)

Six summary statistics are displayed as coloured cards:

| Card | Value | Colour | What It Means |
| --- | --- | --- | --- |
| BRCA ROC AUC | 0.999 | Blue | The best model achieves near-perfect tumour/normal discrimination in breast cancer. An AUC of 0.999 means if you randomly pick one tumour and one normal sample, the model correctly ranks the tumour higher 99.9% of the time. |
| Multi-cancer Accuracy | 99.8% | Blue | When classifying tumour vs normal across all 33 TCGA cancer types simultaneously, the model achieves 99.8% accuracy, demonstrating that transcriptomic tumour signatures are robust across cancer types. |
| Cross-cancer AUC | 0.998 | Blue | A model trained only on breast cancer data can classify lung adenocarcinoma (LUAD) samples with AUC 0.998, indicating that the learned expression signatures capture cancer biology shared across tissues rather than tissue-specific artefacts. |
| Signature Genes | 132 | Indigo | The union of top-50 genes from each model across all tasks yields 132 unique ML-predictive genes. These form the "gene signature": the minimal set capturing most discriminative information. |
| Under Purifying Selection | 96.4% | Green | Of the 110 signature genes with valid germline dN/dS data, 96.4% have dN/dS < 1, meaning they are under evolutionary constraint. Their proteins have been conserved for ~90 million years. |
| Positively Selected (Somatic) | 4,615 | Red | Across the entire genome, 4,615 genes show statistically significant positive selection in TCGA-BRCA tumours (somatic dN/dS > 1, FDR q < 0.05). |

Classification Tasks Table

The project trains ML models on four distinct classification tasks to test robustness and generalisability:

Task Training Data Test Data What It Tests
Single-cancer (BRCA) BRCA tumour + normal Held-out BRCA Can expression distinguish breast tumour from normal?
Multi-cancer All 33 TCGA tumour types + normals Held-out mix Is the tumour signature universal across cancer types?
Subtype BRCA Luminal A + Basal-like Held-out subtypes Can expression distinguish molecular subtypes?
Cross-cancer (BRCA→LUAD) BRCA only Lung adenocarcinoma (LUAD) Do breast cancer signatures transfer to other organs?

Key Findings Section

Three major findings are presented:

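  1. Germline Conservation: ML-predictive genes have low germline dN/dS (mean 0.234), indicating strong evolutionary constraint; 80% are under strong purifying selection (dN/dS < 0.3). The background mean reported on the Evolution page is 0.203, so any claim of lower dN/dS relative to background rests on the formal H1 tests rather than on the means alone.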

  2. Somatic Positive Selection: Known breast cancer drivers (TP53, PIK3CA, GATA3, CDH1, FOXA1) appear among the top somatically selected genes, validating the method. 163 total candidates across 5 cancer types pass all three filters using the updated thresholds (somatic dN/dS ≥ 1.5, FDR q < 0.05, CI lower bound > 1.0). [UPDATED]

  3. The Selection Paradox: Genes conserved for 90 million years (essential, don't-touch-these-proteins) are the same genes accumulating protein-altering mutations in cancer. Cancer selectively breaks the cell's most critical machinery.

Performance Overview Chart

An interactive Plotly bar chart displays ROC AUC for each model × task combination. This visualisation allows direct comparison of model architectures across classification challenges.

Clickable cards link to each analysis page with a brief description and distinctive icon, guiding users through the logical flow: Models → Signatures → Evolution → Results.


5.2 Models Page

URL: cancertranscriptomics.space/models

This page presents the three ML model architectures and their performance across all classification tasks.

Model Architecture Cards

Three cards describe each model in detail:

Logistic Regression (L2-regularised)
P(tumour) = σ(β₀ + β₁·gene₁ + β₂·gene₂ + ... + βₙ·geneₙ)
  • Architecture: Linear classifier with sigmoid activation
  • Regularisation: L2 penalty (C=1.0), which shrinks coefficients toward zero and prevents any single gene from dominating
  • Feature importance: |βᵢ|, the absolute value of each gene's coefficient. Larger |β| = more discriminative gene
  • Coefficient sign: Positive β = higher expression pushes the prediction toward tumour (typically genes upregulated in tumours); negative β = higher expression pushes the prediction toward normal (typically genes downregulated in tumours)
  • Strengths: Most interpretable model. Each gene gets exactly one number (its coefficient) telling you how much it contributes to classification and in which direction.
  • Biological value: The sign of the coefficient indicates whether the gene tends to be over- or under-expressed in cancer, which is immediately biologically interpretable. Because genes are correlated, the sign is best confirmed against the raw expression difference.
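The formula and the |β| ranking above can be sketched in a few lines. This is a minimal illustration, not the site's code: the gene names, coefficient values and intercept are hypothetical, and expression values are assumed to be z-scored already.

```python
import math

def tumour_probability(beta0, betas, expression):
    """P(tumour) = sigmoid(beta0 + sum_i beta_i * gene_i) for one sample."""
    z = beta0 + sum(betas[g] * expression[g] for g in betas)
    return 1.0 / (1.0 + math.exp(-z))

def rank_by_abs_coefficient(betas, top_n=50):
    """Rank genes by |beta|; the signed value is kept so direction stays visible."""
    ranked = sorted(betas.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return ranked[:top_n]

# Hypothetical coefficients and one hypothetical z-scored sample.
betas = {"GENE_A": 1.2, "GENE_B": -0.8, "GENE_C": 0.1}
sample = {"GENE_A": 2.0, "GENE_B": -1.0, "GENE_C": 0.0}

p = tumour_probability(beta0=-0.5, betas=betas, expression=sample)
top = rank_by_abs_coefficient(betas, top_n=2)
```

GENE_B ranks second despite its negative coefficient, because the ranking uses magnitude while the sign is read separately for direction.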
Random Forest Classifier
  • Architecture: Ensemble of 100–500 decision trees; each tree trained on a random subset of samples and genes
  • Decision logic: Each tree asks binary questions ("Is gene X expression > threshold?") to partition samples. Final prediction = majority vote across all trees.
  • Feature importance (Gini): Mean decrease in Gini impurity when a gene is used for splitting across all trees. Higher = gene is more useful for separating tumour from normal.
  • Strengths: Captures non-linear relationships and gene-gene interactions. Robust to noise and outliers. Does not assume linear separability.
  • Biological value: Can detect cases where a gene is only discriminative in combination with another gene (epistatic interactions in expression space).
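To make the Gini importance concrete, the sketch below computes the impurity decrease for a single split on one gene. A Random Forest averages this quantity over every split that uses the gene, across all trees; the toy expression values and the threshold are illustrative assumptions.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels: 1 - p1^2 - p0^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n
    return 1.0 - p1 ** 2 - (1.0 - p1) ** 2

def gini_decrease(values, labels, threshold):
    """Impurity decrease when samples are split on `value > threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(labels) - weighted

# Toy data for one hypothetical gene: 1 = tumour, 0 = normal.
expression = [0.2, 0.4, 0.5, 1.8, 2.1, 2.5]
labels = [0, 0, 0, 1, 1, 1]
decrease = gini_decrease(expression, labels, threshold=1.0)
```

Here the split separates the classes perfectly, so the decrease equals the parent impurity of 0.5, the maximum possible for a balanced binary node.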
Neural Network (Multi-Layer Perceptron)
Input(13,660) → Dense(512, ReLU, Dropout 0.3) → Dense(256, ReLU, Dropout 0.3)
              → Dense(128, ReLU, Dropout 0.3) → Dense(1, Sigmoid)
  • Architecture: Deep feedforward network with three hidden layers
  • Regularisation: Dropout (p=0.3) at each layer + early stopping on validation loss
  • Training: Adam optimiser, binary cross-entropy loss, batch size 32
  • Feature importance: Mean |W₁ᵢ|, the average absolute weight connecting each input gene to the first hidden layer. Genes with large first-layer weights receive more "neural attention."
  • Strengths: Can learn complex, hierarchical representations. Captures subtle combinatorial patterns across thousands of genes simultaneously.
  • Biological value: May identify complex regulatory networks where the importance of a gene depends on the expression context of many other genes.
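The first-layer importance described above reduces to a row-wise mean of absolute weights. The sketch below assumes the weight matrix is stored with one row per input gene and one column per hidden unit; the 3 × 4 matrix and gene names are toy stand-ins for the 13,660 × 512 first layer.

```python
def first_layer_importance(weights, gene_names):
    """Mean absolute first-layer weight per input gene.

    weights[i][j] is the weight from input gene i to hidden unit j.
    """
    scores = {}
    for gene, row in zip(gene_names, weights):
        scores[gene] = sum(abs(w) for w in row) / len(row)
    return scores

# Toy weights (hypothetical values).
W1 = [
    [0.5, -0.5, 0.25, -0.75],
    [0.0, 0.125, -0.125, 0.0],
    [-1.0, 1.0, -1.0, 1.0],
]
importance = first_layer_importance(W1, ["GENE_A", "GENE_B", "GENE_C"])
ranking = sorted(importance, key=importance.get, reverse=True)
```

Taking absolute values before averaging matters: GENE_C's weights sum to zero but it still ranks first, since large weights of either sign indicate that the network relies on the input.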

Performance Metrics Visualisations

Four grouped bar charts display model performance for each metric:

ROC AUC (Receiver Operating Characteristic, Area Under Curve)

What it measures: The model's ability to rank tumour samples higher than normal samples across all possible classification thresholds.

  • AUC = 1.0: Perfect separation. Every tumour sample is scored higher than every normal sample
  • AUC = 0.5: Random chance. The model is guessing
  • AUC > 0.99: Exceptional discrimination. The model almost never confuses tumour and normal

Why it matters for cancer: ROC AUC is threshold-independent, meaning it evaluates the model's overall discriminative ability regardless of where you set the "call it tumour" cutoff. This is crucial in clinical settings where the optimal threshold depends on the cost of false positives vs false negatives.
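The ranking interpretation of AUC can be checked directly by counting tumour/normal pairs. This is a small sketch with made-up scores; it is quadratic in the number of samples, so it is meant for illustration rather than for large cohorts.

```python
def pairwise_auc(tumour_scores, normal_scores):
    """Fraction of (tumour, normal) pairs in which the tumour scores higher.

    Ties count as half a win, matching the usual rank-based definition.
    """
    wins = 0.0
    for t in tumour_scores:
        for n in normal_scores:
            if t > n:
                wins += 1.0
            elif t == n:
                wins += 0.5
    return wins / (len(tumour_scores) * len(normal_scores))

# Hypothetical model scores.
tumour = [0.9, 0.8, 0.7, 0.4]
normal = [0.6, 0.3, 0.2]
auc = pairwise_auc(tumour, normal)
```

Of the 12 pairs, only one (tumour 0.4 vs normal 0.6) is ranked the wrong way round, giving an AUC of 11/12. No threshold appears anywhere in the calculation, which is why the metric is threshold-independent.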

Results on this site:

Task Best Model ROC AUC
Single-cancer (BRCA) Logistic Regression 0.999
Multi-cancer Logistic Regression 0.9998
Subtype (LumA vs Basal) All models 1.000
Cross-cancer (BRCA→LUAD) Logistic Regression 0.998
External Validation (BRCA) RF & MLP 1.000
Pan-cancer (14 types) Logistic Regression 0.974
Accuracy

What it measures: The proportion of all predictions (tumour + normal) that are correct.

Accuracy = (True Positives + True Negatives) / Total Samples

Cancer context: High accuracy alone can be misleading if classes are imbalanced (e.g., with 90% tumour samples, a model that always says "tumour" gets 90% accuracy). That's why AUC, precision, and recall are reported alongside accuracy.

Precision

What it measures: Of all samples the model calls tumour, what fraction actually are tumour?

Precision = True Positives / (True Positives + False Positives)

Cancer context: High precision means the model rarely calls a normal sample "tumour" (low false positive rate). In a diagnostic context, this means fewer unnecessary biopsies or treatments.

Recall (Sensitivity)

What it measures: Of all actual tumour samples, what fraction does the model correctly identify?

Recall = True Positives / (True Positives + False Negatives)

Cancer context: High recall means the model rarely misses a real tumour (low false negative rate). In a screening context, this is critical: a missed cancer is far more dangerous than a false alarm.
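The three formulas above, applied to the imbalanced case mentioned under Accuracy, look like this. Specificity is included as an extra column (an addition for illustration, not a metric shown on the page) because it is the number that exposes a model that always predicts "tumour".

```python
def confusion_metrics(y_true, y_pred):
    """Accuracy, precision, recall and specificity for binary labels (1 = tumour)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
    }

# 90 tumours, 10 normals, and a trivial model that always predicts tumour.
y_true = [1] * 90 + [0] * 10
always_tumour = [1] * 100
metrics = confusion_metrics(y_true, always_tumour)
```

The trivial model reaches 90% accuracy, 90% precision and 100% recall, yet its specificity is 0 because it never recognises a normal sample. With a tumour-heavy dataset, metrics centred on the tumour class can therefore look healthy, so a ranking metric such as AUC is the safer headline number.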

Complete Results Table

A sortable table shows all metrics for every model × task combination (18 rows total). Users can click column headers to sort by any metric.

Notes on Perfect Scores

The Subtype classification task achieves 100% across all metrics for all three models. This is biologically expected: Luminal A and Basal-like breast cancers have profoundly different transcriptomic profiles driven by entirely different molecular programmes (hormone signalling vs proliferation), making them trivially separable by any competent classifier. Perfect scores here validate the data quality and preprocessing rather than indicating model overfitting.

Similarly, External Validation reaching an AUC of 1.000 for RF and MLP supports the view that the learned signatures capture cancer biology rather than training-set noise.


5.3 Signatures Page

URL: cancertranscriptomics.space/signatures

This page reveals which specific genes each model considers most important for classification, and how genes overlap across models and tasks.

What is a Gene Signature?

A gene signature is the set of genes whose expression levels most strongly contribute to a model's classification decisions. For each model and task:

  1. All genes are ranked by importance (model-specific metric)
  2. The top 50 genes form that model's "signature" for that task
  3. The union across all models and tasks gives the complete signature set

The project identifies 132 unique signature genes across all models and tasks.
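The three steps above amount to a top-N selection per model and task followed by a set union. A minimal sketch, using toy importance scores and N=2 instead of 50 so the result is easy to follow; the dictionary layout keyed by (model, task) is an assumption for illustration.

```python
def top_genes(importance, n=50):
    """Names of the n genes with the highest importance score."""
    ranked = sorted(importance, key=importance.get, reverse=True)
    return set(ranked[:n])

def signature_union(importances_by_model_task, n=50):
    """Union of the top-n genes over every (model, task) combination."""
    union = set()
    for scores in importances_by_model_task.values():
        union |= top_genes(scores, n)
    return union

# Toy scores for hypothetical genes A-D.
scores = {
    ("LR", "BRCA"): {"A": 0.9, "B": 0.5, "C": 0.1},
    ("RF", "BRCA"): {"A": 0.4, "B": 0.1, "C": 0.3},
    ("MLP", "BRCA"): {"A": 0.2, "B": 0.3, "D": 0.6},
}
signature = signature_union(scores, n=2)
```

Because the same gene can appear in several top lists, the union is smaller than the sum of the list sizes, which is how many top-50 lists can collapse to 132 unique genes.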

Task Selector

A dropdown allows switching between classification tasks:
  • BRCA: Single-cancer tumour vs normal
  • Unified: Combined importance across tasks
  • Multi-cancer (Pan-cancer): Pan-cancer discrimination
  • Subtype: Luminal A vs Basal-like
  • Cross-cancer (BRCA → LUAD): BRCA-trained, LUAD-tested

Model Selector

A toggle group switches between feature importance methods:
  • RF: Random Forest Gini importance
  • MLP: Neural network first-layer weights
  • LR: Logistic Regression absolute coefficients

Feature Importance Bar Chart

An interactive horizontal bar chart shows the top 30 genes ranked by importance for the selected task and model. Each bar's length represents the gene's importance score.

How to read it:
  • Longer bars = more important for classification
  • Genes at the top are the most discriminative
  • Hover over any bar to see the exact importance value
  • Compare across models (using the model selector) to see which genes are consistently important

Notable genes that frequently appear:
  • MMP11 (Matrix Metalloproteinase 11): Extracellular matrix degradation enzyme upregulated in invasive cancers. Important across multiple tasks.
  • FAM13A: GTPase-activating protein involved in metabolic regulation; frequently altered in cancers.
  • PPP1R12B: Protein phosphatase regulatory subunit involved in smooth muscle contraction and cytoskeletal regulation.
  • TMEM220: Transmembrane protein implicated as a tumour suppressor in digestive tract cancers.

Gene Ranking Table

A complete sortable table listing all genes in the selected signature with columns:
  • Rank: Position in the importance ranking
  • Gene: Official gene symbol (HGNC)
  • Importance: The model-specific importance score

Gene Categories Pie Chart

An interactive pie chart shows how signature genes distribute across functional categories:

Category Count Definition Biological Meaning
Shared Multi-cancer 18 Important in both BRCA-specific and pan-cancer tasks These genes participate in universal cancer biology: processes dysregulated across many cancer types (proliferation, apoptosis evasion)
Subtype-specific 50 Important primarily for distinguishing Luminal A vs Basal-like These genes reflect hormone receptor signalling (ESR1, FOXA1) and proliferation programmes that differ between subtypes
Cross-cancer Stable 18 Genes whose importance transfers from BRCA to LUAD Tissue-independent cancer markers that reflect shared tumourigenic mechanisms
Multi-cancer Only 32 Important only in multi-cancer classification Pan-cancer specific signals
Subtype Only 50 Important only in subtype classification Subtype-specific biology (overlaps heavily with subtype_specific)
Cross-cancer Only 32 Important only in cross-cancer task Cross-tissue transferable signals

5.4 Evolution Page

URL: cancertranscriptomics.space/evolution

This is the most scientifically rich page on the website. It presents the evolutionary analysis of ML-identified signature genes across two complementary dimensions: germline conservation and somatic selection.

The page is divided into two columns reflecting the two evolutionary timescales.

Left Column: Germline Conservation (Green Theme)

Stat Cards
Stat Value Meaning
Predictive Genes Analysed 110 (of 132 with valid dN/dS) Number of ML-signature genes for which human-mouse ortholog dN/dS could be computed
Mean Germline dN/dS (Predictive) 0.234 Average evolutionary rate, well below 1.0, indicating pervasive purifying selection
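Mean Germline dN/dS (Background) 0.203 Random genome-wide genes, also under purifying selection. This mean is slightly lower than the predictive-gene mean, so relative constraint is judged from the distribution-level tests (H1) rather than from the means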
% Under Purifying Selection 96.4% Fraction of signature genes with dN/dS < 1.0
% Under Strong Purifying Selection 80.0% Fraction with dN/dS < 0.3, encoding highly constrained, essential proteins
dN/dS Distribution Violin Plot

This visualisation shows the distribution of germline dN/dS values for ML-predictive genes vs random background genes.

How to read it:
  • The x-axis shows the dN/dS value (lower = more conserved)
  • Each "violin" shows the density of genes at each dN/dS value
  • The wider the violin at a particular dN/dS value, the more genes have that value
  • The median line shows the central tendency

Key observation: The predictive-gene violin is shifted left (toward lower dN/dS) compared to background, meaning ML-identified genes tend to be more conserved than random genes. The bulk of predictive genes cluster below dN/dS = 0.3, indicating strong evolutionary constraint.

Multi-Species Protein Identity Bar Chart

This chart compares average protein sequence identity (%) between ML-predictive genes and background genes across four species at different evolutionary distances:

Species Divergence Time Predictive Mean %ID Background Mean %ID
Mouse ~90 MYA 81.8% 82.4%
Rat ~90 MYA 81.4% 81.8%
Dog ~96 MYA 83.9% 84.4%
Zebrafish ~435 MYA 53.0% 57.4%

Biological interpretation: Higher protein identity means the protein sequence has been more conserved across the species split. The comparison between predictive and background genes at each evolutionary distance reveals whether ML-identified genes are under stronger constraint.

Subtype-specific genes show the highest protein identity across mouse (86.1%), rat (83.9%), and dog (86.6%), suggesting that the transcription factors and signalling molecules distinguishing breast cancer subtypes are among the most ancient and conserved proteins in the mammalian genome.

Germline Gene Table

A sortable table listing every ML-predictive gene with its germline evolutionary data:

Column Description
Gene Official gene symbol
Category ML category (shared_multicancer, subtype_specific, etc.)
Mouse %ID Protein sequence identity with mouse ortholog
Rat %ID Protein sequence identity with rat ortholog
Dog %ID Protein sequence identity with dog ortholog
Zebrafish %ID Protein sequence identity with zebrafish ortholog
dN/dS Germline dN/dS ratio (human vs mouse)
Selection Classification badge: "Strongly Purifying" (< 0.3), "Purifying" (0.3–1.0), "Positive/Neutral" (> 1.0)

Right Column: Somatic Selection (Red Theme)

Stat Cards
Stat Value Meaning
Genes Tested 13,208 Total genes with at least 2 coding somatic mutations in TCGA-BRCA
Somatic dN/dS > 1 8,207 Genes showing more nonsynonymous mutations than expected (potential positive selection)
Significant (FDR < 0.05) 4,615 Genes passing multiple-testing correction at 5% false discovery rate
Significant (FDR < 0.10) 4,697 A slightly more permissive threshold
Germline vs Somatic Scatter Plot (KEY VISUALISATION)

This is arguably the most important chart on the entire website. It plots every gene with both germline and somatic dN/dS data in a 2D space:

  • X-axis: Germline dN/dS (human-mouse evolutionary conservation)
  • Y-axis: Somatic dN/dS (positive selection in TCGA-BRCA tumours), capped at 25 for visualisation (genes with infinite somatic dN/dS are plotted at y=25)
  • Colour: Orange = ML-predictive gene, Blue = background gene

How to read the quadrants:

                       HIGH somatic dN/dS (y > 1)
                                │
   CONSERVED AND                │   Not conserved,
   SELECTED IN CANCER           │   selected in cancer
   ← KEY QUADRANT               │   (rare)
                                │
   ─────────────────────────────┼─────────────────────────────
   Conserved,                   │   Not conserved,
   not selected in cancer       │   not selected
   (housekeeping)               │   (neutral genes)
                                │
                       LOW somatic dN/dS (y ≤ 1)
   LOW germline dN/dS  ←────────┼────────→  HIGH germline dN/dS
  • Bottom-left (low germline, low somatic): Genes that are conserved and not positively selected in cancer: essential housekeeping genes that cancer leaves alone.
  • Top-left (low germline, high somatic): THE MOST INTERESTING QUADRANT. Genes conserved for 90 million years but positively selected in tumours. These are the cancer-dependency candidates.
  • Bottom-right (high germline, low somatic): Genes under little evolutionary constraint and not selected in cancer, likely neutral or tissue-specific.
  • Top-right (high germline, high somatic): Genes under little germline constraint that are nonetheless positively selected in tumours; rapidly evolving, and rare.

What to look for: ML-predictive genes (orange) that appear in the upper-left region β€” conserved AND somatically selected. These are the strongest candidate cancer dependencies.
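The quadrant logic and the germline badge thresholds can be written as two small functions. This is a sketch under stated assumptions: the badge cut-offs (0.3 and 1.0) and the y-axis cap of 25 come from this page, while using 0.3 as the left/right boundary of the scatter plot is an assumption. The GATA3 and POU3F3 values are taken from the tables in this document.

```python
import math

def germline_badge(dnds):
    """Selection badge from the Germline Gene Table thresholds."""
    if dnds < 0.3:
        return "Strongly Purifying"
    if dnds <= 1.0:
        return "Purifying"
    return "Positive/Neutral"

def scatter_position(germline, somatic, germline_cut=0.3, somatic_cut=1.0, y_cap=25.0):
    """Quadrant label and plotted y value (infinite somatic dN/dS is capped)."""
    horizontal = "left" if germline < germline_cut else "right"
    vertical = "top" if somatic > somatic_cut else "bottom"
    return f"{vertical}-{horizontal}", min(somatic, y_cap)

gata3 = scatter_position(0.032, 19.8)        # germline and somatic dN/dS for GATA3
pou3f3 = scatter_position(0.004, math.inf)   # zero synonymous mutations
```

Both genes land in the top-left quadrant; POU3F3 is drawn at y=25 because its somatic dN/dS is infinite.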

Somatic dN/dS Distribution Histogram

A histogram showing the distribution of somatic dN/dS values across all tested genes.

Key features:
  • A large peak near dN/dS ≈ 1 (neutral; most genes are passengers)
  • A long right tail of genes with dN/dS >> 1 (drivers)
  • Known drivers like TP53 (dN/dS ≈ 35.9) and PIK3CA (dN/dS ≈ 17.6) appear in the extreme right tail

Top Somatically Selected Genes Table

A sortable table showing genes with the highest somatic dN/dS, with columns:

Column Description
Gene Gene symbol
Nonsynonymous (N) Count of protein-altering somatic mutations across all TCGA-BRCA samples
Synonymous (S) Count of silent somatic mutations
Somatic dN/dS The ratio (displayed as ∞ when S=0)
95% CI Confidence interval for the dN/dS estimate
FDR q-value Benjamini-Hochberg corrected p-value

Top known drivers in TCGA-BRCA:

Gene N S Somatic dN/dS Role
TP53 ~500+ ~14 ~35.9 Tumour suppressor; disables apoptosis and DNA damage checkpoints. Most mutated gene in human cancer.
PIK3CA ~350+ ~8 ~17.6 Oncogene; activating mutations in the PI3K signalling pathway drive cell growth and survival.
GATA3 99 2 19.8 Transcription factor for luminal breast differentiation; mutations alter luminal gene programmes.
CDH1 ~80+ ~3 ~14.0 E-cadherin; loss drives invasive lobular carcinoma through disrupted cell-cell adhesion.
FOXA1 34 1 13.6 Pioneer transcription factor; opens chromatin for oestrogen receptor binding. Mutations alter ER-driven transcription.
Hypothesis Assessment

Three coloured boxes present the formal hypothesis tests:

  • H1 (Germline Conservation): Tests whether ML-predictive genes have lower germline dN/dS than background. Assessed via permutation test (10,000 permutations) and Mann-Whitney U test.
  • H2 (Somatic Selection): Tests whether signature genes are enriched for somatic dN/dS > 1. Assessed via binomial enrichment.
  • H3 (Dual Pressure): Tests whether genes under both germline constraint AND somatic positive selection are more likely to be ML-predictive. This is the integrative hypothesis.
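The permutation test named under H1 can be sketched as follows. The test statistic (difference in means) and the one-sided direction are assumptions, since the page does not state them, and the dN/dS values are toy numbers rather than project data.

```python
import random

def permutation_p_lower(group, background, n_perm=10_000, seed=42):
    """One-sided p-value for mean(group) being lower than mean(background)."""
    rng = random.Random(seed)
    k = len(group)
    pooled = list(group) + list(background)

    def mean_diff(values):
        return sum(values[:k]) / k - sum(values[k:]) / (len(values) - k)

    observed = mean_diff(pooled)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random relabelling of predictive vs background
        if mean_diff(pooled) <= observed + 1e-12:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)

# Toy dN/dS values (hypothetical).
predictive = [0.05, 0.08, 0.10, 0.12, 0.15]
background = [0.20, 0.25, 0.30, 0.35, 0.40, 0.45]
p_value = permutation_p_lower(predictive, background, n_perm=2000)
```

The p-value is the share of random relabellings that produce a difference at least as extreme as the observed one, so it makes no distributional assumption about dN/dS.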
dN/dS Educational Box

An expandable accordion explains dN/dS for non-specialists with examples, analogies, and interpretation guidelines.


5.5 Results Page

URL: cancertranscriptomics.space/results

The Results page integrates all analyses into a final prioritised list of candidate cancer-dependency genes. It represents the culmination of the entire pipeline.

Three-Step Pipeline Visualisation

Three connected cards illustrate the filtering funnel:

Step 1: ML Signature         Step 2: Germline Filter        Step 3: Somatic Filter
132 genes identified    →    Genes with dN/dS < 0.3     →   Genes with somatic dN/dS > 1
by ML as predictive          (strong purifying selection)     AND FDR q < 0.05
of tumour state              = deeply conserved proteins      = positively selected in tumours

Summary Stat Cards

Card Value Colour Interpretation
Genes Tested ~13,000+ Grey Total genes in the master annotated table
ML Signature 132 Blue Genes identified by ML as discriminative
Conserved (dN/dS < 0.3) 88 Green ML genes under strong germline purifying selection
Somatically Selected Variable Red ML genes with somatic dN/dS > 1 AND FDR < 0.05
Final Candidates 25 Purple Genes passing ALL three filters simultaneously
% Purifying 96.4% Green Fraction of testable signature genes under purifying selection

Filtering Funnel

A visual funnel diagram shows how the gene count reduces at each filtering step:
  • Start: ~20,000 genes in genome
  • After ML: 132 signature genes
  • After germline filter: 88 with dN/dS < 0.3
  • After somatic filter: 163 total candidates (across 5 cancer types; 15 cross-cancer validated) [UPDATED]

Candidates in Germline vs Somatic Space (Scatter Plot)

This scatter plot shows the final candidates for each cancer type. It highlights where each candidate falls in the germline-conservation × somatic-selection space. Updated threshold (April 2026): somatic dN/dS ≥ 1.5 with 95% CI lower bound > 1.0 and FDR q < 0.05. [UPDATED]

  • X-axis: Germline dN/dS (all candidates have values < 0.3 by definition)
  • Y-axis: Somatic dN/dS (all candidates have values ≥ 1.5 by definition; ∞ values plotted at 25) [UPDATED]
  • Hover: Gene name, exact values, mutation counts, ML category

Key observation: Candidates cluster in the extreme upper-left corner: very low germline dN/dS (highly conserved) combined with very high or infinite somatic dN/dS (strongly positively selected). The stricter threshold (dN/dS ≥ 1.5 with CI lower > 1.0) ensures only high-confidence candidates pass through.
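The updated candidate definition combines four numeric conditions with ML-signature membership. The sketch below encodes those thresholds as stated above; the record layout, field names and the four example genes are hypothetical.

```python
def passes_all_filters(gene):
    """ML signature + germline < 0.3 + somatic >= 1.5 + CI lower > 1.0 + FDR q < 0.05."""
    return (
        gene["in_ml_signature"]
        and gene["germline_dnds"] < 0.3
        and gene["somatic_dnds"] >= 1.5
        and gene["ci_lower"] > 1.0
        and gene["fdr_q"] < 0.05
    )

# Hypothetical records, one per failure mode.
genes = [
    {"name": "GENE_A", "in_ml_signature": True, "germline_dnds": 0.05,
     "somatic_dnds": 12.0, "ci_lower": 3.1, "fdr_q": 0.001},
    {"name": "GENE_B", "in_ml_signature": True, "germline_dnds": 0.45,
     "somatic_dnds": 8.0, "ci_lower": 2.0, "fdr_q": 0.001},   # not conserved enough
    {"name": "GENE_C", "in_ml_signature": True, "germline_dnds": 0.10,
     "somatic_dnds": 1.8, "ci_lower": 0.7, "fdr_q": 0.03},    # CI includes 1.0
    {"name": "GENE_D", "in_ml_signature": False, "germline_dnds": 0.02,
     "somatic_dnds": 20.0, "ci_lower": 5.0, "fdr_q": 0.0001}, # not in ML signature
]
candidates = [g["name"] for g in genes if passes_all_filters(g)]
```

GENE_C shows why the confidence-interval condition was added: its point estimate exceeds 1.5, but the interval still includes neutrality.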

Biological Interpretation Section

A text section explaining what the candidate genes mean biologically:
  • These genes encode proteins essential for normal cellular function (evidenced by conservation)
  • Cancer cannot simply delete these genes; it needs their function
  • Instead, cancer modifies them through specific protein-altering mutations
  • This makes them potential therapeutic targets: drugs that restore normal protein function could selectively harm cancer cells

Candidate Gene Table

A sortable multi-column table listing all candidates for the selected cancer type:

Column Description How to Interpret
Gene Official HGNC symbol Clickable β€” opens detail panel
ML Category Which classification task(s) identified this gene Multi-category genes (e.g., "subtype_specific; Subtype") are more robust
Germline dN/dS Human-mouse dN/dS ratio Lower = more conserved. All candidates < 0.3
Mouse %ID Protein identity with mouse ortholog Higher = more conserved protein structure
Somatic dN/dS Somatic selection ratio (∞ if n_syn=0) Higher = stronger positive selection in tumours
Reliability Evidence strength badge "Strong" (≥10 nonsyn + FDR<0.05), "Moderate" (5–9 nonsyn or intermediate FDR), "Weak" (<5 nonsyn or FDR≥0.05)
Nonsyn Count of nonsynonymous somatic mutations in TCGA-BRCA More mutations = more evidence (but also more common genes tend to accumulate more)
Syn Count of synonymous somatic mutations Zero synonymous → infinite dN/dS. Low counts increase uncertainty.
FDR q Benjamini-Hochberg corrected p-value < 0.05 = statistically significant after multiple-testing correction
Priority Score Composite ranking score Higher = stronger candidate across all evidence dimensions
Reliability Categories Explained

The reliability badge reflects confidence in the somatic dN/dS estimate:

Badge Criteria Meaning
🟢 Strong ≥10 nonsynonymous mutations AND FDR q < 0.05 High confidence: enough mutations for a reliable ratio estimate, statistically significant
🟡 Moderate 5–9 nonsynonymous mutations, or borderline FDR Reasonable evidence but wider confidence intervals
🔴 Weak <5 nonsynonymous mutations OR FDR q ≥ 0.05 Low mutation count means the dN/dS estimate is highly uncertain. Infinite dN/dS with only 2 nonsynonymous mutations is suggestive but not conclusive.
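One way to turn the badge criteria into code is shown below. The page does not define "borderline FDR" numerically, so this sketch treats any FDR q ≥ 0.05 as Weak and assigns Moderate only to genes with 5 to 9 nonsynonymous mutations and q < 0.05; that reading is an assumption. The q-values in the example calls are hypothetical.

```python
def reliability_badge(n_nonsyn, fdr_q):
    """Badge from nonsynonymous mutation count and FDR q-value."""
    if n_nonsyn < 5 or fdr_q >= 0.05:
        return "Weak"
    if n_nonsyn >= 10:
        return "Strong"
    return "Moderate"  # 5-9 nonsynonymous mutations with q < 0.05

badges = [
    reliability_badge(99, 0.001),  # many mutations, significant
    reliability_badge(7, 0.01),    # moderate count, significant
    reliability_badge(2, 0.01),    # too few mutations
    reliability_badge(40, 0.20),   # many mutations, not significant
]
```

Checking the Weak conditions first means a gene with many mutations but a non-significant q-value is never promoted above Weak.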
Priority Score

The priority score is a composite ranking that combines multiple evidence dimensions into a single number for prioritisation. It is computed by rank-normalising each component to [0, 1] and summing:

  1. Germline conservation: 1 / (germline dN/dS) (higher score for lower dN/dS = more conserved)
  2. Somatic selection: Somatic dN/dS value (higher score for higher selection)
  3. Expression change: |log₂ fold-change| tumour vs normal (if available)
  4. DepMap dependency: Negative mean CRISPR effect score (if available)

Higher priority score = stronger candidate across all evidence types.
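Rank-normalising and summing can be sketched as below. Only the first two components are used, the handling of ties (shared mean rank) is an assumption, and the scores will not reproduce the published priority values because the ranks here are computed over three genes only. The germline and somatic values are taken from the archived candidate table in this document.

```python
import math

def rank_normalise(values):
    """Map values to [0, 1] by rank (smallest -> 0, largest -> 1); ties share the mean rank."""
    n = len(values)
    if n == 1:
        return [1.0]
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2.0
        i = j + 1
    return [r / (n - 1) for r in ranks]

def priority_scores(genes):
    """Sum of rank-normalised germline conservation and somatic selection."""
    names = list(genes)
    conservation = rank_normalise([1.0 / genes[g]["germline"] for g in names])
    selection = rank_normalise([genes[g]["somatic"] for g in names])
    return {g: c + s for g, c, s in zip(names, conservation, selection)}

genes = {
    "POU3F3": {"germline": 0.004, "somatic": math.inf},
    "GATA3": {"germline": 0.032, "somatic": 19.8},
    "FOXA1": {"germline": 0.059, "somatic": 13.6},
}
scores = priority_scores(genes)
```

Working with ranks rather than raw values lets infinite somatic dN/dS take part in the sum without dominating it: an infinite value simply receives the top rank.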

Gene Detail Panel

Clicking any gene name in the table opens a slide-over panel showing:
  • Gene symbol and full name
  • ML category membership
  • All evolutionary metrics (germline dN/dS, protein identity, somatic dN/dS)
  • Mutation counts and statistical significance
  • A plain-English biological interpretation generated for that gene
  • Links to external databases (NCBI Gene, UniProt, COSMIC)

⚠️ Archive: Previous Pipeline Results (Pre-April 2026)

The following table shows results from an earlier pipeline version using different thresholds (dN/dS > 1.0, raw p-value < 0.05). These have been superseded by the current results above. Retained for reference only.

The 25 Candidate Genes

Here is the complete candidate list with key metrics:

Gene Germline dN/dS Somatic dN/dS N S Priority ML Category
POU3F3 0.004 ∞ 2 0 1.56 Multi-cancer
SPP1 0.028 ∞ 4 0 1.52 Multi-cancer
FZD9 0.032 ∞ 4 0 1.44 subtype_specific
CCDC64 0.061 ∞ 2 0 1.36 Multi-cancer
TFDP2 0.066 ∞ 2 0 1.32 subtype_specific
MSX2 0.080 ∞ 3 0 1.28 subtype_specific
SPDEF 0.080 ∞ 4 0 1.24 subtype_specific
PAMR1 0.080 ∞ 7 0 1.20 Multi-cancer
FOXC1 0.085 ∞ 3 0 1.16 subtype_specific
ZMYND10 0.096 ∞ 4 0 1.12 subtype_specific
THSD4 0.113 ∞ 9 0 1.08 subtype_specific
SERPINE2 0.134 ∞ 2 0 1.04 Cross-cancer
GATA3 0.032 19.8 99 2 1.00 subtype_specific
ILDR2 0.138 ∞ 3 0 1.00 Multi-cancer
SCG2 0.151 ∞ 3 0 0.96 Cross-cancer
CILP2 0.170 ∞ 2 0 0.92 shared_multicancer
MFAP5 0.180 ∞ 4 0 0.88 shared_multicancer
FOXA1 0.059 13.6 34 1 0.88 subtype_specific
DKK4 0.184 ∞ 2 0 0.84 Cross-cancer
MYOC 0.187 ∞ 3 0 0.80 Multi-cancer
DSC2 0.197 ∞ 7 0 0.76 subtype_specific
AADACL2 0.204 ∞ 2 0 0.72 shared_multicancer
B3GNT5 0.232 ∞ 3 0 0.68 subtype_specific
F2RL3 0.244 ∞ 2 0 0.64 Cross-cancer
MMP27 0.254 ∞ 2 0 0.60 Multi-cancer

Caveats and Limitations Section

The Results page includes important caveats:

  1. Infinite somatic dN/dS: Most candidates have ∞ somatic dN/dS (zero synonymous mutations). While this is consistent with positive selection, it could also reflect low mutation counts. The "Reliability" column flags this uncertainty.

  2. Correlation ≠ Causation: ML identifies genes whose expression correlates with tumour state. This does not prove they cause cancer. Evolutionary analysis adds evidence but experimental validation is needed.

  3. BRCA-specific: Somatic analysis is specific to TCGA-BRCA. Candidates may not generalise to other cancer types.

  4. Bulk RNA-seq limitations: Measures average expression across all cells in a sample. Cannot resolve tumour heterogeneity or cell-type-specific effects.

Next Steps Section

Suggests future validation approaches:
  • CRISPR functional screens (DepMap integration)
  • Single-cell RNA-seq to resolve cell-type specificity
  • Pan-cancer somatic dN/dS to test cross-cancer generality
  • Drug target assessment via protein structure analysis


5.6 How It Works Page

URL: cancertranscriptomics.space/how-it-works

This page provides an interactive mind map visualising the entire analysis pipeline using D3.js force-directed graph layout.

Mind Map Structure

The central node connects to seven colour-coded branches:

Branch Colour Content
📥 Data Acquisition Grey (#64748b) RNA-seq expression, somatic mutations, ortholog mappings, clinical metadata
⚙️ Preprocessing Blue (#3b82f6) Log₂(x+1) transform, variance filtering, Z-score standardisation, train/test split
🤖 ML Classification Blue (#3b82f6) Logistic Regression, Random Forest, Neural Network training and evaluation
✍️ Gene Signatures Purple (#6366f1) Feature importance extraction, top-50 selection, cross-model overlap, gene categories
🌿 Germline Conservation Green (#10b981) dN/dS calculation, Nei-Gojobori method, multi-species protein identity
🔴 Somatic Selection Red (#ef4444) Per-gene somatic dN/dS, binomial test, FDR correction, driver detection
🔬 Joint Analysis Purple (#6366f1) Germline × somatic scatter, candidate prioritisation, hypothesis testing

Interactive Features

  • Click any node to see a detail panel with description and bullet points
  • Drag nodes to reposition them
  • Zoom and pan to explore the map
  • Expand/collapse child nodes with +/− indicators
  • Colour-coded links highlight connections when a node is selected

Each node's detail panel provides educational content about that pipeline step, including formulas, parameter choices, and biological rationale.


5.7 Methods Page

URL: cancertranscriptomics.space/methods

The Methods page provides a comprehensive technical reference for all computational methods, organised as expandable accordion sections.

Section 1: Biological Rationale

Explains why each analysis is performed and how the three pillars (ML, germline, somatic) complement each other.

Section 2: Data Sources

Details every dataset used: TCGA RNA-seq, somatic mutation MAF files, Ensembl ortholog data, UCSC Xena browser.

Section 3: Preprocessing

Technical details on each preprocessing step:

  • Log₂(x+1) transform: Compresses dynamic range from 0–100,000+ to 0–17. The pseudocount (+1) avoids log(0), which is undefined. Stabilises variance so high-expression genes don't dominate.
  • Variance filtering: Removes the bottom 20% of genes by variance. These genes show almost no variation across samples and carry little discriminative signal. Typically reduces features from ~20,000 to ~13,660.
  • Z-score standardisation: Per-gene: z = (x − μ) / σ using training-set statistics only (prevents data leakage). After standardisation, all genes have mean=0 and σ=1 on the training set, ensuring no gene dominates by absolute expression level. Critical for L2-regularised LR.
  • Stratified train/test split: 80/20 split preserving the tumour:normal class ratio. Fixed random seed (42) for reproducibility.
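The four preprocessing steps can be sketched with the standard library alone. This is an illustration on a toy matrix, not the project pipeline: the helper names are hypothetical, the variance filter is computed on all samples (an assumption), and the split is done before standardisation so that μ and σ come from training rows only.

```python
import math
import random
from statistics import mean, pstdev, pvariance

def log2_transform(rows):
    """log2(x + 1) on every expression value."""
    return [[math.log2(x + 1.0) for x in row] for row in rows]

def variance_filter(rows, genes, drop_fraction=0.2):
    """Drop the lowest-variance fraction of genes (columns)."""
    variances = [pvariance(col) for col in zip(*rows)]
    n_drop = int(len(genes) * drop_fraction)
    by_variance = sorted(range(len(genes)), key=lambda j: variances[j])
    keep = sorted(by_variance[n_drop:])
    return [[row[j] for j in keep] for row in rows], [genes[j] for j in keep]

def stratified_split(labels, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx) with the class ratio kept in both parts."""
    rng = random.Random(seed)
    train, test = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = max(1, round(len(idx) * test_fraction))
        test.extend(idx[:n_test])
        train.extend(idx[n_test:])
    return sorted(train), sorted(test)

def zscore_fit(train_rows):
    """Per-gene mean and standard deviation from the training rows only."""
    cols = list(zip(*train_rows))
    return [mean(c) for c in cols], [pstdev(c) or 1.0 for c in cols]

def zscore_apply(rows, mu, sd):
    """Standardise rows with previously fitted statistics."""
    return [[(x - m) / s for x, m, s in zip(row, mu, sd)] for row in rows]

# Toy data: 10 samples x 5 hypothetical genes; G1 is constant and gets filtered.
genes = ["G1", "G2", "G3", "G4", "G5"]
counts = [[5] + [(i + 1) * (j + 1) * 10 for j in range(1, 5)] for i in range(10)]
labels = [1] * 8 + [0] * 2  # 1 = tumour, 0 = normal

logged = log2_transform(counts)
filtered, kept = variance_filter(logged, genes)
train_idx, test_idx = stratified_split(labels)
mu, sd = zscore_fit([filtered[i] for i in train_idx])
train_z = zscore_apply([filtered[i] for i in train_idx], mu, sd)
test_z = zscore_apply([filtered[i] for i in test_idx], mu, sd)
```

The test rows are transformed with the training statistics, so their per-gene mean is generally not exactly zero. That is expected and is the price of avoiding leakage.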

Section 4: ML Models

Full hyperparameters and architecture for each model:

  • LR: solver=lbfgs, C=1.0 (L2 regularisation strength), max_iter=1000
  • RF: n_estimators=500, max_features=sqrt, min_samples_leaf=5, class_weight=balanced
  • MLP: layers=[512,256,128], activation=ReLU, dropout=0.3, optimiser=Adam(lr=0.001), epochs=100 with early stopping (patience=10)

Section 5: Evaluation Metrics

Formal definitions of accuracy, precision, recall, ROC AUC, and F1-score with their mathematical formulas.

Section 6: Gene Signature Extraction

How signatures are derived from each model:
  • LR: Rank by |coefficient|, top 50
  • RF: Rank by Gini importance (mean decrease in impurity), top 50
  • MLP: Rank by mean |first-layer weights|, top 50
  • Union across models: ~100–200 unique genes → final set of 132

Section 7: Germline Evolutionary Analysis

Full Nei-Gojobori method with Jukes-Cantor correction (see Section 7 below for complete mathematical description).
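As a brief orientation, the final step of a Nei-Gojobori style calculation is sketched below: the proportions of differing nonsynonymous and synonymous sites are corrected with the Jukes-Cantor formula d = −(3/4)·ln(1 − (4/3)·p) and then divided. Counting the sites and differences from a codon alignment is not shown, and the counts in the example are hypothetical.

```python
import math

def jukes_cantor(p):
    """Jukes-Cantor distance for a proportion p of differing sites."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must be in [0, 0.75) for the correction to be defined")
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p)

def dn_ds(nonsyn_diffs, nonsyn_sites, syn_diffs, syn_sites):
    """dN/dS from difference counts and site counts of one pairwise alignment."""
    dn = jukes_cantor(nonsyn_diffs / nonsyn_sites)
    ds = jukes_cantor(syn_diffs / syn_sites)
    return dn / ds

# Hypothetical counts for one human-mouse ortholog pair.
ratio = dn_ds(nonsyn_diffs=20, nonsyn_sites=700, syn_diffs=60, syn_sites=300)
```

With far more synonymous than nonsynonymous change per site, the ratio comes out well below 1, the signature of purifying selection. The correction matters most for the synonymous term, where multiple substitutions at the same site are more likely.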

Section 8: Somatic dN/dS Analysis

Complete method for per-gene somatic dN/dS:
  • Input: TCGA-BRCA MAF files (filtered to primary tumours, coding variants only)
  • Classification: 9 nonsynonymous variant types + synonymous (Silent)
  • Formula, binomial test, delta method CI, BH-FDR correction
  • Minimum 2 mutations per gene required
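The listed steps can be sketched as follows. The exact formula is not given on this page, so everything here is an assumption: the expected neutral N:S ratio of 2.5 is inferred from the reported values (GATA3: 99/2 → 19.8; FOXA1: 34/1 → 13.6), the binomial null uses p₀ = r/(1 + r), and the confidence interval uses the delta-method variance 1/N + 1/S on the log scale.

```python
import math

EXPECTED_NS_RATIO = 2.5  # assumption inferred from the GATA3 and FOXA1 rows

def somatic_dnds(n_nonsyn, n_syn, r=EXPECTED_NS_RATIO):
    """Observed N:S ratio divided by the expected neutral ratio; inf when S = 0."""
    return math.inf if n_syn == 0 else (n_nonsyn / n_syn) / r

def binomial_p_greater(n_nonsyn, n_syn, r=EXPECTED_NS_RATIO):
    """One-sided P(X >= n_nonsyn) for X ~ Binomial(N + S, r / (1 + r))."""
    n = n_nonsyn + n_syn
    p0 = r / (1.0 + r)
    return sum(math.comb(n, k) * p0 ** k * (1.0 - p0) ** (n - k)
               for k in range(n_nonsyn, n + 1))

def delta_method_ci(n_nonsyn, n_syn, r=EXPECTED_NS_RATIO, z=1.96):
    """95% CI via var(log(N/S)) ~ 1/N + 1/S; undefined when either count is 0."""
    if n_nonsyn == 0 or n_syn == 0:
        return None
    estimate = somatic_dnds(n_nonsyn, n_syn, r)
    se = math.sqrt(1.0 / n_nonsyn + 1.0 / n_syn)
    return estimate * math.exp(-z * se), estimate * math.exp(z * se)

def benjamini_hochberg(p_values):
    """BH-adjusted q-values, returned in the input order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    q = [0.0] * n
    running = 1.0
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running = min(running, p_values[i] * n / rank)
        q[i] = running
    return q

gata3 = somatic_dnds(99, 2)
foxa1 = somatic_dnds(34, 1)
p_gata3 = binomial_p_greater(99, 2)
ci_gata3 = delta_method_ci(99, 2)
q_values = benjamini_hochberg([0.01, 0.04, 0.03, 0.5])
```

The interval for GATA3 is wide despite 99 nonsynonymous mutations, because the variance is dominated by the 1/S term when only 2 synonymous mutations are observed. This is the same effect that makes genes with S = 0 hard to interpret.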

Section 9: Joint Germline-Somatic Analysis

How germline and somatic data are merged, visualised, and interpreted together.

Section 10: Classification Tasks Table

Complete description of all tasks with training/test set sizes.

Section 11: Limitations & Future Directions

Comprehensive list of current limitations and planned improvements.


6. Key Metrics & Statistics Explained

This section provides a reference for every numerical metric displayed on the website.

6.1 Machine Learning Metrics

ROC AUC (Area Under the Receiver Operating Characteristic Curve)

The ROC curve plots True Positive Rate (sensitivity/recall) vs False Positive Rate (1 − specificity) at every possible classification threshold. AUC summarises this curve as a single number.

  • AUC = 1.0: Perfect classifier, with 100% sensitivity at 0% false positive rate for some threshold
  • AUC = 0.5: Random classifier (coin flip)
  • AUC interpretation: The probability that a randomly chosen tumour sample is scored higher than a randomly chosen normal sample

Why AUC is the primary metric here: Unlike accuracy, AUC is largely insensitive to class imbalance (there are ~10× more tumour than normal samples in BRCA). It evaluates the model's ability to rank samples, not just classify them at a fixed threshold.
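
The ranking interpretation of AUC can be checked directly by comparing every tumour score with every normal score. The sketch below is a minimal illustration with made-up scores; counting ties as half a win is a common convention and is assumed here.

```python
def roc_auc(tumour_scores, normal_scores):
    """AUC as P(tumour score > normal score), counting ties as 0.5."""
    wins = 0.0
    for t in tumour_scores:
        for n in normal_scores:
            if t > n:
                wins += 1.0
            elif t == n:
                wins += 0.5
    return wins / (len(tumour_scores) * len(normal_scores))

# Hypothetical model scores: 4 tumour samples, 2 normal samples.
tumour = [0.9, 0.8, 0.7, 0.4]
normal = [0.5, 0.2]
auc = roc_auc(tumour, normal)  # 7 of the 8 tumour/normal pairs are ordered correctly
```

Because only the ordering of pairs enters the calculation, the tumour:normal ratio does not change the expected value of the estimate.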

Accuracy

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Overall fraction of correct predictions. Simple but can be misleading with imbalanced classes.

Precision (Positive Predictive Value)

Precision = TP / (TP + FP)

"When the model says tumour, how often is it right?" High precision = few false alarms.

Recall (Sensitivity, True Positive Rate)

Recall = TP / (TP + FN)

"Of all real tumours, what fraction does the model catch?" High recall = few missed cancers.

F1-Score

F1 = 2 × (Precision × Recall) / (Precision + Recall)

Harmonic mean of precision and recall. Balances the trade-off between the two.
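
The four threshold-based metrics above all derive from the same confusion-matrix counts. The following sketch computes them together; the function name and the example counts are hypothetical, and returning 0.0 when a denominator is zero is an assumed convention.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# A classifier that labels every sample "tumour" on a 100:10 tumour:normal test set.
always_tumour = classification_metrics(tp=100, tn=0, fp=10, fn=0)
```

In this example the accuracy is about 0.91 even though no normal sample is recognised, which is the sense in which accuracy can mislead on imbalanced classes.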

6.2 Feature Importance Metrics

| Model | Importance Metric | Formula | Interpretation |
| --- | --- | --- | --- |
| Logistic Regression | Absolute coefficient | \|βᵢ\| | Direct linear contribution to tumour probability. Sign indicates direction (up/down in tumour). |
| Random Forest | Gini importance | Mean decrease in Gini impurity across all trees when gene i is used for splitting | How much a gene reduces classification uncertainty when used in a decision. |
| MLP | First-layer weight magnitude | mean(\|W₁ᵢ\|) | How much "attention" the network's first layer pays to each gene. Genes with large weights have the most influence on learned representations. |

6.3 Evolutionary Metrics

Germline dN/dS

ω = dN / dS
where:
  dN = nonsynonymous divergence (corrected by Jukes-Cantor)
  dS = synonymous divergence (corrected by Jukes-Cantor)

See Section 7 for complete derivation.

Somatic dN/dS

ω_somatic = (N / S) / ns_ratio
where:
  N = observed nonsynonymous mutations in TCGA-BRCA
  S = observed synonymous mutations
  ns_ratio ≈ 2.5 (genome-wide ratio of nonsynonymous to synonymous sites)

Protein Sequence Identity (%)

%ID = (number of identical amino acids / alignment length) × 100

Computed from Ensembl BioMart pairwise protein alignments between human and each ortholog species.
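
For illustration, the identity formula can be applied to two already-aligned sequences as below. This is a sketch under stated assumptions: the inputs are equal-length aligned strings with `-` for gaps, gap columns count as mismatches, and the toy sequences are invented; the exact convention used in the Ensembl values may differ.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of alignment columns with identical residues (gaps count as mismatches)."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_a:
        raise ValueError("alignment is empty")
    matches = sum(
        1 for a, b in zip(aligned_a, aligned_b) if a == b and a != "-"
    )
    return 100.0 * matches / len(aligned_a)

# Hypothetical aligned fragments: one gap column and one substitution.
identity = percent_identity("MKT-LL", "MKTALV")  # 4 identical columns out of 6
```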

FDR q-value (Benjamini-Hochberg)

When testing thousands of genes simultaneously, some will appear significant by chance. The False Discovery Rate (FDR) controls the expected proportion of false positives among all genes declared significant.

q-value for the gene whose p-value ranks i-th smallest (out of m total):
  q_i = p_i × (m / i)
  • q < 0.05: We expect fewer than 5% of genes called "significant" to be false positives
  • q < 0.01: Fewer than 1% false positives expected

The BH procedure is less conservative than Bonferroni correction (which controls the probability of any false positive), making it more appropriate for genomic-scale analyses where we expect many true positives.

Cohen's d (Effect Size)

d = (mean₁ - mean₂) / pooled_standard_deviation

Measures the magnitude of difference between two groups (e.g., predictive vs background dN/dS) in standard deviation units:

| d Value | Interpretation |
| --- | --- |
| < 0.2 | Negligible effect |
| 0.2–0.5 | Small effect |
| 0.5–0.8 | Medium effect |
| > 0.8 | Large effect |

Cohen's d is reported on the Evolution page for the germline dN/dS comparison between ML-predictive and background genes.
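
A minimal sketch of the Cohen's d calculation follows. It assumes the usual pooled standard deviation built from the two sample variances; the document does not state which pooling formula the site uses, and the dN/dS values are invented for the example.

```python
import math
import statistics

def cohens_d(group1, group2):
    """Cohen's d using the pooled sample standard deviation."""
    n1, n2 = len(group1), len(group2)
    var1 = statistics.variance(group1)  # sample variance (n - 1 denominator)
    var2 = statistics.variance(group2)
    pooled_sd = math.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))
    return (statistics.fmean(group1) - statistics.fmean(group2)) / pooled_sd

# Hypothetical germline dN/dS values.
predictive = [0.05, 0.10, 0.15]
background = [0.15, 0.20, 0.25]
d = cohens_d(predictive, background)  # negative: predictive genes have lower dN/dS
```

The sign only reflects which group is listed first; the table above applies to the absolute value of d.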

Permutation Test P-value

To test whether ML-predictive genes are more conserved than expected by chance:

  1. Calculate the observed mean dN/dS for predictive genes
  2. Randomly shuffle gene labels (predictive/background) 10,000 times
  3. Calculate mean dN/dS for the random "predictive" set each time
  4. p-value = fraction of permutations where random mean ≤ observed mean

This is a non-parametric test that makes no assumptions about the distribution of dN/dS values.
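
The four steps can be sketched in a few lines of standard-library Python. The function name, the seed, and the toy dN/dS values are assumptions for this example; the sketch shuffles the pooled values rather than the labels, which is equivalent.

```python
import random
import statistics

def permutation_p_value(predictive, background, n_perm=10_000, seed=42):
    """One-sided p-value: fraction of random gene sets whose mean is <= the observed mean."""
    rng = random.Random(seed)
    observed = statistics.fmean(predictive)
    pooled = list(predictive) + list(background)
    k = len(predictive)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)                      # random relabelling, group sizes preserved
        if statistics.fmean(pooled[:k]) <= observed:
            hits += 1
    return hits / n_perm

# Hypothetical dN/dS values: predictive genes are clearly lower.
predictive = [0.02, 0.03, 0.05, 0.04]
background = [0.20, 0.30, 0.25, 0.15, 0.40, 0.35, 0.22, 0.28]
p = permutation_p_value(predictive, background)
```

With only 12 genes there are 495 possible "predictive" sets, so the smallest attainable p-value is about 0.002; real analyses with hundreds of genes are limited instead by the number of permutations.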


7. Evolutionary Analysis In Depth

7.1 Germline dN/dS: The Nei-Gojobori Method

The Nei-Gojobori (1986) method is used to compute dN/dS between human and mouse orthologs.

Step 1: Classify Sites

For each codon in the aligned sequences, determine how many of its 9 possible point mutations (3 positions × 3 alternative nucleotides) are synonymous vs nonsynonymous. This gives the number of synonymous sites (S) and nonsynonymous sites (N) per codon. Sum across all codons for total S and N.

Example: The codon TTT (Phe):

  • Position 1: Any change → different amino acid (3 nonsynonymous changes)
  • Position 2: Any change → different amino acid (3 nonsynonymous changes)
  • Position 3: TTC (Phe) = synonymous; TTA (Leu) = nonsynonymous; TTG (Leu) = nonsynonymous
  • Total: 8 of the 9 possible changes are nonsynonymous and 1 is synonymous. Each position counts as one site, split in proportion to its changes, so S = 0 + 0 + 1/3 = 1/3 and N = 3 − 1/3 = 8/3 for this codon
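
The site-counting rule can be written as a short function over the standard genetic code. This is an illustrative sketch, not the pipeline's code: the names are hypothetical, and changes that create a stop codon are counted as nonsynonymous (N = 3 − S), which is an assumption of this example.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order; '*' marks stop codons.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def codon_sites(codon):
    """Return (S, N): synonymous and nonsynonymous sites for one codon."""
    amino_acid = CODON_TABLE[codon]
    syn_sites = 0.0
    for pos in range(3):
        syn_changes = 0
        for base in BASES:
            if base == codon[pos]:
                continue
            mutant = codon[:pos] + base + codon[pos + 1:]
            if CODON_TABLE[mutant] == amino_acid:
                syn_changes += 1
        syn_sites += syn_changes / 3.0   # fraction of synonymous changes at this position
    return syn_sites, 3.0 - syn_sites

s_ttt, n_ttt = codon_sites("TTT")  # only TTT -> TTC is synonymous
```

Summing `codon_sites` over every codon of a gene gives the totals S and N used in the following steps.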

Step 2: Count Differences

Compare human and mouse codons at each aligned position. Classify each difference as synonymous or nonsynonymous: sd (synonymous differences) and nd (nonsynonymous differences).

Step 3: Compute Proportions

pS = sd / S    (proportion of synonymous sites that differ)
pN = nd / N    (proportion of nonsynonymous sites that differ)

Step 4: Jukes-Cantor Correction

Over 90 million years, some sites have mutated multiple times. The Jukes-Cantor model corrects for these "multiple hits" that are invisible in pairwise comparison:

dS = -(3/4) × ln(1 - (4/3) × pS)
dN = -(3/4) × ln(1 - (4/3) × pN)

Why is this necessary? If two species have diverged for a very long time, a site that mutated A→G→C will appear as A→C (one change), hiding the intermediate step. The JC correction estimates the true number of substitutions from the observed proportion of differences. Without it, dN and dS would be systematically underestimated, especially for highly divergent sequences.

Step 5: Compute Ratio

ω = dN / dS
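
Steps 3 to 5 can be combined into a small worked example. The sketch below is illustrative only: the function names and the counts are hypothetical, and returning NaN when the correction is undefined (p ≥ 0.75) or dS is zero is an assumed convention.

```python
import math

def jukes_cantor(p):
    """Jukes-Cantor distance for an observed proportion of differing sites p."""
    if p >= 0.75:
        return math.nan  # logarithm undefined: sequences are saturated
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p)

def germline_omega(sd, nd, S, N):
    """dN/dS from difference counts (sd, nd) and site counts (S, N)."""
    dS = jukes_cantor(sd / S)
    dN = jukes_cantor(nd / N)
    return dN / dS if dS > 0 else math.nan

# Hypothetical gene: 60 of 300 synonymous sites and 14 of 700 nonsynonymous sites differ.
omega = germline_omega(sd=60, nd=14, S=300, N=700)
```

Here pS = 0.20 is corrected upwards to dS of roughly 0.23, while pN = 0.02 barely changes, so the correction matters most for the faster-evolving synonymous sites.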

7.2 Somatic dN/dS Calculation

For each gene, using all coding somatic mutations observed across TCGA-BRCA samples:

Counts

N = number of nonsynonymous somatic mutations
S = number of synonymous somatic mutations

Nonsynonymous includes: Missense, Nonsense, Frame_Shift_Del, Frame_Shift_Ins, Splice_Site, In_Frame_Del, In_Frame_Ins, Nonstop, Translation_Start_Site.

Expected Ratio

Under neutral evolution, the ratio N/S should equal the ratio of nonsynonymous to synonymous sites in the genome:

ns_ratio ≈ 2.5

This comes from the genetic code structure: approximately 71.5% of all possible point mutations in coding sequences are nonsynonymous, giving a ratio of ~2.5:1 nonsynonymous-to-synonymous sites.

dN/dS Formula

ω_somatic = (N / S) / ns_ratio = (N / S) / 2.5

If S = 0 (no synonymous mutations observed), ω = ∞ (infinite).
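
As a worked check of the formula, the following sketch reproduces the somatic dN/dS values quoted for GATA3 and FOXA1 in Section 8. The function name is hypothetical, and returning NaN when a gene has no mutations at all is an assumption of this example.

```python
import math

NS_RATIO = 2.5  # genome-wide nonsynonymous:synonymous site ratio used in this document

def somatic_dnds(n_nonsyn, n_syn, ns_ratio=NS_RATIO):
    """Somatic dN/dS = (N / S) / ns_ratio, with S = 0 mapped to infinity."""
    if n_syn == 0:
        return math.inf if n_nonsyn > 0 else math.nan
    return (n_nonsyn / n_syn) / ns_ratio

gata3 = somatic_dnds(99, 2)   # counts from the GATA3 profile
foxa1 = somatic_dnds(34, 1)   # counts from the FOXA1 profile
pou3f3 = somatic_dnds(2, 0)   # no synonymous mutations observed
```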

Statistical Test

Null hypothesis: N/(N+S) = ns_ratio/(1+ns_ratio) = 2.5/3.5 ≈ 0.714

Test: Exact two-sided binomial test of whether the observed proportion of nonsynonymous mutations differs significantly from 0.714.

p_value = binom_test(N, N+S, p=ns_ratio/(1+ns_ratio), alternative='two-sided')
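
For readers without a statistics library to hand, the exact test can be sketched with the standard library alone. This illustration uses one common two-sided definition (summing the probabilities of all outcomes no more likely than the observed one); the function names and the tie tolerance are assumptions, and the pipeline's own implementation may differ in detail.

```python
import math

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n trials with success probability p."""
    return math.comb(n, k) * p**k * (1.0 - p) ** (n - k)

def binom_test_two_sided(k, n, p):
    """Exact two-sided p-value: total probability of outcomes no more likely than k."""
    threshold = binom_pmf(k, n, p) * (1.0 + 1e-7)  # small tolerance for floating-point ties
    total = sum(
        prob for prob in (binom_pmf(i, n, p) for i in range(n + 1)) if prob <= threshold
    )
    return min(1.0, total)

ns_ratio = 2.5
p0 = ns_ratio / (1.0 + ns_ratio)                    # neutral expectation, about 0.714
p_gata3 = binom_test_two_sided(99, 99 + 2, p0)      # counts from the GATA3 profile
p_small = binom_test_two_sided(2, 2, p0)            # a gene with only 2 mutations
```

The second call shows why genes with very few mutations cannot reach significance: with two mutations, even 2 of 2 nonsynonymous gives a p-value of 1 under this definition.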

Confidence Interval

Using the delta method on log(N/S):

SE = sqrt(1/N + 1/S)
95% CI = exp(log(N/S) ± 1.96 × SE) / ns_ratio

When S = 0, the CI is undefined (logged as NaN).
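
The interval can be computed as in the sketch below, which uses the GATA3 counts from Section 8 as input. The function name is hypothetical; treating N = 0 as undefined as well (because log(0) is undefined) is an assumption beyond what the text states.

```python
import math

def dnds_confidence_interval(n_nonsyn, n_syn, ns_ratio=2.5, z=1.96):
    """Delta-method 95% CI for somatic dN/dS; (nan, nan) when a count is zero."""
    if n_nonsyn == 0 or n_syn == 0:
        return math.nan, math.nan
    log_ratio = math.log(n_nonsyn / n_syn)
    se = math.sqrt(1.0 / n_nonsyn + 1.0 / n_syn)
    lower = math.exp(log_ratio - z * se) / ns_ratio
    upper = math.exp(log_ratio + z * se) / ns_ratio
    return lower, upper

lower, upper = dnds_confidence_interval(99, 2)  # roughly 4.9 to 80
```

The interval is wide because the standard error is dominated by the 1/S term: with only 2 synonymous mutations, the estimate of 19.8 is compatible with a broad range of values, although the lower bound stays well above 1.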

Multiple Testing Correction

q_values = BH_FDR_correction(p_values)

Applied across all ~13,208 genes tested. Controls FDR at α = 0.05.

7.3 Multi-Species Conservation Analysis

Protein identity is measured across four species at different evolutionary distances:

| Species | Divergence (MYA) | Biological Significance |
| --- | --- | --- |
| Mouse | ~90 | Primary comparison: sufficient divergence to measure selection, close enough for reliable ortholog identification |
| Rat | ~90 | Independent replicate of the mouse comparison |
| Dog | ~96 | Slightly more distant; confirms patterns seen in rodents |
| Zebrafish | ~435 | Very distant: only the most ancient, universal functions show high conservation here |

Genes that maintain high protein identity even at the zebrafish comparison (~435 million years) are under the most extreme evolutionary constraint: they perform functions so fundamental that they predate the divergence of fish and mammals.


8. Candidate Gene Profiles

⚠️ Archive: Previous Pipeline Results (Pre-April 2026). The following table shows results from an earlier pipeline version using different thresholds (dN/dS > 1.0, raw p-value < 0.05). These have been superseded by the current results above. Retained for reference only.

Detailed biological profiles of key candidate genes:

POU3F3 (Priority Score: 1.56, Highest)

  • Full name: POU Class 3 Homeobox 3
  • Germline dN/dS: 0.004 (among the most conserved genes in the genome)
  • Mouse protein identity: 99.0%
  • Somatic dN/dS: ∞ (2 nonsynonymous, 0 synonymous)
  • Function: Transcription factor in neural development and cell differentiation
  • Why it matters: A protein 99% identical between human and mouse after 90 million years of evolution is under extraordinarily strong constraint. Its appearance as an ML-predictive gene suggests it plays a previously unrecognised role in cancer transcriptomics. The somatic mutations, while few, are 100% protein-altering.
  • Reliability: Weak (only 2 mutations; more data needed to confirm)
  • Category: Multi-cancer classification gene

GATA3 (Priority Score: 1.00, Strong Evidence)

  • Full name: GATA Binding Protein 3
  • Germline dN/dS: 0.032 (very strongly conserved)
  • Somatic dN/dS: 19.8 (99 nonsynonymous, 2 synonymous)
  • FDR q: 9.4 × 10⁻¹² (extremely significant)
  • Function: Master transcription factor for luminal breast epithelial differentiation. Directly regulates ESR1 (oestrogen receptor) expression. Mutations cluster in the zinc finger DNA-binding domain and C-terminal transactivation domain.
  • Why it matters: GATA3 is the third most frequently mutated gene in breast cancer. Its mutations alter luminal differentiation programmes, and it is strongly conserved across 90 million years of evolution (dN/dS = 0.032 means only 3.2% as many amino acid changes as expected under neutrality). This gene perfectly exemplifies the cancer-dependency hypothesis: an essential transcription factor whose modification drives breast cancer biology.
  • Reliability: Strong (99 nonsynonymous mutations, FDR < 10⁻¹¹)
  • Category: Subtype-specific (Luminal A vs Basal-like discriminator)

FOXA1 (Priority Score: 0.88, Strong Evidence)

  • Full name: Forkhead Box A1
  • Germline dN/dS: 0.059 (very strongly conserved)
  • Somatic dN/dS: 13.6 (34 nonsynonymous, 1 synonymous)
  • FDR q: 6.3 × 10⁻⁴ (highly significant)
  • Function: Pioneer transcription factor that opens chromatin to enable oestrogen receptor binding. Mutations in breast cancer cluster in the forkhead domain and alter ER-dependent gene programmes. FOXA1 mutations are mutually exclusive with GATA3 mutations, suggesting they affect the same pathway.
  • Why it matters: FOXA1 is a "pioneer factor": it physically opens tightly packed chromatin to allow other transcription factors (especially ER) access to their target genes. It is conserved to dN/dS = 0.059 (94.1% of amino acid changes removed by selection). Cancer specifically mutates this chromatin gateway to reprogram gene expression.
  • Reliability: Strong (34 nonsynonymous mutations, FDR < 0.001)
  • Category: Subtype-specific

SPP1 (Priority Score: 1.52, Second Highest)

  • Full name: Secreted Phosphoprotein 1 (Osteopontin)
  • Germline dN/dS: 0.028 (strongly conserved)
  • Somatic dN/dS: ∞ (4 nonsynonymous, 0 synonymous)
  • Function: Extracellular matrix glycoprotein involved in cell adhesion, migration, and immune modulation. Overexpressed in many cancers and promotes metastasis.
  • Why it matters: SPP1/Osteopontin is a well-known metastasis promoter. Its extreme conservation (dN/dS = 0.028) reflects its essential role in tissue remodelling and immune signalling. All four somatic mutations are protein-altering, suggesting selective pressure on its protein function in tumours.
  • Reliability: Weak (only 4 mutations, but all nonsynonymous)

FOXC1 (Priority Score: 1.16)

  • Full name: Forkhead Box C1
  • Germline dN/dS: 0.085 (strongly conserved)
  • Somatic dN/dS: ∞ (3 nonsynonymous, 0 synonymous)
  • Function: Transcription factor critical for mesenchymal differentiation, neural crest development, and vascular formation. Overexpression is associated with basal-like breast cancer and poor prognosis.
  • Why it matters: FOXC1 marks the basal-like subtype, the most aggressive form of breast cancer. Its conservation reflects essential developmental functions. Its role in epithelial-mesenchymal transition (EMT) connects it directly to cancer invasion and metastasis.
  • Category: Subtype-specific

MSX2 (Priority Score: 1.28)

  • Full name: Msh Homeobox 2
  • Germline dN/dS: 0.080 (strongly conserved)
  • Somatic dN/dS: ∞ (3 nonsynonymous, 0 synonymous)
  • Function: Homeobox transcription factor involved in limb and craniofacial development, bone morphogenesis, and mammary gland development.
  • Why it matters: MSX2 plays a role in mammary gland development and has been implicated in breast cancer cell proliferation and apoptosis resistance. Its strong conservation underscores its developmental importance.
  • Category: Subtype-specific

9. Validation Framework [NEW]

The pipeline now incorporates multiple orthogonal validation layers to ensure candidate genes are biologically meaningful and not artefacts of data processing.

9.1 Cross-Cancer Validation

Genes identified as candidates in ≥2 independent cancer types are flagged as cross-cancer validated. This addresses the concern that single-cohort findings may be idiosyncratic.

| Metric | Value |
| --- | --- |
| Total candidates across 5 cancer types | 163 |
| Cross-cancer validated (≥2 types) | 15 genes |
| Cancer types analysed | BRCA, BLCA, PRAD, LUAD, UCEC |

9.2 Kaplan-Meier Survival Analysis

For each candidate gene, patients are stratified into high vs low expression groups (median split), and survival curves are compared using the log-rank test.

  • Library: lifelines (Python)
  • Output: logrank p-value, hazard ratio estimate, median survival times
  • Validation threshold: p < 0.05 (with Bonferroni correction for multiple comparisons)

Genes where high expression correlates with worse survival provide additional evidence for clinical relevance.

9.3 GSEA Pathway Enrichment

Gene Set Enrichment Analysis tests whether candidate genes are enriched in known biological pathways.

  • Databases: Reactome, KEGG, GO Biological Process
  • Library: gseapy
  • Output: Top 5 enriched pathways per cancer type, normalised enrichment scores, FDR q-values

Signatures where no coherent pathway emerges (all q > 0.1) are flagged as "biologically diffuse" for review.

9.4 Multi-Omics Convergence

Cross-validation against orthogonal TCGA data types strengthens confidence in expression-based findings.

| Evidence Layer | Source | Expected Pattern |
| --- | --- | --- |
| Copy Number Variation | TCGA CNV | Amplification/deletion correlates with expression |
| DNA Methylation | TCGA 450K/EPIC array | Promoter hypermethylation inversely correlates with expression |

Convergence Score (0–3):

  • 0 = Expression only
  • 1 = Expression + CNV
  • 2 = Expression + Methylation
  • 3 = Expression + CNV + Methylation (most trusted)

Genes with convergence score ≥ 2 are prioritised in downstream analyses.
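
Because the score encodes two yes/no evidence layers, it can be expressed as a weighted sum: CNV support adds 1 and methylation support adds 2. The sketch below is a minimal illustration; the function and argument names are hypothetical.

```python
def convergence_score(has_cnv_support, has_methylation_support):
    """Score 0-3: CNV support adds 1, methylation support adds 2."""
    return (1 if has_cnv_support else 0) + (2 if has_methylation_support else 0)

def is_prioritised(score):
    """Genes with a convergence score of 2 or more are prioritised."""
    return score >= 2

all_scores = [
    convergence_score(cnv, meth)
    for meth in (False, True)
    for cnv in (False, True)
]  # expression only, +CNV, +methylation, +both
```

One consequence of this scale is that methylation support alone is enough for prioritisation, whereas CNV support alone is not.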

9.5 External Database Cross-Reference

Candidates are automatically annotated against curated cancer gene databases:

| Database | Description | Evidence Value |
| --- | --- | --- |
| COSMIC Cancer Gene Census | 700+ genes with mechanistic evidence for cancer driver roles | Known oncogene/TSG |
| OncoKB | Clinically annotated cancer genes with therapeutic implications | Actionable mutation data |

Output columns: known_oncogene (bool), evidence_source ('COSMIC', 'OncoKB', 'both', 'novel')
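
A sketch of how the two output columns could be derived from membership in each gene list is shown below. The function name and the placeholder gene symbols are hypothetical; real gene lists would come from the COSMIC and OncoKB downloads, which this example does not access.

```python
def annotate_candidate(gene, cosmic_genes, oncokb_genes):
    """Build the known_oncogene and evidence_source columns for one gene."""
    in_cosmic = gene in cosmic_genes
    in_oncokb = gene in oncokb_genes
    if in_cosmic and in_oncokb:
        source = "both"
    elif in_cosmic:
        source = "COSMIC"
    elif in_oncokb:
        source = "OncoKB"
    else:
        source = "novel"
    return {
        "gene": gene,
        "known_oncogene": in_cosmic or in_oncokb,
        "evidence_source": source,
    }

# Placeholder gene sets for illustration only (not real database contents).
cosmic = {"GENE_A", "GENE_B"}
oncokb = {"GENE_B", "GENE_C"}
rows = [annotate_candidate(g, cosmic, oncokb) for g in ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]]
```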

9.6 Updated Filtering Thresholds [April 2026]

| Parameter | Old Value | New Value | Rationale |
| --- | --- | --- | --- |
| Somatic dN/dS threshold | > 1.0 | ≥ 1.5 | Reduces false positives from neutral drift |
| CI requirement | None | Lower bound > 1.0 | Ensures statistical confidence |
| FDR threshold | < 0.05 | < 0.05 (unchanged) | Standard significance level |
| Germline dN/dS | < 0.3 | < 0.3 (unchanged) | Strong purifying selection |
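
The updated thresholds can be read as a single conjunction of four conditions, sketched below. The function and argument names are hypothetical. How the pipeline treats genes whose CI is undefined (S = 0) is not stated here; this sketch simply rejects them, because any comparison with NaN is false.

```python
import math

def passes_filters(somatic_dnds, ci_lower, fdr_q, germline_dnds):
    """Apply the April 2026 thresholds from the table to one gene."""
    return (
        somatic_dnds >= 1.5      # somatic dN/dS threshold
        and ci_lower > 1.0       # lower CI bound must exclude neutrality
        and fdr_q < 0.05         # BH-FDR significance
        and germline_dnds < 0.3  # strong germline purifying selection
    )

# GATA3-like values from Section 8; the CI lower bound (about 4.9) is an illustrative input.
strong = passes_filters(19.8, 4.9, 9.4e-12, 0.032)
weak = passes_filters(1.2, 1.1, 0.01, 0.10)            # fails the new dN/dS threshold
undefined_ci = passes_filters(math.inf, math.nan, 0.01, 0.004)
```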

10. Statistical Methods Reference

10.1 Permutation Testing

Used to test whether ML-predictive genes have significantly different dN/dS from random genes.

Procedure:

  1. Observe the true difference in mean dN/dS between predictive (n=110) and background (n=465) gene sets
  2. Randomly reassign "predictive" and "background" labels 10,000 times (preserving group sizes)
  3. Compute the difference in means for each permutation
  4. p-value = (number of permutations with difference ≥ observed) / 10,000

Advantage: Makes no assumptions about the distribution of dN/dS values (non-parametric). Robust to outliers and skewed distributions.

10.2 Mann-Whitney U Test

A non-parametric test comparing the ranks (not values) of two groups. Used as a complement to the permutation test for comparing dN/dS distributions.

Null hypothesis: The probability that a randomly chosen predictive gene has lower dN/dS than a randomly chosen background gene equals 50%.

10.3 Binomial Exact Test (Somatic dN/dS)

Tests whether the proportion of nonsynonymous mutations for a gene deviates from the neutral expectation.

H₀: P(nonsynonymous) = ns_ratio / (1 + ns_ratio) ≈ 0.714
H₁: P(nonsynonymous) ≠ 0.714  (two-sided)

Test statistic: N successes in (N + S) trials
Distribution: Binomial(N+S, 0.714) under H₀

10.4 Benjamini-Hochberg FDR Correction

For m genes tested:

  1. Sort p-values: p₍₁₎ ≤ p₍₂₎ ≤ ... ≤ p₍ₘ₎
  2. For gene ranked i: q₍ᵢ₎ = min(p₍ᵢ₎ × m/i, q₍ᵢ₊₁₎)
  3. Starting from the largest, enforce monotonicity: q₍ᵢ₎ = min(q₍ᵢ₎, q₍ᵢ₊₁₎)

Interpretation: At FDR = 0.05, we accept that up to ~5% of genes declared significant may be false positives. With 4,615 significant genes, that corresponds to at most ~231 expected false positives and at least ~4,384 expected true positives.
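
The procedure can be sketched in plain Python as a single pass from the largest p-value to the smallest, which applies the m/i scaling and the monotonicity step together. The function name and the example p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return BH q-values in the same order as the input p-values."""
    m = len(p_values)
    order = sorted(range(m), key=lambda idx: p_values[idx])  # indices by ascending p
    q_values = [0.0] * m
    running_min = 1.0                                        # also caps q-values at 1
    for rank in range(m, 0, -1):                             # largest p-value first
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        q_values[idx] = running_min
    return q_values

# Hypothetical p-values for four genes.
q = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

In this example the third gene's raw value (0.03 × 4/3 = 0.04) ties with the largest, and the two smallest p-values both end at q = 0.02, showing how several genes can share one q-value.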

10.5 Delta Method (Confidence Intervals for dN/dS)

For somatic dN/dS = (N/S)/ns_ratio:

log(N/S) is approximately normal for large N, S
Var(log(N/S)) ≈ 1/N + 1/S
SE = sqrt(1/N + 1/S)

95% CI for dN/dS:
  Lower = exp(log(N/S) - 1.96 × SE) / ns_ratio
  Upper = exp(log(N/S) + 1.96 × SE) / ns_ratio

When S = 0, the CI is undefined because log(N/0) = ∞.


11. Limitations & Caveats

11.1 Data Limitations

  1. Bulk RNA-seq: Measures average gene expression across all cells in a tissue sample. Cannot distinguish tumour cell expression from stromal, immune, or vascular cell contributions. Single-cell RNA-seq would provide finer resolution.

  2. TCGA cohort biases: TCGA samples are predominantly from North American patients and may not represent global genetic diversity. Only treatment-naïve primary tumours are included, so metastatic and treated disease are not captured.

  3. Somatic mutation calling: Depends on the specific variant-calling pipeline used by TCGA. Different pipelines can produce different mutation lists, particularly for indels and low-frequency variants.

11.2 Methodological Limitations

  1. Germline dN/dS assumptions: The Nei-Gojobori method assumes a single substitution rate across the gene. In reality, different protein domains evolve at different rates. The Jukes-Cantor model assumes equal substitution rates among all nucleotides, which is a simplification.

  2. Somatic ns_ratio = 2.5: This genome-wide average may not be accurate for individual genes. Genes with unusual codon usage patterns may have different expected N/S ratios. More sophisticated methods (e.g., dNdScv) account for gene-specific mutational context.

  3. Infinite dN/dS: In the current pipeline, only 4 of 163 candidates have zero synonymous mutations (TP53-PRAD, PTEN-BRCA, KRAS-LUAD, HNRNPD-UCEC), all of which are established cancer drivers. The previous pipeline version (pre-April 2026) had 23/25 candidates with S=0, which was addressed by switching to FDR-based filtering.

  4. Multiple testing at gene level: While FDR correction is applied within the somatic analysis, the overall analysis pipeline involves many choices (which genes, which thresholds, which models) that collectively inflate the risk of finding patterns by chance.

11.3 Interpretation Limitations

  1. Correlation vs causation: ML identifies genes whose expression correlates with tumour state. Some may be consequences of cancer (reactive changes in surrounding tissue) rather than causes.

  2. Germline ≠ somatic function: A gene conserved in the germline is important for the organism, but its role in cancer may be completely different from its normal function. Conservation tells us the protein matters, not how cancer uses it.

  3. Cancer type specificity: Results are primarily driven by TCGA-BRCA data. Candidate genes may not be relevant to other cancer types. Cross-cancer validation is planned but not yet implemented for somatic analysis.


12. Glossary

| Term | Definition |
| --- | --- |
| AUC | Area Under the Curve: summary measure of ROC curve performance (0.5 = random, 1.0 = perfect) |
| Basal-like | Aggressive breast cancer subtype; triple-negative (ER−, PR−, HER2−), high proliferation |
| BH-FDR | Benjamini-Hochberg False Discovery Rate: multiple testing correction method |
| BRCA | Breast invasive carcinoma (TCGA project code) |
| Cancer dependency | A gene whose function cancer cells require for survival; potential drug target |
| CDS | Coding DNA Sequence: the portion of a gene that encodes protein |
| Cohen's d | Effect size measure; difference between group means in standard deviation units |
| Codon | Three-nucleotide unit of DNA/RNA that specifies one amino acid |
| dN | Rate of nonsynonymous substitutions per nonsynonymous site |
| dN/dS (ω) | Ratio of nonsynonymous to synonymous substitution rates; measures selection pressure |
| dS | Rate of synonymous substitutions per synonymous site |
| Driver mutation | Somatic mutation that confers growth advantage to cancer cells |
| Dropout | Neural network regularisation: randomly disables neurons during training to prevent overfitting |
| FDR | False Discovery Rate: expected proportion of false positives among declared significant results |
| Feature importance | How much a gene contributes to ML model predictions |
| Gini importance | Random Forest metric: mean decrease in Gini impurity when a gene is used for tree splitting |
| Germline | Inherited genetic material; germline dN/dS measures selection across species evolution |
| Hallmarks of cancer | Set of biological capabilities acquired by cancer cells (Hanahan & Weinberg) |
| HGNC | HUGO Gene Nomenclature Committee: assigns official gene symbols |
| Jukes-Cantor | Statistical model correcting for unobserved multiple substitutions at the same site |
| L2 regularisation | Penalises large model coefficients to prevent overfitting; shrinks weights toward zero |
| Log₂ fold-change | log₂(tumour expression / normal expression); measures magnitude and direction of expression change |
| LUAD | Lung adenocarcinoma (TCGA project code) |
| Luminal A | Breast cancer subtype; ER+, PR+, HER2−, low proliferation, best prognosis |
| MAF | Mutation Annotation Format: standard file format for somatic mutation data |
| MLP | Multi-Layer Perceptron: feedforward neural network with multiple hidden layers |
| MYA | Million Years Ago: unit for evolutionary divergence time |
| Nei-Gojobori | Method for computing dN/dS from pairwise sequence alignments |
| Nonsynonymous | Mutation that changes the encoded amino acid (protein-altering) |
| ns_ratio | Genome-wide ratio of nonsynonymous to synonymous sites (~2.5) |
| Ortholog | Genes in different species derived from a common ancestral gene |
| PAM50 | 50-gene panel used to classify breast cancer molecular subtypes |
| Passenger mutation | Somatic mutation with no effect on cancer fitness; neutral hitchhiker |
| Permutation test | Non-parametric significance test using random label shuffling |
| Positive selection | Evolutionary process favouring advantageous mutations (dN/dS > 1) |
| Priority score | Composite ranking combining germline conservation, somatic selection, and expression data |
| Purifying selection | Evolutionary process removing harmful mutations (dN/dS < 1) |
| q-value | FDR-adjusted p-value; the smallest FDR at which the result would be declared significant |
| RNA-seq | RNA sequencing: high-throughput measurement of gene expression levels |
| ROC | Receiver Operating Characteristic: curve plotting sensitivity vs false positive rate |
| RSEM | RNA-Seq by Expectation Maximization: method for quantifying gene expression |
| Somatic | Mutations arising in body cells (not inherited); somatic dN/dS measures selection in tumours |
| Stratified split | Train/test partition preserving class proportions |
| Synonymous | Mutation that does NOT change the encoded amino acid (silent/neutral) |
| TCGA | The Cancer Genome Atlas: NIH-funded multi-cancer molecular characterisation project |
| Z-score | Standardised value: (x − mean) / standard deviation; centres data at 0 with unit variance |

This documentation was generated for cancertranscriptomics.space. For questions or feedback, contact Polat Bakır.