Methods
Complete methodology for the cancer transcriptomics ML analysis pipeline.
RNA-seq HTSeq counts were obtained from the TCGA GDC portal for five cancer types:
| Cancer Type | Total | Tumor | Normal | Ratio |
|---|---|---|---|---|
| BRCA (Breast) | 1,218 | 1,104 | 114 | 9.7:1 |
| BLCA (Bladder) | 426 | 407 | 19 | 21.4:1 |
| PRAD (Prostate) | 550 | 498 | 52 | 9.6:1 |
| LUAD (Lung Adenocarcinoma) | 576 | 517 | 59 | 8.8:1 |
| UCEC (Uterine) | 201 | 177 | 24 | 7.4:1 |
DESeq2 pre-filtering: Genes were retained only if |log2FC| > 1.5 with Benjamini–Hochberg adjusted p < 0.05, leaving ~13,660 genes for downstream modelling.
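The pre-filter above amounts to a two-condition test per gene. The following is a minimal sketch, assuming the DESeq2 results have already been exported as `(gene, log2FoldChange, padj)` rows; the function name, constants, and example rows are hypothetical and not taken from the pipeline.

```python
import math

LOG2FC_MIN = 1.5  # |log2FC| threshold stated in the text
PADJ_MAX = 0.05   # Benjamini-Hochberg adjusted p-value threshold stated in the text


def passes_prefilter(log2fc, padj):
    """Return True if a gene meets both pre-filter thresholds."""
    # DESeq2 can report NA for padj; treat missing values as failing the filter.
    if padj is None or math.isnan(padj):
        return False
    return abs(log2fc) > LOG2FC_MIN and padj < PADJ_MAX


# Hypothetical rows for illustration only: (gene, log2FoldChange, padj)
rows = [
    ("GENE_A", 2.3, 0.001),
    ("GENE_B", -1.8, 0.04),
    ("GENE_C", 1.2, 0.0001),       # fails the fold-change threshold
    ("GENE_D", 3.0, float("nan")), # missing adjusted p-value
]
kept = [gene for gene, lfc, padj in rows if passes_prefilter(lfc, padj)]
```

Down-regulated genes are kept as well, since the threshold is on the absolute fold change.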
Class balancing: SMOTE oversampling is applied within each CV fold for cancer types with severe class imbalance (PRAD: 9.6:1, BLCA: 21.4:1). For other cancers, class-weight balancing is used instead.
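As an illustration of the class-weight branch, the sketch below assumes the common "balanced" heuristic, `n_samples / (n_classes * class_count)`; the text does not state which weighting formula the pipeline uses, so this is an assumption. The function names and the `SMOTE_COHORTS` set are hypothetical, and SMOTE itself is not implemented here.

```python
# Cohorts the text lists as using SMOTE within each CV fold.
SMOTE_COHORTS = {"PRAD", "BLCA"}


def balancing_strategy(cancer_type):
    """Pick the balancing approach described in the text for a cohort."""
    return "smote_within_fold" if cancer_type in SMOTE_COHORTS else "class_weight"


def balanced_class_weights(n_tumor, n_normal):
    """Assumed 'balanced' heuristic: n_samples / (n_classes * class_count)."""
    n_total = n_tumor + n_normal
    return {
        "tumor": n_total / (2 * n_tumor),
        "normal": n_total / (2 * n_normal),
    }


# BRCA sample counts from the cohort table: 1,104 tumour, 114 normal.
brca_weights = balanced_class_weights(1104, 114)
```

Under this heuristic the minority (normal) class receives roughly ten times the weight of the tumour class for BRCA, mirroring the 9.7:1 ratio.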
Batch correction: ComBat batch correction was applied for PRAD to address adjacent-normal heterogeneity between sequencing batches.
Three complementary model types are trained per cancer type using 5-fold stratified cross-validation:
| Model | Hyperparameters | Feature Importance Method |
|---|---|---|
| Logistic Regression (L2) | L2 penalty, C=1.0 | \|coefficients\| |
| Random Forest | 100–500 trees, max_depth=None | Gini importance |
| MLP Neural Network | Dynamic architecture (see below) | gradient × input saliency |
Dynamic MLP architecture:
- 512 → 256 → 128 neurons when n > 600 samples
- 256 → 128 neurons when n ≤ 600 samples
BatchNorm1d is applied between consecutive hidden layers.
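The size rule above can be sketched as a small helper. This is a framework-free illustration: the function names are hypothetical, the single output unit is an assumption (one logit per sample), and normalisation and activation layers are left out.

```python
LARGE_COHORT_MIN = 600  # cohorts with n > 600 samples get the deeper network


def hidden_sizes(n_samples):
    """Hidden-layer widths chosen by cohort size, as described in the text."""
    if n_samples > LARGE_COHORT_MIN:
        return (512, 256, 128)
    return (256, 128)


def linear_shapes(n_features, n_samples):
    """(in, out) dimensions of each linear layer, ending in one output logit."""
    widths = (n_features, *hidden_sizes(n_samples), 1)  # assumed single logit
    return list(zip(widths[:-1], widths[1:]))


# BRCA (n = 1,218) uses three hidden layers; BLCA (n = 426) uses two.
brca_shapes = linear_shapes(13660, 1218)
blca_shapes = linear_shapes(13660, 426)
```

With the cohort sizes in the table, only BRCA exceeds 600 samples, so the other four cancer types would use the two-layer variant.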
BCEWithLogitsLoss to focus training on hard-to-classify normal samples, improving specificity for imbalanced cohorts.
For each cancer type, the gene signature is constructed as the union of top-N genes across all three models (LR, RF, MLP).
- Pseudogene blacklist filter: Genes annotated as pseudogenes in Ensembl (biotype filtering) are removed before ranking.
- Importance renormalisation: After filtering, importance scores are renormalised so they sum to 1.0 within each model.
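The signature steps above (filter, renormalise, rank, union) can be sketched as follows. This is a minimal illustration assuming each model's importances are available as a gene-to-score dictionary; the function names, gene names, and scores are hypothetical.

```python
def filter_and_renormalise(scores, blacklist):
    """Drop blacklisted genes, then rescale the rest to sum to 1.0."""
    kept = {g: s for g, s in scores.items() if g not in blacklist}
    total = sum(kept.values())
    if total == 0:
        return kept
    return {g: s / total for g, s in kept.items()}


def top_n_genes(scores, n, blacklist):
    """Top-n genes for one model after filtering and renormalisation."""
    normed = filter_and_renormalise(scores, blacklist)
    ranked = sorted(normed, key=normed.get, reverse=True)
    return set(ranked[:n])


def build_signature(model_scores, n, blacklist):
    """Union of per-model top-n gene sets."""
    signature = set()
    for scores in model_scores.values():
        signature |= top_n_genes(scores, n, blacklist)
    return signature


# Hypothetical importances for illustration only.
model_scores = {
    "LR": {"GENE_A": 0.5, "PSEUDO_1": 0.3, "GENE_B": 0.2},
    "RF": {"GENE_C": 0.6, "GENE_A": 0.3, "PSEUDO_1": 0.1},
    "MLP": {"GENE_B": 0.7, "GENE_C": 0.2, "GENE_D": 0.1},
}
signature = build_signature(model_scores, n=1, blacklist={"PSEUDO_1"})
```

Because the result is a union, the signature can hold up to 3 × N genes when the models disagree.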
Cross-species comparison spanning ~90–400 Myr of divergence is used to quantify purifying selection on protein-coding genes.
Species panel: mouse, rat, dog, cow, opossum, zebrafish.
A weighted mean dN/dS is computed across species, weighted by divergence time. Genes with dN/dS < 0.3 are classified as under purifying selection, indicating they are functionally constrained and likely essential.
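A sketch of the weighted mean, assuming each species' weight is directly proportional to its divergence time from human (the text does not give the exact weighting scheme). The divergence values below are rough placeholders within the stated ~90–400 Myr range, not the pipeline's values, and the per-species dN/dS values are hypothetical.

```python
PURIFYING_MAX = 0.3  # dN/dS threshold for purifying selection, from the text


def weighted_mean_dnds(dnds_by_species, divergence_myr):
    """Divergence-time-weighted mean dN/dS; species without a value are skipped."""
    pairs = [
        (value, divergence_myr[species])
        for species, value in dnds_by_species.items()
        if value is not None
    ]
    total_weight = sum(weight for _, weight in pairs)
    if total_weight == 0:
        return None
    return sum(value * weight for value, weight in pairs) / total_weight


def is_under_purifying_selection(mean_dnds):
    return mean_dnds is not None and mean_dnds < PURIFYING_MAX


# Placeholder divergence times (Myr) and hypothetical dN/dS values.
divergence = {"mouse": 90, "opossum": 160, "zebrafish": 400}
gene_dnds = {"mouse": 0.10, "opossum": 0.20, "zebrafish": 0.30}
mean_dnds = weighted_mean_dnds(gene_dnds, divergence)
```

Under this weighting, distant species such as zebrafish dominate the mean, so a gene that is conserved only among mammals may not pass the 0.3 threshold.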
Somatic selection is tested with a binomial exact test comparing the observed number of nonsynonymous mutations to the count expected under neutral evolution (expected nonsynonymous proportion = 2.85/(1+2.85) ≈ 0.74). FDR correction (Benjamini–Hochberg) is applied to genes with dN/dS > 1. This is a simplified approach compared to the dNdScv method (Martincorena et al., 2017), which accounts for gene-specific covariates.
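The test can be sketched with the standard library alone. Two assumptions are made here that the text does not confirm: the point estimate is taken as the observed nonsynonymous:synonymous ratio divided by the neutral ratio of 2.85, and the binomial test is one-sided (excess of nonsynonymous mutations). Function names are hypothetical.

```python
from math import comb, inf, isnan

NEUTRAL_NS_RATIO = 2.85                               # from the text
P_NONSYN = NEUTRAL_NS_RATIO / (1 + NEUTRAL_NS_RATIO)  # about 0.74


def somatic_dnds(n_nonsyn, n_syn):
    """Assumed point estimate: observed N:S ratio over the neutral ratio."""
    if n_syn == 0:
        return inf if n_nonsyn > 0 else float("nan")
    return (n_nonsyn / n_syn) / NEUTRAL_NS_RATIO


def binom_upper_tail(k, n, p=P_NONSYN):
    """Exact P(X >= k) for X ~ Binomial(n, p); fine for small per-gene counts."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from largest to smallest p-value
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Counts quoted in the text for TP53 in PRAD: 57 nonsynonymous, 0 synonymous.
tp53_dnds = somatic_dnds(57, 0)
tp53_p = binom_upper_tail(57, 57)
```

With zero synonymous mutations the point estimate is infinite, but the p-value is still well defined because the test only uses the mutation counts.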
Genes under positive somatic selection must satisfy all three criteria:
- dN/dS ≥ 1.5
- 95% CI lower bound > 1.0
- FDR q < 0.05 (TMB-adaptive: < 0.01 for hypermutated cancers)
| Threshold | Old Value | New Value | Rationale |
|---|---|---|---|
| dN/dS minimum | 1.0 | 1.5 | Reduces false positives from near-neutral genes |
| CI lower bound | — | > 1.0 | Ensures statistical robustness |
| FDR threshold | 0.05 | 0.05 (0.01 for high-TMB) | TMB-adaptive: stricter threshold for hypermutated cancers (e.g., UCEC) |
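The three criteria and the TMB-adaptive threshold combine into a single predicate. The sketch below uses hypothetical function and parameter names; how a cohort is flagged as high-TMB is not specified in the text and is left to the caller.

```python
DNDS_MIN = 1.5            # minimum somatic dN/dS
CI_LOWER_MIN = 1.0        # 95% CI lower bound must exceed this
FDR_Q_MAX = 0.05          # default FDR threshold
FDR_Q_MAX_HIGH_TMB = 0.01 # stricter threshold for hypermutated cancers


def fdr_threshold(high_tmb):
    return FDR_Q_MAX_HIGH_TMB if high_tmb else FDR_Q_MAX


def is_positively_selected(dnds, ci_lower, q_value, high_tmb=False):
    """True only if all three criteria from the text are satisfied."""
    return (
        dnds >= DNDS_MIN
        and ci_lower > CI_LOWER_MIN
        and q_value < fdr_threshold(high_tmb)
    )


# Hypothetical gene: passes in a standard cohort, fails under the stricter
# high-TMB threshold because q = 0.03 is not below 0.01.
standard_call = is_positively_selected(2.0, 1.2, 0.03)
high_tmb_call = is_positively_selected(2.0, 1.2, 0.03, high_tmb=True)
```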
Candidate cancer dependencies are identified at the intersection of three evidence layers:
- ML-predictive — gene appears in the top-N signature
- Germline conserved — dN/dS < 0.3 across species
- Somatic selected — dN/dS ≥ 1.5, CI > 1.0, FDR < 0.05 (FDR < 0.01 for high-TMB cancers)
Cross-cancer validation: Genes appearing in ≥ 2 cancer types receive higher confidence. Priority scoring is based on multi-criteria ranking across all three layers.
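The three-layer intersection and the cross-cancer count reduce to set operations. This is a minimal sketch with hypothetical function names and gene sets; the priority-scoring scheme is not described in enough detail to reproduce, so it is omitted.

```python
from collections import Counter


def candidate_dependencies(ml_signature, germline_conserved, somatic_selected):
    """Genes supported by all three evidence layers."""
    return set(ml_signature) & set(germline_conserved) & set(somatic_selected)


def cross_cancer_support(candidates_by_cancer, min_cancers=2):
    """Genes that are candidates in at least `min_cancers` cancer types."""
    counts = Counter(
        gene for genes in candidates_by_cancer.values() for gene in genes
    )
    return {gene: n for gene, n in counts.items() if n >= min_cancers}


# Hypothetical gene sets for illustration only.
cohort_1 = candidate_dependencies(
    {"GENE_A", "GENE_B", "GENE_C"}, {"GENE_A", "GENE_B"}, {"GENE_A", "GENE_D"}
)
cohort_2 = candidate_dependencies(
    {"GENE_A", "GENE_E"}, {"GENE_A", "GENE_E"}, {"GENE_A", "GENE_E"}
)
shared = cross_cancer_support({"COHORT_1": cohort_1, "COHORT_2": cohort_2})
```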
- Balanced accuracy — primary classification metric (handles class imbalance by averaging per-class recall).
- MCC (Matthews Correlation Coefficient) — single-number measure of binary classification quality that accounts for all four confusion-matrix cells.
- Benjamini–Hochberg FDR correction applied to all multiple-testing scenarios (DESeq2, somatic dN/dS).
- 95% confidence intervals for somatic dN/dS estimates, computed via profile likelihood.
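For reference, balanced accuracy and MCC can both be computed from the four confusion-matrix cells. The sketch below uses the standard definitions; the function names and example counts are hypothetical and are not pipeline results.

```python
from math import sqrt


def balanced_accuracy(tp, fp, tn, fn):
    """Mean of sensitivity (tumour recall) and specificity (normal recall)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient; 0.0 when the denominator is zero."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical imbalanced fold: 100 tumour and 10 normal samples.
counts = dict(tp=90, fn=10, tn=8, fp=2)
fold_bal_acc = balanced_accuracy(**counts)
fold_mcc = mcc(**counts)
plain_accuracy = (counts["tp"] + counts["tn"]) / sum(counts.values())
```

In this example plain accuracy is higher than balanced accuracy because the majority class dominates the unweighted average, which is why balanced accuracy is the better fit for imbalanced cohorts.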
- Random seed = 42 for all stochastic operations (train/test splits, model initialisation, SMOTE).
- All thresholds centralised in `config.py`; no magic numbers in pipeline code.
- Results are namespaced by cancer type (e.g. `results/TCGA-BRCA/`), enabling independent re-runs per cohort.
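A centralised configuration of this kind might look like the sketch below. The values are the ones stated in this document, but every constant name and the `results_dir` helper are hypothetical; the actual contents of `config.py` are not shown here.

```python
from pathlib import Path

# Reproducibility
RANDOM_SEED = 42
CV_FOLDS = 5

# DESeq2 pre-filtering
LOG2FC_MIN = 1.5
PADJ_MAX = 0.05

# MLP architecture switch (deeper network when n > 600 samples)
MLP_LARGE_COHORT_MIN = 600

# Germline and somatic selection thresholds
GERMLINE_DNDS_MAX = 0.3
SOMATIC_DNDS_MIN = 1.5
SOMATIC_CI_LOWER_MIN = 1.0
FDR_Q_MAX = 0.05
FDR_Q_MAX_HIGH_TMB = 0.01

# Output layout
RESULTS_ROOT = Path("results")


def results_dir(project_id):
    """Per-cohort output directory, e.g. results/TCGA-BRCA."""
    return RESULTS_ROOT / project_id
```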
- Bulk RNA-seq only — does not capture single-cell heterogeneity within tumour or stromal compartments.
- Limited normal samples for some cancer types (BLCA: 19 normals, UCEC: 24 normals), mitigated by class balancing (SMOTE or class weighting) but not eliminated.
- Somatic dN/dS depends on mutation count — low-mutation genes produce wide confidence intervals and may be missed.
- Cross-species dN/dS may miss lineage-specific functional constraints that arose after the last common ancestor.
- PRAD under-powered — prostate cancer has the lowest TMB in our cohort (median 2 nonsyn/gene), yielding only 1 candidate (TP53). Adjacent-normal tissue contamination also reduces classifier specificity.
- Near-perfect AUC — AUC ≥ 0.999 for several cancers reflects the fundamental transcriptomic difference between tumour and normal tissue, not overfitting. 5-fold stratified CV with SMOTE applied only within folds prevents data leakage.
- UCEC hypermutation — elevated TMB (median 37 nonsyn/gene) inflates the number of genes reaching statistical significance. TMB-adaptive FDR (q < 0.01) partially addresses this but 116 candidates should be interpreted cautiously.
- Infinite dN/dS — genes with zero synonymous mutations yield dN/dS = ∞. These are retained when FDR is significant (e.g., TP53 in PRAD: 57 nonsyn, 0 syn), as the statistical test accounts for mutation counts.