Methods — Cancer Transcriptomics ML

1. Data Sources & Preprocessing

RNA-seq HTSeq counts were obtained from the TCGA GDC portal for five cancer types:

Cancer Type	Total	Tumor	Normal	Tumor:Normal Ratio
BRCA (Breast)	1,218	1,104	114	9.7:1
BLCA (Bladder)	426	407	19	21.4:1
PRAD (Prostate)	550	498	52	9.6:1
LUAD (Lung Adenocarcinoma)	576	517	59	8.8:1
UCEC (Uterine)	201	177	24	7.4:1

DESeq2 pre-filtering: Genes were retained only if |log₂FC| > 1.5 with Benjamini–Hochberg adjusted p < 0.05, leaving ~13,660 genes for downstream modelling.
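The filter above can be sketched in a few lines of Python. This is only an illustration: the `DEResult` record and the `prefilter` function are hypothetical names, and the sketch assumes the DESeq2 results have already been exported as (gene, log₂FC, adjusted p) rows, with NA values read in as NaN.

```python
import math
from typing import Iterable, List, NamedTuple


class DEResult(NamedTuple):
    gene: str
    log2fc: float  # log2 fold change, tumor vs normal
    padj: float    # Benjamini-Hochberg adjusted p-value (NaN where DESeq2 gave NA)


def prefilter(results: Iterable[DEResult],
              lfc_min: float = 1.5,
              padj_max: float = 0.05) -> List[str]:
    """Return genes with |log2FC| > lfc_min and adjusted p < padj_max."""
    kept = []
    for r in results:
        if math.isnan(r.log2fc) or math.isnan(r.padj):
            continue  # untestable genes are dropped rather than kept by accident
        if abs(r.log2fc) > lfc_min and r.padj < padj_max:
            kept.append(r.gene)
    return kept


# Toy rows with made-up gene names and values.
toy = [
    DEResult("GENE_A", 2.3, 0.001),         # passes both thresholds
    DEResult("GENE_B", -1.8, 0.02),         # passes (down-regulated)
    DEResult("GENE_C", 1.2, 0.0001),        # fold change too small
    DEResult("GENE_D", 3.0, 0.20),          # not significant
    DEResult("GENE_E", 2.0, float("nan")),  # adjusted p not available
]
selected = prefilter(toy)
print(selected)
```

Both conditions are strict inequalities, matching the thresholds as stated in the text.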

Class balancing: SMOTE oversampling is applied within each CV fold for cancer types with severe class imbalance (PRAD: 9.6:1, BLCA: 21.4:1). For other cancers, class-weight balancing is used instead.
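Oversampling inside each fold means synthetic samples are generated from the training portion only, so the held-out fold never influences them. The source does not say which SMOTE implementation is used; the NumPy sketch below is a simplified stand-in, with hypothetical function names and toy data, meant only to show where the oversampling step sits relative to the split.

```python
import numpy as np


def stratified_folds(y, k, rng):
    """Assign each sample a fold id in 0..k-1, spreading every class evenly."""
    fold = np.empty(len(y), dtype=int)
    for cls in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == cls))
        fold[idx] = np.arange(len(idx)) % k
    return fold


def smote(X_min, n_new, k_neighbors, rng):
    """Create n_new synthetic minority samples by interpolating between a seed
    sample and a random one of its k nearest minority-class neighbours."""
    if len(X_min) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    k = min(k_neighbors, len(X_min) - 1)
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)        # a sample is not its own neighbour
    nn = np.argsort(dist, axis=1)[:, :k]  # k nearest neighbours per sample
    seed = rng.integers(0, len(X_min), size=n_new)
    mate = nn[seed, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))          # position along the connecting segment
    return X_min[seed] + gap * (X_min[mate] - X_min[seed])


rng = np.random.default_rng(42)
X = rng.normal(size=(220, 10))           # toy expression matrix
y = np.array([1] * 200 + [0] * 20)       # 1 = tumor, 0 = normal (toy 10:1 imbalance)
fold = stratified_folds(y, k=5, rng=rng)

train_sizes = []
for f in range(5):
    train = fold != f
    X_tr, y_tr = X[train], y[train]
    X_min = X_tr[y_tr == 0]
    n_new = int((y_tr == 1).sum() - (y_tr == 0).sum())
    X_syn = smote(X_min, n_new, k_neighbors=5, rng=rng)  # training data only
    X_bal = np.vstack([X_tr, X_syn])
    y_bal = np.concatenate([y_tr, np.zeros(n_new, dtype=int)])
    train_sizes.append((int((y_bal == 1).sum()), int((y_bal == 0).sum())))
    # A model would be fitted on (X_bal, y_bal) and scored on X[~train], y[~train].
```

In a real pipeline a maintained implementation would normally replace the hand-written `smote` function; the structure of the loop is the point here.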

Batch correction: ComBat batch correction was applied for PRAD to address adjacent-normal heterogeneity between sequencing batches.

2. ML Models

Three complementary model types are trained per cancer type using 5-fold stratified cross-validation:

Model	Hyperparameters	Feature Importance Method
Logistic Regression (L2)	L2 penalty, C=1.0	|coefficients|
Random Forest	100–500 trees, max_depth=None	Gini importance
MLP Neural Network	Dynamic architecture (see below)	gradient × input saliency

Dynamic MLP architecture:

512 → 256 → 128 neurons when n > 600 samples
256 → 128 neurons when n ≤ 600 samples

BatchNorm1d is applied between consecutive hidden layers.
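The size rule above can be written as a small helper. This is a sketch under assumptions: the function names are hypothetical, the single-logit output layer is inferred from the binary loss, and the source does not say whether n is the full cohort or the training fold. BatchNorm1d and activation layers are left out because their exact placement is not specified.

```python
def hidden_layer_sizes(n_samples: int) -> tuple:
    """Hidden-layer widths: three layers above 600 samples, otherwise two."""
    return (512, 256, 128) if n_samples > 600 else (256, 128)


def linear_shapes(n_features: int, n_samples: int) -> list:
    """(in_features, out_features) for each linear layer, ending in one logit."""
    widths = (n_features, *hidden_layer_sizes(n_samples), 1)
    return list(zip(widths[:-1], widths[1:]))


# Cohort sizes from the data table; 13,660 is the approximate post-filter gene count.
for cohort, n in {"BRCA": 1218, "PRAD": 550}.items():
    print(cohort, linear_shapes(13660, n))
```

With the cohort totals listed in Section 1, only BRCA exceeds 600 samples and would receive the three-layer network.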

FocalLoss (α=0.25, γ=2.0) replaces BCEWithLogitsLoss to focus training on hard-to-classify normal samples, improving specificity for imbalanced cohorts.
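For reference, binary focal loss has the form FL = −α_t (1 − p_t)^γ log(p_t), where p_t is the predicted probability of the true class. The NumPy sketch below is an illustration, not the pipeline's own FocalLoss class. It assumes the common convention in which α weights label 1 and 1 − α weights label 0; which class is coded as 1 in the pipeline is not stated in the source.

```python
import numpy as np


def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Mean binary focal loss computed from raw logits (assumed convention:
    alpha weights label 1, 1 - alpha weights label 0)."""
    logits = np.asarray(logits, dtype=float)
    targets = np.asarray(targets, dtype=float)
    # Flip the sign for label 0 so that sigmoid(signed) is always p_t.
    signed = np.where(targets == 1, logits, -logits)
    log_pt = -np.logaddexp(0.0, -signed)  # stable log(sigmoid(signed))
    pt = np.exp(log_pt)
    alpha_t = np.where(targets == 1, alpha, 1.0 - alpha)
    return float(np.mean(-alpha_t * (1.0 - pt) ** gamma * log_pt))


# A confidently correct prediction contributes far less than a confidently wrong one.
easy = focal_loss([4.0], [1])
hard = focal_loss([-4.0], [1])
print(easy, hard)
```

Setting γ = 0 and α = 0.5 reduces the expression to half the ordinary binary cross-entropy, which is a convenient sanity check.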

3. Gene Signature Extraction

For each cancer type the gene signature is constructed as the union of top-N genes across all three models (LR, RF, MLP).

Pseudogene blacklist filter: Genes annotated as pseudogenes in Ensembl (biotype filtering) are removed before ranking.
Importance renormalisation: After filtering, importance scores are renormalised so they sum to 1.0 within each model.
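The filtering, renormalisation and union steps above can be sketched as follows. The function names, gene names and scores are hypothetical, and the sketch assumes importance scores are non-negative (absolute coefficients, Gini importance, absolute saliency).

```python
from typing import Dict, Set


def model_top_genes(importance: Dict[str, float],
                    pseudogenes: Set[str],
                    top_n: int) -> Dict[str, float]:
    """Drop blacklisted genes, renormalise scores to sum to 1.0, keep the top N."""
    kept = {g: s for g, s in importance.items() if g not in pseudogenes}
    total = sum(kept.values())
    if total <= 0:
        raise ValueError("no positive importance left after filtering")
    norm = {g: s / total for g, s in kept.items()}
    ranked = sorted(norm, key=norm.get, reverse=True)[:top_n]
    return {g: norm[g] for g in ranked}


def signature(per_model: Dict[str, Dict[str, float]],
              pseudogenes: Set[str],
              top_n: int) -> Set[str]:
    """Union of the per-model top-N gene sets."""
    genes: Set[str] = set()
    for importance in per_model.values():
        genes.update(model_top_genes(importance, pseudogenes, top_n))
    return genes


# Toy importance scores for the three model types.
per_model = {
    "LR":  {"GENE_A": 0.5, "GENE_B": 0.3, "PSEUDO_1": 0.2},
    "RF":  {"GENE_C": 0.6, "GENE_B": 0.3, "GENE_A": 0.1},
    "MLP": {"PSEUDO_1": 0.7, "GENE_D": 0.2, "GENE_A": 0.1},
}
sig = signature(per_model, pseudogenes={"PSEUDO_1"}, top_n=1)
print(sorted(sig))
```

Because the union is taken per model, a gene ranked highly by only one model still enters the signature.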

4. Germline dN/dS (Conservation)

Cross-species comparison spanning ~90–400 Myr of divergence is used to quantify purifying selection on protein-coding genes.

Species panel: mouse, rat, dog, cow, opossum, zebrafish.

A weighted mean dN/dS is computed across species, weighted by divergence time. Genes with dN/dS < 0.3 are classified as under purifying selection, indicating they are functionally constrained and likely essential.
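A minimal sketch of the weighted mean follows. It assumes weights directly proportional to divergence time and simply skips species with no ortholog estimate; both choices are assumptions, as are the function names. The divergence times and dN/dS values are illustrative placeholders within the stated ~90–400 Myr range, not values taken from the pipeline.

```python
from typing import Mapping, Optional

PURIFYING_MAX = 0.3  # threshold from the text


def weighted_mean_dnds(dnds: Mapping[str, Optional[float]],
                       divergence_myr: Mapping[str, float]) -> Optional[float]:
    """Divergence-time-weighted mean dN/dS over species with an estimate."""
    pairs = [(v, divergence_myr[s]) for s, v in dnds.items() if v is not None]
    if not pairs:
        return None  # no usable ortholog comparison for this gene
    total_weight = sum(w for _, w in pairs)
    return sum(v * w for v, w in pairs) / total_weight


# Placeholder divergence times in Myr (illustrative only).
divergence_myr = {"mouse": 90, "rat": 90, "dog": 95, "cow": 95,
                  "opossum": 160, "zebrafish": 400}
# Toy per-species dN/dS for one gene; None marks a missing ortholog.
gene_dnds = {"mouse": 0.10, "rat": 0.12, "dog": 0.08, "cow": 0.09,
             "opossum": 0.15, "zebrafish": None}

mean_dnds = weighted_mean_dnds(gene_dnds, divergence_myr)
under_purifying_selection = mean_dnds is not None and mean_dnds < PURIFYING_MAX
print(mean_dnds, under_purifying_selection)
```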

5. Somatic dN/dS (Selection)

Somatic dN/dS is assessed with a binomial exact test that compares the observed number of nonsynonymous mutations with the count expected under neutral evolution, where the expected nonsynonymous proportion is 2.85/(1+2.85) ≈ 0.74. Benjamini–Hochberg FDR correction is applied to genes with dN/dS > 1. This is a simplified approach compared with the dNdScv method (Martincorena et al., 2017), which accounts for gene-specific covariates.
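The test described above can be sketched with the standard library alone. Several details are assumptions rather than statements from the source: the ratio is taken as (nonsyn/syn)/2.85, the test is one-sided (excess of nonsynonymous mutations), and all function names are hypothetical.

```python
import math

NONSYN_TO_SYN = 2.85                           # neutral nonsyn:syn ratio from the text
P_NONSYN = NONSYN_TO_SYN / (1 + NONSYN_TO_SYN)  # about 0.74


def somatic_dnds(n_nonsyn: int, n_syn: int) -> float:
    """Observed nonsyn/syn ratio relative to the neutral expectation."""
    if n_syn == 0:
        return math.inf if n_nonsyn > 0 else math.nan
    return (n_nonsyn / n_syn) / NONSYN_TO_SYN


def binom_pvalue_greater(n_nonsyn: int, n_syn: int, p: float = P_NONSYN) -> float:
    """Exact one-sided P(X >= n_nonsyn) for X ~ Binomial(n_nonsyn + n_syn, p)."""
    n = n_nonsyn + n_syn
    total = 0.0
    for k in range(n_nonsyn, n + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
                   + k * math.log(p) + (n - k) * math.log(1 - p))
        total += math.exp(log_pmf)
    return min(1.0, total)


def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values (q-values), in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    q = [0.0] * m
    running = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running = min(running, pvalues[i] * m / rank)
        q[i] = running
    return q


# Counts for TP53 in PRAD as quoted in the Limitations section: 57 nonsyn, 0 syn.
tp53_ratio = somatic_dnds(57, 0)
tp53_p = binom_pvalue_greater(57, 0)
print(tp53_ratio, tp53_p)
```

Because the p-value depends on the counts and not on the ratio, a gene with zero synonymous mutations still receives a finite, well-defined test result.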

Genes under positive somatic selection must satisfy all three criteria:

dN/dS ≥ 1.5
95% CI lower bound > 1.0
FDR q < 0.05 (TMB-adaptive: < 0.01 for hypermutated cancers)

Threshold	Old Value	New Value	Rationale
dN/dS minimum	1.0	1.5	Reduces false positives from near-neutral genes
CI lower bound	—	> 1.0	Ensures statistical robustness
FDR threshold	0.05	0.05 (0.01 for high-TMB)	TMB-adaptive: stricter threshold for hypermutated cancers (e.g., UCEC)
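The three criteria and the TMB-adaptive threshold above combine into a single predicate. The function name and the `high_tmb` flag are hypothetical; how a cohort is classified as hypermutated is not specified in the source, so the flag is simply passed in.

```python
import math


def is_positively_selected(dnds: float, ci_lower: float, q_value: float,
                           high_tmb: bool = False) -> bool:
    """True only if all three somatic-selection criteria hold."""
    q_max = 0.01 if high_tmb else 0.05  # stricter FDR for hypermutated cohorts
    return dnds >= 1.5 and ci_lower > 1.0 and q_value < q_max


# Same toy gene, evaluated under the standard and the high-TMB threshold.
print(is_positively_selected(2.0, 1.2, 0.03))
print(is_positively_selected(2.0, 1.2, 0.03, high_tmb=True))
```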

6. Integration & Candidate Identification

Candidate cancer dependencies are identified at the intersection of three evidence layers:

ML-predictive: gene appears in the top-N signature
Germline conserved: weighted mean dN/dS < 0.3 across species
Somatic selected: dN/dS ≥ 1.5, 95% CI lower bound > 1.0, FDR q < 0.05 (q < 0.01 for high-TMB cancers)

Cross-cancer validation: Genes appearing in ≥ 2 cancer types receive higher confidence. Priority scoring is based on multi-criteria ranking across all three layers.
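The intersection and the cross-cancer count can be sketched with plain sets. Gene names and set contents are toy values and the function names are hypothetical. The priority score itself is not specified in the source, so only the ≥ 2 cancer-type check is shown.

```python
from collections import Counter
from typing import Dict, Set


def candidates(signature: Set[str], conserved: Set[str], selected: Set[str]) -> Set[str]:
    """Genes supported by all three evidence layers."""
    return set(signature) & set(conserved) & set(selected)


def cross_cancer_counts(per_cancer: Dict[str, Set[str]]) -> Counter:
    """Number of cancer types in which each candidate gene appears."""
    return Counter(g for genes in per_cancer.values() for g in genes)


# Toy evidence layers for two cohorts.
layers = {
    "BRCA": {"signature": {"GENE_A", "GENE_B", "GENE_C"},
             "conserved": {"GENE_A", "GENE_C", "GENE_D"},
             "selected":  {"GENE_A", "GENE_C"}},
    "LUAD": {"signature": {"GENE_A", "GENE_E"},
             "conserved": {"GENE_A", "GENE_E"},
             "selected":  {"GENE_A"}},
}
per_cancer = {cancer: candidates(**layer) for cancer, layer in layers.items()}
counts = cross_cancer_counts(per_cancer)
multi_cancer = sorted(g for g, n in counts.items() if n >= 2)
print(per_cancer, multi_cancer)
```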

7. Statistical Framework

Balanced accuracy — primary classification metric (handles class imbalance by averaging per-class recall).
MCC (Matthews Correlation Coefficient) — single-number measure of binary classification quality that accounts for all four confusion-matrix cells.
Benjamini–Hochberg FDR correction applied to all multiple-testing scenarios (DESeq2, somatic dN/dS).
95% confidence intervals for somatic dN/dS estimates, computed via profile likelihood.
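To see why balanced accuracy and MCC are preferred here, consider a trivial classifier that labels every BLCA sample as tumor (407 tumor, 19 normal, from the table in Section 1). The helpers below are a sketch with hypothetical names, computing both metrics directly from confusion-matrix counts.

```python
import math


def balanced_accuracy(tp: int, fp: int, tn: int, fn: int) -> float:
    """Mean of sensitivity (tumor recall) and specificity (normal recall)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return 0.5 * (sensitivity + specificity)


def mcc(tp: int, fp: int, tn: int, fn: int) -> float:
    """Matthews correlation coefficient; 0.0 when a marginal total is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# "Always tumor" on the BLCA cohort: every normal is a false positive.
tp, fn, tn, fp = 407, 0, 0, 19
accuracy = (tp + tn) / (tp + fp + tn + fn)
print(accuracy, balanced_accuracy(tp, fp, tn, fn), mcc(tp, fp, tn, fn))
```

Plain accuracy is above 0.95 for this uninformative classifier, while balanced accuracy is 0.5 and MCC is 0.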

8. Reproducibility

Random seed = 42 for all stochastic operations (train/test splits, model initialisation, SMOTE).
All thresholds centralised in config.py — no magic numbers in pipeline code.
Results are namespaced by cancer type (e.g. results/TCGA-BRCA/), enabling independent re-runs per cohort.
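One way a centralised config.py could look is shown below. The field names and the `results_dir` helper are hypothetical; the values are the thresholds stated in this document.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Config:
    seed: int = 42                     # shared by splits, model init and SMOTE
    cv_folds: int = 5
    lfc_min: float = 1.5               # DESeq2 |log2FC| threshold
    padj_max: float = 0.05             # DESeq2 adjusted p threshold
    germline_dnds_max: float = 0.3     # purifying-selection cutoff
    somatic_dnds_min: float = 1.5
    somatic_ci_lower_min: float = 1.0
    fdr_max: float = 0.05
    fdr_max_high_tmb: float = 0.01     # stricter FDR for hypermutated cohorts
    focal_alpha: float = 0.25
    focal_gamma: float = 2.0
    results_root: Path = Path("results")

    def results_dir(self, cancer: str) -> Path:
        """Per-cohort output directory, e.g. results/TCGA-BRCA."""
        return self.results_root / f"TCGA-{cancer}"


CONFIG = Config()
print(CONFIG.results_dir("BRCA").as_posix())
```

Freezing the dataclass prevents pipeline code from silently overriding a threshold at run time.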

9. Limitations

Bulk RNA-seq only: the data do not capture single-cell heterogeneity within tumour or stromal compartments.
Limited normal samples for some cancer types (BLCA: 19 normals, UCEC: 24 normals), mitigated by SMOTE (BLCA) or class weighting (UCEC) but not eliminated.
Somatic dN/dS depends on mutation count: low-mutation genes produce wide confidence intervals and may be missed.
Cross-species dN/dS may miss lineage-specific functional constraints that arose after the last common ancestor.
PRAD under-powered: prostate cancer has the lowest TMB in our cohort (median 2 nonsyn/gene), yielding only 1 candidate.
UCEC hypermutation: elevated TMB (median 37 nonsyn/gene) inflates the number of genes reaching statistical significance. TMB-adaptive FDR (q < 0.01) partially addresses this, but the 116 candidates should be interpreted cautiously.
Infinite dN/dS: genes with zero synonymous mutations yield dN/dS = ∞. These are retained when FDR is significant (e.g., TP53 in PRAD: 57 nonsyn, 0 syn), because the statistical test is based on the mutation counts rather than on the ratio.