Machine Learning–Guided Identification of Cancer-Maintaining Gene Dependencies Through Two-Scale Evolutionary Filtering of TCGA Transcriptomes

Figures and Legends

Figure 1

Figure 1. Analytical pipeline overview. RNA-seq counts and MAF files for TCGA cohorts () were obtained via the GDC API. DESeq2 pre-filtering retained genes with |log₂FC| > 1.5 and BH-adjusted p < 0.05. Three classifiers (logistic regression, LR; random forest, RF; multilayer perceptron, MLP) were trained with 5-fold stratified CV. The union of their feature-importance signatures was passed through two evolutionary filters: germline purifying selection (Ensembl Compara dN/dS < 0.3) and somatic positive selection (binomial test, dN/dS ≥ 1.5, FDR < 0.05). Cross-cancer genes were defined as candidates in ≥ 2 cohorts.
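The two evolutionary filters in the legend above can be read as a simple conjunction of thresholds. The following is a minimal sketch of that logic, not the authors' implementation: the `GeneStats` record, its field names, and the function names are hypothetical, and only the three threshold values come from the legend.

```python
from dataclasses import dataclass

# Thresholds quoted in the Figure 1 legend.
GERMLINE_MAX = 0.3  # germline dN/dS must be below this (purifying selection)
SOMATIC_MIN = 1.5   # somatic dN/dS must be at or above this (positive selection)
FDR_MAX = 0.05      # FDR of the somatic binomial test must be below this


@dataclass(frozen=True)
class GeneStats:
    """Hypothetical per-gene record; the field names are illustrative only."""
    symbol: str
    germline_dnds: float
    somatic_dnds: float
    somatic_fdr: float


def passes_two_scale_filter(g: GeneStats) -> bool:
    """True if the gene is conserved in the germline and positively selected somatically."""
    germline_ok = g.germline_dnds < GERMLINE_MAX
    somatic_ok = g.somatic_dnds >= SOMATIC_MIN and g.somatic_fdr < FDR_MAX
    return germline_ok and somatic_ok


def filter_candidates(genes):
    """Return the symbols of genes that pass both filters, in input order."""
    return [g.symbol for g in genes if passes_two_scale_filter(g)]
```

An infinite somatic dN/dS (no synonymous mutations observed) can be passed as `float("inf")` and satisfies the somatic threshold under this sketch.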

Figure 2

Figure 2. TCGA cohort composition and biological context. (A) Sample counts across cancer types. SMOTE oversampling was applied to PRAD and BLCA normal classes within CV folds; ComBat-seq batch correction was used for PRAD adjacent-normal samples. (B) Protein-coding genes retained after DESeq2 filtering (|log₂FC| > 1.5, BH-adjusted p < 0.05) and pseudogene removal. (C) Cohort-level summary statistics. (D) Cancer-type biological context.
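As an illustration of the filtering rule in panel B, the sketch below applies Benjamini-Hochberg adjustment to raw p-values and then keeps genes with |log₂FC| > 1.5 and adjusted p < 0.05. It is a plain-Python approximation under stated assumptions: DESeq2 computes its own adjusted p-values (with independent filtering), so the function names and the input layout here are hypothetical and serve only to make the thresholds concrete.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def prefilter(genes, log2fc, pvalues, lfc_min=1.5, alpha=0.05):
    """Keep genes with |log2FC| > lfc_min and BH-adjusted p < alpha."""
    padj = bh_adjust(pvalues)
    return [
        g for g, lfc, q in zip(genes, log2fc, padj)
        if abs(lfc) > lfc_min and q < alpha
    ]
```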

Figure 3

Figure 3. ML classifier performance across cohorts. (A) MLP classification metrics (5-fold stratified CV). (B) Specificity gain from baseline to optimised MLP (FocalLoss, α = 0.25, γ = 2.0). (C) MLP architecture and sample sizes per cohort.
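For readers unfamiliar with the loss named in panel B, the sketch below evaluates the standard binary focal loss for a single prediction with α = 0.25 and γ = 2.0. It assumes the common formulation in which α weights the positive class and 1 − α the negative class; which class the authors treat as positive, and the framework they used, are not stated in the legend, so the function name and signature are illustrative.

```python
import math


def binary_focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-12):
    """Focal loss for one example.

    p: predicted probability of the positive class.
    y: true label (1 = positive, 0 = negative).
    """
    p = min(max(p, eps), 1.0 - eps)              # guard against log(0)
    p_t = p if y == 1 else 1.0 - p               # probability assigned to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha   # class-balancing weight
    # (1 - p_t) ** gamma shrinks the loss of well-classified examples.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
```

With γ = 0 the modulating factor disappears and the expression reduces to α-weighted cross-entropy, which is a convenient sanity check.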

Figure 4

Figure 4. Confusion matrices and filtering funnels. (A) Normalised confusion matrices per cohort (rows = true class, columns = predicted). PRAD normal-class recall is %; UCEC achieves full specificity with samples. (B) Gene filtering funnel across all five cohorts.
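The row normalisation used in panel A can be sketched as follows. With rows as the true class, each diagonal entry of the normalised matrix is that class's recall; if tumour is taken as the positive class, normal-class recall equals specificity. The function name and the counts in the usage note are made up for illustration and are not values from the study.

```python
def row_normalise(cm):
    """Divide each row of a confusion matrix by its row total.

    cm[i][j] is the number of samples of true class i predicted as class j.
    Rows with no samples are returned as all zeros.
    """
    out = []
    for row in cm:
        total = sum(row)
        out.append([count / total if total else 0.0 for count in row])
    return out
```

For example, with the toy counts `[[8, 2], [5, 95]]` (rows ordered normal, tumour), the first diagonal entry of `row_normalise` gives a normal-class recall of 0.8.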

Figure 5

Figure 5. Candidate gene biology and pathway context. (A) Germline (blue) and somatic (red) dN/dS for BRCA candidates. TP53: germline , somatic ; PTEN: germline , somatic ∞. (B) Pathway grouping of cross-cancer validated genes. (C) Known vs. novel candidate composition per cancer type.

Figure 6

Figure 6. Cross-cancer validated genes. (A) genes identified in ≥ 2 cohorts. Bubble size reflects total non-synonymous mutations; TP53 appears in all five cohorts. (B) Binary co-occurrence matrix (gene × cancer). (C) Germline dN/dS of cross-cancer genes, ranked by conservation.
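The "≥ 2 cohorts" rule and the binary gene × cancer matrix of panel B can be derived from per-cohort candidate lists as sketched below. The input layout (a dictionary from cohort name to candidate symbols) and both function names are assumptions made for illustration.

```python
def cross_cancer_genes(candidates_by_cohort, min_cohorts=2):
    """Genes that are candidates in at least `min_cohorts` cohorts, sorted by symbol."""
    counts = {}
    for genes in candidates_by_cohort.values():
        for g in set(genes):  # count each gene at most once per cohort
            counts[g] = counts.get(g, 0) + 1
    return sorted(g for g, n in counts.items() if n >= min_cohorts)


def cooccurrence_matrix(candidates_by_cohort, genes):
    """Binary gene x cohort matrix: 1 if the gene is a candidate in that cohort."""
    cohorts = sorted(candidates_by_cohort)
    members = {c: set(candidates_by_cohort[c]) for c in cohorts}
    matrix = [[1 if g in members[c] else 0 for c in cohorts] for g in genes]
    return cohorts, matrix
```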

Figure 7

Figure 7. Two-scale evolutionary landscape. (A) Germline dN/dS (x) vs. somatic dN/dS (y). The shaded quadrant marks the candidate region (x < 0.3, y ≥ 1.5). Because the binomial test over-estimates positive selection, the somatic threshold was raised to dN/dS ≥ 1.5 (CI lower bound > 1.0). (B) Aggregate filtering funnel: ~ DESeq2 genes → candidates across cohorts.

Figure 8

Figure 8. Candidate portfolio and clinical context. (A) Candidates per cohort; UCEC count (n = ) reflects MSI-driven hypermutation. (B) Somatic dN/dS matrix for genes in ≥ 2 cohorts. (C) Druggability of cross-cancer candidates. (D) GSEA pathway enrichment for BRCA candidates.