Pipeline Architecture
Interactive visualisation of the complete analysis pipeline — from data acquisition through ML classification to evolutionary candidate identification.
📋 Pipeline Summary
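Data acquisition: Download TCGA RNA-seq HTSeq counts and MAF mutation files for 5 cancer types via the GDC API, then pre-filter with DESeq2 to ~13,660 genes.
ML classification: Train logistic regression (LR), random forest (RF), and multilayer perceptron (MLP) classifiers with 5-fold cross-validation. Extract a feature-importance signature from each model and take the union of the top genes across models.
Evolutionary candidate identification: Select genes with germline dN/dS < 0.3 (purifying selection) and somatic dN/dS ≥ 1.5 with FDR < 0.05 (positive selection). Genes that meet both criteria are the candidates.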
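The candidate step can be sketched as a simple two-sided filter. This is a minimal illustration, not the pipeline's actual code: the `SelectionStats` container, the function names, and the gene values are hypothetical, and only the three thresholds come from the summary above.

```python
from dataclasses import dataclass


@dataclass
class SelectionStats:
    """Per-gene selection statistics (hypothetical container)."""
    germline_dnds: float
    somatic_dnds: float
    somatic_fdr: float


def is_candidate(s, germline_max=0.3, somatic_min=1.5, fdr_max=0.05):
    # Purifying selection in the germline: dN/dS strictly below 0.3.
    purifying = s.germline_dnds < germline_max
    # Positive selection in tumours: dN/dS of at least 1.5 and FDR below 0.05.
    positive = s.somatic_dnds >= somatic_min and s.somatic_fdr < fdr_max
    # A candidate must satisfy both criteria (the intersection).
    return purifying and positive


def find_candidates(stats_by_gene):
    return sorted(g for g, s in stats_by_gene.items() if is_candidate(s))


# Made-up values, for illustration only.
example = {
    "GENE_A": SelectionStats(0.12, 2.1, 0.01),    # passes both filters
    "GENE_B": SelectionStats(0.45, 2.4, 0.001),   # germline dN/dS too high
    "GENE_C": SelectionStats(0.20, 1.6, 0.20),    # somatic FDR too high
    "GENE_D": SelectionStats(0.29, 1.5, 0.049),   # passes at the boundaries
}
candidates = find_candidates(example)
```

Note the asymmetric boundaries: the somatic threshold is inclusive (≥ 1.5), while the germline and FDR thresholds are strict.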
⚙️ Pipeline Improvements (April 2026)
FocalLoss
Replaces BCEWithLogitsLoss. Down-weights easy examples so that training focuses on hard-to-classify normal samples (α = 0.25, γ = 2.0).
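For reference, the standard binary focal loss for one sample can be written in plain Python as below. This is a sketch of the general formula using the stated α and γ, not the pipeline's implementation. The function names are hypothetical, and the choice of label 1 as the class weighted by α is an assumption, since the source does not say how classes are encoded.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def bce_loss(logit, target):
    # Plain binary cross-entropy on a logit, for comparison.
    p = sigmoid(logit)
    p_t = max(p if target == 1 else 1.0 - p, 1e-12)
    return -math.log(p_t)


def focal_loss(logit, target, alpha=0.25, gamma=2.0):
    p = sigmoid(logit)
    # p_t is the probability assigned to the true class.
    p_t = max(p if target == 1 else 1.0 - p, 1e-12)
    # Assumption: alpha weights label 1 and (1 - alpha) weights label 0.
    alpha_t = alpha if target == 1 else 1.0 - alpha
    # (1 - p_t)^gamma shrinks the loss of well-classified samples.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy = focal_loss(4.0, 1)    # confident and correct: heavily down-weighted
hard = focal_loss(-1.0, 1)   # misclassified: keeps most of its loss
```

With γ = 0 and α = 1 the expression reduces to ordinary binary cross-entropy, which makes the modulating factor the only difference between the two losses.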
SMOTE Oversampling
Synthetic minority oversampling (SMOTE) of the normal class in PRAD and BLCA to address class imbalance.
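The core idea of SMOTE is to create new minority samples by interpolating between an existing minority sample and one of its nearest minority neighbours. The following is a simplified, dependency-free sketch of that idea. The function name, `k`, and the toy data are assumptions, and a real pipeline would more likely rely on a library implementation such as `SMOTE` from imbalanced-learn.

```python
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Generate n_synthetic points by interpolating between minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by squared Euclidean distance.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(
            key=lambda j: sum((a - b) ** 2 for a, b in zip(base, minority[j]))
        )
        neighbour = minority[rng.choice(others[:k])]
        # Pick a random point on the segment between base and neighbour.
        t = rng.random()
        synthetic.append([a + t * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Toy 2-feature "normal" samples, for illustration only.
normals = [[1.0, 2.0], [1.5, 1.8], [0.9, 2.2], [1.2, 2.1]]
new_samples = smote(normals, n_synthetic=6, k=2)
```

Because every synthetic point lies on a segment between two real minority samples, oversampling should be applied to the training folds only, so that interpolated points never leak into validation data.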
DESeq2 Pre-filter
Differential expression filter: |log2FC| > 1.5, BH-adjusted p < 0.05. Retains ~13,660 informative genes.
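The two thresholds can be illustrated with a small sketch that applies Benjamini-Hochberg (BH) adjustment to raw p-values and then filters on fold change. It is not a substitute for DESeq2, which estimates the fold changes and p-values in the first place. The function names and the numbers are hypothetical.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


def de_filter(genes, log2fc, pvalues, lfc_cutoff=1.5, padj_cutoff=0.05):
    padj = bh_adjust(pvalues)
    # Keep genes with |log2FC| > 1.5 and BH-adjusted p < 0.05.
    return [
        g for g, lfc, q in zip(genes, log2fc, padj)
        if abs(lfc) > lfc_cutoff and q < padj_cutoff
    ]


# Made-up statistics, for illustration only.
genes = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]
log2fc = [2.3, -1.9, 1.2, -3.0]
pvalues = [0.001, 0.02, 0.0005, 0.30]
kept = de_filter(genes, log2fc, pvalues)
```

Here GENE_C is dropped for a small fold change despite strong significance, and GENE_D is dropped for a non-significant adjusted p-value despite a large fold change.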
Pseudogene Blacklist
Removes processed/unprocessed pseudogenes from all signatures using Ensembl biotype annotations.
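A blacklist of this kind amounts to a lookup against a gene-to-biotype table. The sketch below assumes such a table is already loaded as a dictionary. The function name and gene identifiers are hypothetical, while `processed_pseudogene` and `unprocessed_pseudogene` are the corresponding Ensembl biotype labels.

```python
# Ensembl biotype labels for the two pseudogene classes named above.
PSEUDOGENE_BIOTYPES = {"processed_pseudogene", "unprocessed_pseudogene"}


def drop_pseudogenes(signature, biotype_by_gene):
    """Return the signature without blacklisted pseudogenes, preserving order.

    Assumption: genes missing from the annotation table are kept.
    """
    return [
        gene for gene in signature
        if biotype_by_gene.get(gene) not in PSEUDOGENE_BIOTYPES
    ]


# Hypothetical annotation table and signature, for illustration only.
biotypes = {
    "GENE_A": "protein_coding",
    "GENE_B": "processed_pseudogene",
    "GENE_C": "unprocessed_pseudogene",
    "GENE_D": "lncRNA",
}
signature = ["GENE_A", "GENE_B", "GENE_C", "GENE_D", "GENE_E"]
cleaned = drop_pseudogenes(signature, biotypes)
```

Ensembl also defines further pseudogene biotypes (for example transcribed variants). Whether the pipeline excludes those as well is not stated, so the set above is limited to the two classes the description names.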
ComBat Correction
Batch correction for PRAD adjacent-normal tissue heterogeneity using ComBat-seq.
Dynamic Architecture
The MLP selects its hidden-layer sizes automatically: 512→256→128 for n > 600 and 256→128 for smaller datasets, with BatchNorm.
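The selection rule can be sketched as a small function that returns the hidden-layer widths and a textual layer plan. Several details here are assumptions not confirmed by the description: that n is the number of samples, that BatchNorm follows every hidden linear layer in both configurations, that the activation is ReLU, and that the output is a single logit. The function names are hypothetical.

```python
def select_hidden_layers(n_samples, threshold=600):
    # Larger datasets get the deeper 512-256-128 stack; others get 256-128.
    return [512, 256, 128] if n_samples > threshold else [256, 128]


def describe_mlp(n_features, n_samples):
    """Return a readable layer plan (strings only, no framework required)."""
    layers = []
    in_dim = n_features
    for width in select_hidden_layers(n_samples):
        layers.append(f"Linear({in_dim}, {width})")
        layers.append(f"BatchNorm1d({width})")  # assumed placement
        layers.append("ReLU()")                 # assumed activation
        in_dim = width
    layers.append(f"Linear({in_dim}, 1)")       # assumed single-logit output
    return layers


small_plan = describe_mlp(n_features=13660, n_samples=500)
large_plan = describe_mlp(n_features=13660, n_samples=900)
```

A dataset with exactly 600 samples falls into the smaller configuration, because the rule is a strict n > 600.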