📋 Pipeline Summary

Stage 1: Data Acquisition

Download TCGA RNA-seq HTSeq counts and MAF mutation files for 5 cancer types via the GDC API, then apply a DESeq2 pre-filter that retains ~13,660 genes.
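As a rough illustration of the acquisition step, the sketch below builds a request body for the GDC `files` endpoint without sending it. The project list (only PRAD and BLCA are named on this page), the returned fields, and the page size are assumptions, not the pipeline's actual settings.

```python
import json

# Public GDC files endpoint; the request itself is not sent in this sketch.
GDC_FILES_ENDPOINT = "https://api.gdc.cancer.gov/files"


def build_counts_query(project_ids, size=1000):
    """Build a JSON-serialisable query body selecting HTSeq count files."""

    def clause(field, values):
        # One "field is in values" condition in the GDC filter syntax.
        return {"op": "in", "content": {"field": field, "value": list(values)}}

    filters = {
        "op": "and",
        "content": [
            clause("cases.project.project_id", project_ids),
            clause("files.data_type", ["Gene Expression Quantification"]),
            clause("files.analysis.workflow_type", ["HTSeq - Counts"]),
        ],
    }
    return {
        "filters": filters,
        "fields": "file_id,file_name,cases.project.project_id",  # assumed field list
        "format": "JSON",
        "size": str(size),
    }


# Assumed project IDs: the two cancer types named elsewhere on this page.
payload = build_counts_query(["TCGA-PRAD", "TCGA-BLCA"])
print(json.dumps(payload, indent=2))
```

The body would typically be sent as a JSON POST to `GDC_FILES_ENDPOINT`; MAF files would need a separate query with different `data_type` and `workflow_type` values.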

Stage 2: ML Classification

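Train logistic regression (LR), random forest (RF), and multilayer perceptron (MLP) classifiers with 5-fold cross-validation. Extract a feature-importance signature from each model and take the union of the top genes across models.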
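The union step can be sketched as follows, assuming each model yields a gene-to-importance mapping. The gene names, scores, and cut-off `k` are hypothetical, and ranking by absolute value (so that signed LR coefficients compare sensibly) is an assumption rather than the pipeline's documented rule.

```python
def top_genes(importance, k):
    """Return the k genes with the largest absolute importance score."""
    ranked = sorted(importance.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return {gene for gene, _ in ranked[:k]}


def union_signature(importances_by_model, k):
    """Union of each model's top-k genes."""
    signature = set()
    for importance in importances_by_model.values():
        signature |= top_genes(importance, k)
    return signature


# Hypothetical per-model importance scores.
importances = {
    "LR": {"GENE_A": -2.1, "GENE_B": 0.4, "GENE_C": 1.3},
    "RF": {"GENE_A": 0.50, "GENE_B": 0.05, "GENE_D": 0.45},
    "MLP": {"GENE_C": 0.9, "GENE_D": 0.1, "GENE_E": 0.8},
}
signature = union_signature(importances, k=2)
print(sorted(signature))
```

A union keeps any gene that at least one model ranks highly, so the signature is broader than an intersection would be.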

Stage 3: Evolutionary Filtering

Keep genes with germline dN/dS < 0.3 (purifying selection) and somatic dN/dS ≥ 1.5 at FDR < 0.05 (positive selection). The intersection of the two sets gives the candidate genes.
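A minimal sketch of this filter, assuming the per-gene dN/dS estimates and FDR values have already been computed upstream. The `GeneStats` record and the gene names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class GeneStats:
    germline_dnds: float  # germline dN/dS estimate
    somatic_dnds: float   # somatic dN/dS estimate
    somatic_fdr: float    # FDR-adjusted p-value for the somatic test


def candidates(stats, germ_max=0.3, som_min=1.5, fdr_max=0.05):
    """Genes under germline purifying AND somatic positive selection."""
    purifying = {g for g, s in stats.items() if s.germline_dnds < germ_max}
    positive = {
        g for g, s in stats.items()
        if s.somatic_dnds >= som_min and s.somatic_fdr < fdr_max
    }
    return purifying & positive


# Hypothetical values chosen to exercise each threshold.
stats = {
    "GENE_A": GeneStats(0.10, 2.0, 0.01),  # passes both filters
    "GENE_B": GeneStats(0.50, 2.5, 0.01),  # fails the germline filter
    "GENE_C": GeneStats(0.20, 1.2, 0.01),  # somatic dN/dS too low
    "GENE_D": GeneStats(0.20, 1.5, 0.20),  # not significant
}
print(sorted(candidates(stats)))
```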

⚙️ Pipeline Improvements (April 2026)

🎯

FocalLoss

Replaces BCEWithLogitsLoss. Down-weights easy examples so the model focuses on hard-to-classify normal samples (α=0.25, γ=2.0).
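For reference, the standard binary focal loss is FL = -α_t (1 - p_t)^γ log(p_t). The pure-Python sketch below evaluates it for a single logit; it illustrates the formula and is not the pipeline's own loss module. Which class is encoded as the positive label (and so receives α rather than 1 - α) is an assumption here.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def focal_loss(logit, target, alpha=0.25, gamma=2.0, eps=1e-12):
    """Binary focal loss for one example; target is 0 or 1."""
    p = sigmoid(logit)
    p_t = p if target == 1 else 1.0 - p              # probability of the true class
    alpha_t = alpha if target == 1 else 1.0 - alpha  # class weighting
    # (1 - p_t)^gamma shrinks the loss of well-classified examples.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(max(p_t, eps))


easy = focal_loss(4.0, 1)   # confident and correct
hard = focal_loss(-1.0, 1)  # confident and wrong
print(easy, hard)
```

With γ = 0 the modulating factor disappears and the expression reduces to α-weighted binary cross-entropy.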

⚖️

SMOTE Oversampling

Synthetic minority oversampling (SMOTE) of the normal class in PRAD and BLCA to address class imbalance.
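The core idea of SMOTE is to create a new sample on the line segment between a minority sample and one of its nearest minority neighbours. The sketch below shows that interpolation in plain Python; it is a simplified illustration rather than the library implementation the pipeline presumably uses, and `k`, `seed`, and the toy points are assumptions.

```python
import random


def smote(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic points from a list of minority samples."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance to x.
        others = [m for m in minority if m is not x]
        others.sort(key=lambda m: sum((a - b) ** 2 for a, b in zip(x, m)))
        neighbour = rng.choice(others[:k])
        t = rng.random()  # position along the segment from x to neighbour
        synthetic.append([a + t * (b - a) for a, b in zip(x, neighbour)])
    return synthetic


# Toy two-feature minority class.
normal = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote(normal, n_new=6, k=2)
print(len(new_points))
```

Oversampling would normally be applied to the training folds only, so that synthetic points never leak into validation data.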

🧬

DESeq2 Pre-filter

Differential expression filter: |log2FC| > 1.5, BH-adjusted p < 0.05. Retains ~13,660 informative genes.
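The thresholds above can be applied as in the sketch below, which assumes log2 fold changes and raw p-values are already available per gene. DESeq2 reports BH-adjusted p-values itself; the Benjamini-Hochberg step is re-implemented here only to make the filter self-contained, and the gene names and values are hypothetical.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def de_filter(genes, log2fc, pvalues, lfc_min=1.5, padj_max=0.05):
    """Keep genes with |log2FC| > lfc_min and BH-adjusted p < padj_max."""
    padj = bh_adjust(pvalues)
    return [
        g for g, lfc, q in zip(genes, log2fc, padj)
        if abs(lfc) > lfc_min and q < padj_max
    ]


# Hypothetical inputs.
genes = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]
log2fc = [2.0, -1.8, 0.5, -3.0]
pvalues = [0.01, 0.04, 0.03, 0.005]
kept = de_filter(genes, log2fc, pvalues)
print(kept)
```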

🚫

Pseudogene Blacklist

Removes processed/unprocessed pseudogenes from all signatures using Ensembl biotype annotations.
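A blacklist of this kind might look like the following sketch, assuming a gene-to-biotype mapping has been loaded from an Ensembl annotation. The gene IDs are hypothetical. Ensembl defines further pseudogene biotypes (for example `transcribed_processed_pseudogene`); only the two categories named above are included here.

```python
# Ensembl biotype labels for the two categories named in the text.
PSEUDOGENE_BIOTYPES = {"processed_pseudogene", "unprocessed_pseudogene"}


def drop_pseudogenes(signature, biotype_by_gene):
    """Remove blacklisted biotypes; genes without annotation are kept."""
    return [
        gene for gene in signature
        if biotype_by_gene.get(gene) not in PSEUDOGENE_BIOTYPES
    ]


# Hypothetical annotation table and signature.
biotypes = {
    "GENE_A": "protein_coding",
    "GENE_B": "processed_pseudogene",
    "GENE_C": "unprocessed_pseudogene",
    "GENE_D": "lncRNA",
}
signature = ["GENE_A", "GENE_B", "GENE_C", "GENE_D", "GENE_E"]
cleaned = drop_pseudogenes(signature, biotypes)
print(cleaned)
```

Keeping unannotated genes is a design choice in this sketch; a stricter variant could drop anything missing from the annotation.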

🔧

ComBat Correction

Batch correction for PRAD adjacent-normal tissue heterogeneity using ComBat-seq.

📐

Dynamic Architecture

The MLP auto-selects hidden layers of 512→256→128 for n > 600 and 256→128 for smaller datasets, with BatchNorm.
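The selection rule can be sketched as a small function that returns a layer plan, assuming n is the number of training samples. The ReLU activations, the single-logit output layer, and the tuple-based plan format are illustrative assumptions; only the hidden sizes, the 600 threshold, and the use of BatchNorm come from the description above.

```python
def hidden_layers(n_samples, threshold=600):
    """Pick hidden-layer widths from the dataset size."""
    return [512, 256, 128] if n_samples > threshold else [256, 128]


def layer_plan(n_features, n_samples):
    """Describe the MLP as a list of (layer_type, *dims) tuples."""
    sizes = [n_features] + hidden_layers(n_samples)
    plan = []
    for fan_in, fan_out in zip(sizes, sizes[1:]):
        plan.append(("Linear", fan_in, fan_out))
        plan.append(("BatchNorm1d", fan_out))
        plan.append(("ReLU",))  # assumed activation
    plan.append(("Linear", sizes[-1], 1))  # assumed single-logit output
    return plan


# ~13,660 input genes after the DESeq2 pre-filter; sample count is hypothetical.
for layer in layer_plan(n_features=13660, n_samples=500):
    print(layer)
```

Each tuple maps directly onto a layer constructor in a framework such as PyTorch, so the plan can be turned into a model with a short loop.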