Business Context
You’re joining the ML team at OncoDx, a healthcare diagnostics company that partners with 40+ hospital systems to deliver tumor-of-origin predictions for patients with ambiguous pathology. The product ingests RNA-seq from biopsy samples and returns a ranked list of likely cancer types to guide treatment selection. A wrong prediction can lead to inappropriate therapy, while a “no-call” result delays care—so the model must be accurate, calibrated, and auditable for clinical review.
OncoDx has accumulated a large retrospective dataset and wants you to propose techniques to analyze genomic sequencing data and build a robust ML pipeline that can generalize across hospitals, sequencing runs, and batch effects.
Dataset
You are given a curated feature table derived from raw sequencing (assume alignment and quantification are already done). Each row is a sample.
| Feature Group | Approx. Count | Examples | Notes |
|---|---|---|---|
| Gene expression | 20,000 | TPM/CPM per gene | Long-tailed, sparse-ish, heavy batch effects |
| QC metrics | 25 | mapping_rate, duplication_rate, rRNA_rate, insert_size | Strong confounders; used for filtering and modeling |
| Metadata | 10 | hospital_id, library_prep_kit, sequencer_model, read_length | Some missing; high leakage risk if mishandled |
| Target label | 1 | cancer_type (18 classes) | Multi-class; imbalanced |
- Size: 120,000 samples total across 18 tumor classes; ~1.2 TB raw, ~9 GB feature table
- Class balance: Largest class ~22% (breast), smallest ~0.6% (bile duct)
- Missingness: ~8% missing in metadata fields (prep kit, sequencer model), ~2% samples with partial QC metrics
- Shift: Two new hospitals (10% of samples) use a newer library prep kit not present in early data
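With imbalance this severe (largest class ~22%, smallest ~0.6%), reweighting the loss is a common first lever. A minimal NumPy sketch of the standard "balanced" reweighting heuristic (the same formula scikit-learn uses), on a toy label vector that mimics the stated skew; the counts here are illustrative, not taken from the real dataset:

```python
import numpy as np

# Toy label vector: class 0 ~22%, class 2 ~0.6% of 1,000 samples (illustrative only)
y = np.array([0] * 220 + [1] * 774 + [2] * 6)

classes, counts = np.unique(y, return_counts=True)
# "balanced" heuristic: weight = n_samples / (n_classes * class_count),
# so rare classes contribute proportionally more to the training loss.
weights = len(y) / (len(classes) * counts)
```

These weights can be passed to most classifiers (e.g. via `class_weight`) instead of, or alongside, resampling.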
Success Criteria
- Clinical utility: Top-1 accuracy ≥ 0.78 overall and macro-F1 ≥ 0.70.
- Rare classes: Recall ≥ 0.55 for the bottom 5 classes by prevalence.
- Calibration: Expected Calibration Error (ECE) ≤ 0.04; enable a “no-call” threshold to keep false positives low.
- Generalization: Absolute macro-F1 drop on held-out hospitals ≤ 0.05 versus the in-hospital test set.
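The calibration and no-call criteria can be made concrete in a few lines. Below is a minimal NumPy sketch of ECE (equal-width binning on top-class confidence) and a confidence-threshold abstention rule; the function names and bin count are illustrative choices, not prescribed by the spec:

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=15):
    """ECE: bin samples by top-class confidence, then take the
    bin-size-weighted average of |accuracy - mean confidence| per bin."""
    conf = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return ece

def apply_no_call(probs, threshold=0.7):
    """Abstain (-1) when top-class confidence is below the threshold."""
    conf = probs.max(axis=1)
    pred = probs.argmax(axis=1)
    return np.where(conf >= threshold, pred, -1)
```

The no-call threshold would be tuned on validation data to trade coverage against false-positive rate.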
Constraints
- Interpretability: Must provide gene-level explanations suitable for clinical review (e.g., SHAP on a sparse linear model).
- Regulatory/audit: Reproducible preprocessing; strict train/test separation by hospital and time to avoid leakage.
- Compute: Training must finish in < 2 hours on a single 16-core CPU box with 64 GB RAM (no GPU assumed).
- Deployment: Batch scoring nightly; per-sample inference < 200 ms.
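The hospital/time separation requirement maps naturally onto group-aware cross-validation. A minimal sketch with scikit-learn's `GroupKFold` on synthetic data (hospital IDs, counts, and feature sizes here are made up); the key property is that no hospital ever appears on both sides of a split:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)
n = 200
hospital_id = rng.integers(0, 8, size=n)   # 8 hypothetical hospitals
X = rng.normal(size=(n, 5))                # toy features
y = rng.integers(0, 18, size=n)            # toy labels (18 classes)

# GroupKFold keeps every sample from a hospital in the same fold, so each
# test fold simulates scoring hospitals that were unseen during training.
gkf = GroupKFold(n_splits=4)
overlaps = []
for tr, te in gkf.split(X, y, groups=hospital_id):
    overlaps.append(len(set(hospital_id[tr]) & set(hospital_id[te])))
```

For the final audit, a stricter variant would hold out entire hospitals plus a later time window as a one-shot test set rather than rotating folds.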
Deliverables (what you must produce)
- A proposed end-to-end approach to analyze RNA-seq data for classification, including normalization, batch effect handling, and feature selection/dimensionality reduction.
- A training/validation strategy that avoids leakage and measures generalization across hospitals.
- A baseline model and a stronger model, with justification for each.
- Evaluation plan with metrics aligned to class imbalance and clinical risk (including calibration and abstention/no-call).
- A brief plan for monitoring drift (new prep kits/hospitals) and when to retrain.
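As one concrete starting point for the baseline deliverable, the sketch below chains log-CPM normalization, per-gene standardization, and an L1-penalized multinomial logistic regression on toy data. The pipeline shape is the point: the counts, dimensions, and `C` value are placeholders, and the L1 penalty yields a sparse coefficient matrix that can be inspected gene-by-gene, consistent with the interpretability constraint:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the feature table: 300 samples x 50 "genes" of raw counts
rng = np.random.default_rng(0)
counts = rng.poisson(5.0, size=(300, 50)).astype(float)
y = rng.integers(0, 3, size=300)           # toy 3-class labels

# counts -> counts-per-million -> log1p, a standard RNA-seq normalization
cpm = counts / counts.sum(axis=1, keepdims=True) * 1e6
X = np.log1p(cpm)

# Sparse linear baseline: CPU-friendly, auditable per-gene coefficients
clf = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="l1", solver="saga", C=0.1, max_iter=2000),
)
clf.fit(X, y)
coefs = clf[-1].coef_                      # shape: (n_classes, n_genes)
```

A stronger model (e.g. gradient-boosted trees on a reduced feature set) would then be compared against this baseline under the same group-aware splits.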