OncoMap Bio is building a clinical decision-support workflow that uses next-generation sequencing (NGS) tumor profiles to help researchers distinguish cancer subtypes and prioritize downstream analysis. You need to train a machine learning model that predicts cancer type from NGS-derived features while remaining interpretable enough for translational research teams.
You are given a retrospective dataset of tumor samples collected from multiple hospital partners. Each sample is described by 186 features in five groups:

| Feature Group | Count | Examples |
|---|---|---|
| Gene expression | 120 | normalized expression of driver genes such as TP53, EGFR, BRCA1 |
| Somatic mutation indicators | 35 | binary flags for common pathogenic variants |
| Copy number alteration features | 18 | amplification/deletion burden, chromosome-arm events |
| Sequencing quality metrics | 7 | read depth, mapping rate, tumor purity estimate |
| Clinical covariates | 6 | age, sex, biopsy site, smoking history |
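
One way to keep the five feature groups explicit in code is a small schema that maps every column to its group, which also makes it easy to roll per-feature importance scores up to group level later. The sketch below takes the group names and counts from the table above; the column prefixes (`expr`, `mut`, `cna`, `qc`, `clin`) and the helper names are hypothetical placeholders, not the dataset's real headers.

```python
# Feature groups and counts copied from the table above.
# The column-name prefixes are hypothetical placeholders, not real dataset headers.
FEATURE_GROUPS = {
    "gene_expression": ("expr", 120),
    "somatic_mutation": ("mut", 35),
    "copy_number": ("cna", 18),
    "sequencing_quality": ("qc", 7),
    "clinical": ("clin", 6),
}


def build_column_index(groups=FEATURE_GROUPS):
    """Map each placeholder column name (e.g. 'expr_000') to its feature group."""
    index = {}
    for group, (prefix, count) in groups.items():
        for i in range(count):
            index[f"{prefix}_{i:03d}"] = group
    return index


def importance_by_group(importances, column_index):
    """Sum per-feature importance scores within each feature group."""
    totals = {}
    for column, score in importances.items():
        group = column_index[column]
        totals[group] = totals.get(group, 0.0) + score
    return totals


COLUMN_INDEX = build_column_index()
TOTAL_FEATURES = len(COLUMN_INDEX)  # sum of the Count column
```

Grouped totals of this kind give translational teams a first, coarse view (for example, whether expression or mutation features dominate) before they drill into individual genes.
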

A good solution should reach macro F1 >= 0.78 and one-vs-rest ROC-AUC >= 0.90 on a held-out test set. The team also wants feature-importance outputs that show which genomic signals drive the predictions.
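
To make the two acceptance metrics concrete, here is a minimal pure-Python sketch of how they could be computed and checked against the targets. It assumes the one-vs-rest ROC-AUC is macro-averaged across classes, which the brief does not specify; the function names and the toy labels are illustrative only. In practice a library such as scikit-learn offers equivalents (`f1_score(..., average="macro")` and `roc_auc_score(..., multi_class="ovr")`).

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in either list."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def binary_auc(is_positive, scores):
    """Probability that a positive outranks a negative (ties count as half)."""
    pos = [s for flag, s in zip(is_positive, scores) if flag]
    neg = [s for flag, s in zip(is_positive, scores) if not flag]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def ovr_roc_auc(y_true, proba, classes):
    """Macro-averaged one-vs-rest AUC (assumption); proba[i][k] scores classes[k]."""
    aucs = [
        binary_auc([t == c for t in y_true], [row[k] for row in proba])
        for k, c in enumerate(classes)
    ]
    return sum(aucs) / len(aucs)


def meets_targets(f1, auc, f1_min=0.78, auc_min=0.90):
    """Check both thresholds from the brief."""
    return f1 >= f1_min and auc >= auc_min


# Synthetic toy labels for illustration only; not drawn from the real dataset.
classes = ["A", "B", "C"]
y_true = ["A", "A", "B", "B", "C", "C"]
y_pred = ["A", "A", "B", "C", "C", "C"]
proba = [
    [0.8, 0.1, 0.1], [0.7, 0.2, 0.1], [0.2, 0.6, 0.2],
    [0.1, 0.4, 0.5], [0.1, 0.2, 0.7], [0.2, 0.2, 0.6],
]
f1 = macro_f1(y_true, y_pred)
auc = ovr_roc_auc(y_true, proba, classes)
passed = meets_targets(f1, auc)
```

Because macro F1 weights every class equally, a rare cancer type that is poorly predicted pulls the score down as much as a common one, which is presumably why the brief prefers it over plain accuracy.
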