Business Context
You’re joining the ML team at OncoDx, a healthcare diagnostics company that partners with 40+ hospital systems to deliver tumor-of-origin predictions for patients with ambiguous pathology. The product ingests RNA-seq from biopsy samples and returns a ranked list of likely cancer types to guide treatment selection. A wrong prediction can lead to inappropriate therapy, while a “no-call” result delays care—so the model must be accurate, calibrated, and auditable for clinical review.
OncoDx has accumulated a large retrospective dataset and wants you to propose techniques to analyze genomic sequencing data and build a robust ML pipeline that can generalize across hospitals, sequencing runs, and batch effects.
Dataset
You are given a curated feature table derived from raw sequencing (assume alignment and quantification are already done). Each row is a sample.
| Feature Group | Approx. Count | Examples | Notes |
|---|---|---|---|
| Gene expression | 20,000 | TPM/CPM per gene | Long-tailed, sparse-ish, heavy batch effects |
| QC metrics | 25 | mapping_rate, duplication_rate, rRNA_rate, insert_size | Strong confounders; used for filtering and modeling |
| Metadata | 10 | hospital_id, library_prep_kit, sequencer_model, read_length | Some missing; high leakage risk if mishandled |
| Target label | 1 | cancer_type (18 classes) | Multi-class; imbalanced |
- Size: 120,000 samples total across 18 tumor classes; ~1.2 TB raw, ~9 GB feature table
- Class balance: Largest class ~22% (breast), smallest ~0.6% (bile duct)
- Missingness: ~8% missing in metadata fields (prep kit, sequencer model), ~2% samples with partial QC metrics
- Shift: Two new hospitals (10% of samples) use a newer library prep kit not present in early data
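With imbalance this severe (largest class ~22%, smallest ~0.6%), reweighting the loss is a common first lever. A minimal NumPy sketch of the standard "balanced" reweighting heuristic (the same formula scikit-learn uses), on a toy label vector that mimics the stated skew; the counts here are illustrative, not taken from the real dataset:

```python
import numpy as np

# Toy label vector: class 0 ~22%, class 2 ~0.6% of 1,000 samples (illustrative only)
y = np.array([0] * 220 + [1] * 774 + [2] * 6)

classes, counts = np.unique(y, return_counts=True)
# "balanced" heuristic: weight = n_samples / (n_classes * class_count),
# so rare classes contribute proportionally more to the training loss.
weights = len(y) / (len(classes) * counts)
```

These weights can be passed to most classifiers (e.g. via `class_weight`) instead of, or alongside, resampling.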
Success Criteria
- Clinical utility: Top-1 accuracy ≥ 0.78 overall and macro-F1 ≥ 0.70.
- Rare classes: Recall ≥ 0.55 for the bottom 5 classes by prevalence.
- Calibration: Expected Calibration Error (ECE) ≤ 0.04; enable a “no-call” threshold to keep false positives low.
- Generalization: Absolute macro-F1 drop on held-out hospitals ≤ 0.05 versus the in-hospital test set.
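The calibration and no-call criteria can be made concrete in a few lines. Below is a minimal NumPy sketch of ECE (equal-width binning on top-class confidence) and a confidence-threshold abstention rule; the function names and bin count are illustrative choices, not prescribed by the spec:

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=15):
    """ECE: bin samples by top-class confidence, then take the
    bin-size-weighted average of |accuracy - mean confidence| per bin."""
    conf = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return ece

def apply_no_call(probs, threshold=0.7):
    """Abstain (-1) when top-class confidence is below the threshold."""
    conf = probs.max(axis=1)
    pred = probs.argmax(axis=1)
    return np.where(conf >= threshold, pred, -1)
```

The no-call threshold would be tuned on validation data to trade coverage against false-positive rate.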
Constraints
- Interpretability: Must provide gene-level explanations suitable for clinical review (e.g., SHAP on a sparse linear model).
- Regulatory/audit: Reproducible preprocessing; strict train/test separation by hospital and time to avoid leakage.
- Compute: Training must finish in < 2 hours on a single 16-core CPU box with 64 GB RAM (no GPU assumed).
- Deployment: Batch scoring nightly; per-sample inference < 200 ms.
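The hospital/time separation requirement maps naturally onto group-aware cross-validation. A minimal sketch with scikit-learn's `GroupKFold` on synthetic data (hospital IDs, counts, and feature sizes here are made up); the key property is that no hospital ever appears on both sides of a split:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)
n = 200
hospital_id = rng.integers(0, 8, size=n)   # 8 hypothetical hospitals
X = rng.normal(size=(n, 5))                # toy features
y = rng.integers(0, 18, size=n)            # toy labels (18 classes)

# GroupKFold keeps every sample from a hospital in the same fold, so each
# test fold simulates scoring hospitals that were unseen during training.
gkf = GroupKFold(n_splits=4)
overlaps = []
for tr, te in gkf.split(X, y, groups=hospital_id):
    overlaps.append(len(set(hospital_id[tr]) & set(hospital_id[te])))
```

For the final audit, a stricter variant would hold out entire hospitals plus a later time window as a one-shot test set rather than rotating folds.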
Deliverables (what you must produce)
- A proposed end-to-end approach to analyze RNA-seq data for classification, including normalization, batch effect handling, and feature selection/dimensionality reduction.
- A training/validation strategy that avoids leakage and measures generalization across hospitals.
- A baseline model and a stronger model, with justification for each.
- Evaluation plan with metrics aligned to class imbalance and clinical risk (including calibration and abstention/no-call).
- A brief plan for monitoring drift (new prep kits/hospitals) and when to retrain.
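As one concrete starting point for the baseline deliverable, the sketch below chains log-CPM normalization, per-gene standardization, and an L1-penalized multinomial logistic regression on toy data. The pipeline shape is the point: the counts, dimensions, and `C` value are placeholders, and the L1 penalty yields a sparse coefficient matrix that can be inspected gene-by-gene, consistent with the interpretability constraint:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the feature table: 300 samples x 50 "genes" of raw counts
rng = np.random.default_rng(0)
counts = rng.poisson(5.0, size=(300, 50)).astype(float)
y = rng.integers(0, 3, size=300)           # toy 3-class labels

# counts -> counts-per-million -> log1p, a standard RNA-seq normalization
cpm = counts / counts.sum(axis=1, keepdims=True) * 1e6
X = np.log1p(cpm)

# Sparse linear baseline: CPU-friendly, auditable per-gene coefficients
clf = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="l1", solver="saga", C=0.1, max_iter=2000),
)
clf.fit(X, y)
coefs = clf[-1].coef_                      # shape: (n_classes, n_genes)
```

A stronger model (e.g. gradient-boosted trees on a reduced feature set) would then be compared against this baseline under the same group-aware splits.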