Business Context
You’re working with HelixNova Therapeutics, a biotech company running high-throughput perturbation experiments to prioritize drug targets. They measure gene expression signatures from human cell lines after exposure to experimental compounds. The downstream users are computational biologists and translational scientists who must justify mechanistic hypotheses to internal review boards and, eventually, regulators. A black-box model that is slightly more accurate but cannot be explained is often less valuable than a transparent model that yields a plausible biological relationship.
The team wants to predict a continuous phenotype: cell viability at 72 hours after treatment (normalized 0–1). Your task is to propose and prototype two approaches:
- Symbolic regression to discover an explicit mathematical expression (e.g., sparse combinations of genes and nonlinear transforms).
- A standard neural network regressor as a performance-oriented baseline.
You must compare them in a way that a mixed audience (ML + biology) will accept.
Dataset
The dataset comes from a pooled set of experiments across multiple assay batches and cell lines.
| Component | Details |
|---|---|
| Rows | 240,000 treatment samples (compound × dose × cell line × replicate) |
| Features (X) | 1,024 numeric gene-expression features (log2 fold-change vs control), plus 12 metadata features |
| Metadata examples | cell_line_id (categorical), dose_uM (numeric), assay_batch (categorical), timepoint_hr (mostly 72) |
| Target (y) | viability_72h (float in [0, 1]) |
| Data quirks | Strong batch effects; correlated genes; heavy-tailed noise; occasional outliers from failed wells |
| Missingness | ~3% missing gene values (low-expression genes dropped in some runs); ~8% missing metadata for legacy batches |
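A minimal preprocessing sketch for the quirks above: median imputation for the ~3% missing gene values, robust scaling against heavy-tailed noise and failed-well outliers, and mode imputation plus one-hot encoding for the legacy-batch metadata gaps. Column names here are hypothetical stand-ins for the schema in the table (a handful of `gene_*` columns instead of 1,024), not the real feature list.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, RobustScaler

# Hypothetical columns mirroring the table above (4 genes stand in for 1,024).
gene_cols = [f"gene_{i}" for i in range(4)]
num_meta = ["dose_uM", "timepoint_hr"]
cat_meta = ["cell_line_id", "assay_batch"]

preprocess = ColumnTransformer([
    # Median imputation + RobustScaler: resilient to outliers from failed wells.
    ("genes", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", RobustScaler()),
    ]), gene_cols),
    ("num_meta", SimpleImputer(strategy="median"), num_meta),
    # Mode-impute missing legacy metadata, then one-hot encode categoricals.
    ("cat_meta", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), cat_meta),
])

# Tiny synthetic frame just to show the shapes end to end.
df = pd.DataFrame({
    **{c: np.random.randn(8) for c in gene_cols},
    "dose_uM": [0.1, 1, 10, np.nan, 0.1, 1, 10, 1],
    "timepoint_hr": [72] * 8,
    "cell_line_id": ["A", "B", "A", "B", np.nan, "A", "B", "A"],
    "assay_batch": ["b1", "b1", "b2", "b2", "b1", "b2", "b1", "b2"],
})
df.loc[0, "gene_0"] = np.nan  # simulate a missing gene value
Xt = preprocess.fit_transform(df)
print(Xt.shape)  # 4 genes + 2 numeric meta + 4 one-hot columns
```

`handle_unknown="ignore"` matters here: new compounds will arrive with cell lines or batches unseen at fit time, and scoring must not crash on them.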
Split requirement (avoid leakage)
- You must evaluate generalization to new compounds: hold out entire compounds by grouping the test split on compound_id, and use grouped (ideally nested) CV for hyperparameter tuning, so that no compound appears on both sides of any split.
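The grouped split can be sketched with scikit-learn's group-aware splitters: `GroupShuffleSplit` to hold out whole compounds for the final test set, and `GroupKFold` on the remaining data for tuning. Data here is synthetic; `compound_id` is the grouping key named in the requirement.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, GroupShuffleSplit

rng = np.random.default_rng(0)
n = 1000
compound_id = rng.integers(0, 50, size=n)  # 50 hypothetical compounds
X = rng.normal(size=(n, 5))
y = rng.random(n)

# Outer split: hold out entire compounds for the unseen-compound test set.
outer = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(outer.split(X, y, groups=compound_id))

# Inner grouped CV on the training portion for hyperparameter tuning.
inner = GroupKFold(n_splits=5)
for fold_tr, fold_val in inner.split(X[train_idx], y[train_idx],
                                     groups=compound_id[train_idx]):
    # fit/tune here; no compound appears in both fold_tr and fold_val
    pass

# Leakage check: train and test share no compounds.
print(set(compound_id[train_idx]).isdisjoint(compound_id[test_idx]))
```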
Success Criteria
- Predictive performance: achieve RMSE ≤ 0.12 on a held-out set of unseen compounds (baseline RMSE is ~0.18 using mean viability per cell line).
- Interpretability:
- Symbolic model must be expressible as a compact equation using ≤ 20 terms and ≤ 30 unique genes.
- Provide a ranked list of genes/terms with directionality and biological plausibility.
- Stability: equation should be reasonably stable across folds (no wildly different formulas each run).
- Operational: inference must run in batch to score 10 million candidate (compound, dose, cell line) combinations weekly; per-row latency is not strict, but throughput matters.
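The RMSE target and the mean-per-cell-line baseline can be checked with a small evaluation helper, along with error slices by dose bin and cell line for the mixed audience. Everything below runs on synthetic data; column names follow the dataset table, and the noise level is arbitrary.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 2000
df = pd.DataFrame({
    "cell_line_id": rng.choice(["A", "B", "C"], n),
    "dose_uM": rng.choice([0.1, 1.0, 10.0], n),
})
df["viability_72h"] = rng.random(n)
# Stand-in model predictions: truth plus small noise.
df["pred"] = df["viability_72h"] + rng.normal(0, 0.05, n)

def rmse(y, yhat):
    return float(np.sqrt(np.mean((np.asarray(y) - np.asarray(yhat)) ** 2)))

# Naive baseline from the brief: mean viability per cell line (~0.18 RMSE on real data).
baseline = df.groupby("cell_line_id")["viability_72h"].transform("mean")
overall = rmse(df["viability_72h"], df["pred"])
base = rmse(df["viability_72h"], baseline)

# Error sliced by dose bin and by cell line, beyond a single global RMSE.
by_dose = {d: rmse(g["viability_72h"], g["pred"]) for d, g in df.groupby("dose_uM")}
by_line = {c: rmse(g["viability_72h"], g["pred"]) for c, g in df.groupby("cell_line_id")}
print(overall, base, by_dose)
```

Reporting the same slices for both models makes the comparison legible to biologists: a model that wins on average but fails at the highest dose bin is a different story from one that is uniformly better.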
Constraints
- Interpretability is a hard requirement for the primary model used in decision-making.
- You cannot use pathway databases at training time (assume licensing constraints), but you can do generic feature engineering.
- Compute budget: single machine with 16 vCPU / 1 GPU (optional), 64 GB RAM; training must finish within 2 hours.
Deliverables
- Explain, in production terms, the advantages and disadvantages of symbolic regression vs a neural network for this biological setting (interpretability, extrapolation, robustness, maintenance).
- Propose an evaluation plan with grouped CV by compound and metrics beyond RMSE (e.g., calibration by dose bins, error by cell line).
- Implement both models (a practical symbolic regression baseline and a neural network) and compare results.
- Show how you would constrain symbolic regression complexity (operators allowed, term limits, regularization) and how you would prevent overfitting.
- Recommend which model to ship as the “decision model,” and whether the other should be used as a challenger or for residual analysis.
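A minimal sketch of the two-model comparison on synthetic data with a sparse nonlinear ground truth. As a practical stand-in for full symbolic regression (a dedicated engine such as gplearn or PySR would search operators genetically), the "symbolic" baseline uses a fixed operator library (identity, tanh, pairwise products) with L1 regularization acting as the term limit; the challenger is a small MLP. All names and hyperparameters here are illustrative assumptions, not a tuned recipe.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
n, p = 3000, 20
X = rng.normal(size=(n, p))
# Sparse nonlinear ground truth: one tanh term, one gene-gene interaction.
y = 0.5 + 0.2 * np.tanh(X[:, 0]) - 0.15 * X[:, 1] * X[:, 2] + rng.normal(0, 0.02, n)
tr, te = np.arange(0, 2400), np.arange(2400, n)

def library(X):
    """Constrained operator set: identity, tanh, pairwise products of 5 genes."""
    feats = [X, np.tanh(X)]
    for i in range(5):
        for j in range(i + 1, 5):
            feats.append((X[:, i] * X[:, j])[:, None])
    return np.hstack(feats)

# L1 penalty enforces the complexity budget: surviving terms = the equation.
sym = make_pipeline(StandardScaler(), Lasso(alpha=0.01))
sym.fit(library(X[tr]), y[tr])
n_terms = int(np.sum(sym[-1].coef_ != 0))

nn = make_pipeline(StandardScaler(),
                   MLPRegressor(hidden_layer_sizes=(64, 32), max_iter=500,
                                random_state=0))
nn.fit(X[tr], y[tr])

def rmse(y, yhat):
    return float(np.sqrt(np.mean((y - yhat) ** 2)))

sym_rmse = rmse(y[te], sym.predict(library(X[te])))
nn_rmse = rmse(y[te], nn.predict(X[te]))
print(f"symbolic: {n_terms} terms, RMSE {sym_rmse:.3f}; NN RMSE {nn_rmse:.3f}")
```

The same pattern scales to the real task: the operator library and the L1 strength are the levers for the ≤ 20 terms / ≤ 30 genes budget, and refitting across grouped folds and comparing surviving terms gives the stability check from the success criteria.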