## Business Context
You’re joining the Process ML team at BioForge Therapeutics, a biotech manufacturer producing a monoclonal antibody in 2,000L fed-batch bioreactors. Each run takes 12–14 days and costs roughly $180K in media, labor, and opportunity cost. Small improvements in final titer (g/L) translate to millions in annual revenue, but aggressive setpoints can cause oxygen limitation, foaming, or cell death, risking batch failure and regulatory deviations.
The upstream engineering team wants an ML-driven approach to optimize controllable process parameters (setpoints) using historical runs and a limited budget of new experiments. Your task is to propose and implement a Gaussian Process Regression (GPR)-based Bayesian optimization loop that recommends the next setpoints to try, balancing exploitation (higher yield) and exploration (uncertainty reduction), while respecting safety constraints.
## Dataset
You have data from 312 completed production and pilot runs over the last 18 months. Each run is summarized into a fixed feature vector of setpoints and early-run signals (first 72 hours), plus final outcomes.
| Feature Group | Count | Examples | Notes |
|---|---|---|---|
| Setpoints (decision variables) | 6 | temperature_C (35–38), pH_setpoint (6.8–7.2), DO_percent (20–60), agitation_rpm (80–220), feed_rate_gLday (0.5–2.5), antifoam_mL (0–30) | These are the knobs you can choose for the next run |
| Early process signals (context) | 10 | viable_cell_density_day3, glucose_day3, lactate_day3, OUR_peak_0_72h, pCO2_day3 | Available before committing to full run; partially missing |
| Categorical metadata | 4 | cell_line (A/B), media_lot, reactor_id, operator_shift | Potential batch effects |
| Targets | 3 | final_titer_gL, max_pCO2, batch_failed | Optimization target + safety/quality indicators |
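For reference, the decision box implied by the setpoint rows can be written down directly. This is a minimal sketch with bounds copied from the table above; the dict and the `clip_to_bounds` helper are illustrative names, not part of the dataset.

```python
# Hard bounds on the six decision variables, copied from the setpoint table.
SETPOINT_BOUNDS = {
    "temperature_C":   (35.0, 38.0),
    "pH_setpoint":     (6.8, 7.2),
    "DO_percent":      (20.0, 60.0),
    "agitation_rpm":   (80.0, 220.0),
    "feed_rate_gLday": (0.5, 2.5),
    "antifoam_mL":     (0.0, 30.0),
}

def clip_to_bounds(recipe):
    """Project a candidate recipe back into the feasible box before it is
    scored or recommended."""
    return {k: min(max(v, SETPOINT_BOUNDS[k][0]), SETPOINT_BOUNDS[k][1])
            for k, v in recipe.items()}
```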
Additional data characteristics:
- Size: 312 runs; 20 raw features (6 setpoints + 10 early signals + 4 categoricals), with more columns after one-hot encoding
- Target: Continuous final_titer_gL (typical range 2–8 g/L)
- Noise: Measurement noise and unobserved disturbances (lot variability) produce heteroscedasticity
- Missingness: ~12% missing in early signals (sensor downtime), not missing completely at random
- Failures: ~7% of runs have “batch_failed=1”, often with low titer; they must be handled carefully so the optimizer does not learn to recommend unsafe regions
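The drift and missingness characteristics above suggest a time-ordered split and leakage-safe imputation. Below is one simple sketch, assuming a run timestamp is available; median imputation is a deliberate simplification, since the early signals are not missing completely at random (a per-signal missingness indicator column would be a sensible companion).

```python
import numpy as np

def time_based_split(y, timestamps, test_frac=0.2):
    """Hold out the most recent runs so evaluation reflects process drift
    rather than a random shuffle of 18 months of data."""
    order = np.argsort(timestamps)
    n_test = int(len(y) * test_frac)
    return order[:-n_test], order[-n_test:]

def impute_early_signals(X, train_idx):
    """Median-impute NaN early signals using training rows only, so no
    test-set statistics leak into the model."""
    X = np.asarray(X, dtype=float).copy()
    medians = np.nanmedian(X[train_idx], axis=0)
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = medians[cols]
    return X
```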
## Success Criteria
- Offline (historical) modeling: achieve RMSE ≤ 0.55 g/L on a held-out test set and well-calibrated uncertainty (e.g., 90% predictive interval coverage between 85–95%).
- Optimization policy: in a simulated “recommend-next” evaluation, your acquisition strategy should produce a ≥ 8% improvement in expected titer over the current baseline recipe within 10 suggested experiments, while keeping predicted failure probability < 2%.
- Explainability: provide a clear rationale for recommended setpoints (e.g., sensitivity/partial dependence, uncertainty, and constraint margins).
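The offline criteria above reduce to two small metrics. This sketch assumes Gaussian predictive distributions (a mean and standard deviation per held-out run), which is exactly what a GPR returns; the function names are illustrative.

```python
import numpy as np
from scipy.stats import norm

def rmse(y_true, y_pred):
    """Root-mean-square error in the target's units (g/L for titer)."""
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

def interval_coverage(y_true, y_pred, y_std, level=0.90):
    """Fraction of held-out observations inside the central `level`
    predictive interval implied by Gaussian predictions. For the success
    criterion above, a 90% interval should cover roughly 85-95% of runs."""
    z = norm.ppf(0.5 + level / 2.0)          # ~1.645 for a 90% interval
    lo = np.asarray(y_pred) - z * np.asarray(y_std)
    hi = np.asarray(y_pred) + z * np.asarray(y_std)
    return float(np.mean((np.asarray(y_true) >= lo) & (np.asarray(y_true) <= hi)))
```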
## Constraints
- Experiment budget: at most 10 new runs per quarter.
- Safety constraints: must keep predicted max_pCO2 < 160 mmHg and P(batch_failed) < 2%.
- Compute: training must run on a single CPU machine (no GPU requirement); optimization should return a recommendation in < 2 minutes.
- Governance: recommendations must be auditable; avoid black-box heuristics without uncertainty estimates.
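One auditable way to encode the safety constraints is a screening predicate applied to every candidate before (or multiplied into) the acquisition score, e.g. constrained EI = EI × P(feasible). The sketch below is an assumed design: the one-sided z = 1.645 (~95% upper bound) on predicted max_pCO2 is a conservative choice, not something the spec mandates.

```python
def passes_safety(pCO2_mean, pCO2_std, p_fail,
                  pCO2_limit=160.0, fail_limit=0.02, z=1.645):
    """Conservative screen for a candidate recipe: require the one-sided
    ~95% upper bound on predicted max_pCO2 to clear the 160 mmHg limit,
    and the predicted failure probability to stay under 2%."""
    return (pCO2_mean + z * pCO2_std < pCO2_limit) and (p_fail < fail_limit)
```

Using an upper confidence bound rather than the mean means uncertain candidates near the pCO2 limit are rejected, which matches the governance requirement that recommendations carry uncertainty estimates.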
## Deliverables
- Describe how you would model titer with GPR, including kernel choice and how you’d treat categorical/batch effects.
- Define a Bayesian optimization loop: search space, acquisition function (e.g., EI/UCB), and how you incorporate constraints.
- Propose a validation strategy that avoids leakage and reflects real process drift (e.g., time-based split).
- Provide Python code that trains the model, evaluates calibration, and produces the next recommended setpoints.
- Discuss failure modes (e.g., optimizer exploiting model error) and how you’d monitor and retrain in production.
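A minimal end-to-end sketch of the modeling and acquisition deliverables, using scikit-learn on synthetic stand-in data (the real 312-run table is not reproduced here). The Matérn 5/2 + white-noise kernel and the random-candidate EI maximizer are illustrative assumptions; categorical/batch effects, missingness handling, and the safety constraints discussed above would still need to be layered on.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, Matern, WhiteKernel

rng = np.random.default_rng(0)
LO = np.array([35.0, 6.8, 20.0, 80.0, 0.5, 0.0])   # setpoint lower bounds
HI = np.array([38.0, 7.2, 60.0, 220.0, 2.5, 30.0])  # setpoint upper bounds

# Synthetic stand-in for the 312 historical runs (setpoints -> final titer).
X = rng.uniform(LO, HI, size=(312, 6))
y = 5.0 - 0.5 * (X[:, 0] - 36.5) ** 2 + 0.3 * X[:, 4] + rng.normal(0.0, 0.2, 312)

# Matern 5/2 with per-dimension length scales (ARD) on standardized inputs,
# plus a WhiteKernel to absorb run-to-run measurement noise.
mu_x, sd_x = X.mean(0), X.std(0)
Xn = (X - mu_x) / sd_x
kernel = ConstantKernel(1.0) * Matern(length_scale=np.ones(6), nu=2.5) + WhiteKernel(0.1)
gpr = GaussianProcessRegressor(kernel=kernel, normalize_y=True, random_state=0)
gpr.fit(Xn, y)

def expected_improvement(Xc, model, y_best, xi=0.01):
    """Standard EI for maximization; xi trades exploration vs. exploitation."""
    mu, sd = model.predict(Xc, return_std=True)
    sd = np.maximum(sd, 1e-9)
    z = (mu - y_best - xi) / sd
    return (mu - y_best - xi) * norm.cdf(z) + sd * norm.pdf(z)

# Cheap acquisition maximizer: score random candidates inside the box and
# recommend the one with the highest EI (fast enough for the 2-minute budget).
cand = rng.uniform(LO, HI, size=(2000, 6))
ei = expected_improvement((cand - mu_x) / sd_x, gpr, y.max())
best = cand[np.argmax(ei)]
```

In a real deployment the EI scores, predicted mean/std, and constraint margins for `best` would all be logged, giving the auditable rationale the governance constraint asks for.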