Business Context
You’re interviewing for a Senior ML Engineer role at BioPure Manufacturing, a contract development and manufacturing organization (CDMO) running 24/7 monoclonal antibody (mAb) purification for 30+ clients. A single downstream batch is worth $0.8–$2.5M, and failed lots trigger regulatory deviations and weeks of investigation. The downstream team wants to use AI to optimize purification process settings (chromatography + filtration) to increase yield while meeting strict purity and host-cell protein (HCP) limits.
The current “optimization” is manual: process engineers tune setpoints based on experience and small DoE studies. However, the plant has accumulated years of historian/LIMS data across different resins, columns, and feed materials. Your task is to design an ML approach that can predict outcomes and recommend controllable setpoints for the next run.
Dataset
You have a curated dataset combining data from the PI historian, chromatography skids, and LIMS assays.
| Feature Group | Count | Examples |
|---|---|---|
| Feed & upstream context | 10 | harvest_titer_gL, feed_conductivity_mS, feed_pH, impurity_load_mg, cell_viability_pct |
| Column & resin metadata | 9 | resin_type (A/B/C), column_id, column_diameter_cm, bed_height_cm, cycles_used, ligand_leakage_ppm |
| Controllable setpoints (decision variables) | 14 | load_flow_cmhr, wash_pH, wash_conductivity_mS, elution_pH, gradient_slope, pool_cut_start_mL, pool_cut_end_mL |
| Online sensors / derived signals | 18 | uv280_peak_area, pressure_max_bar, conductivity_profile_stats, temp_C, hold_time_min |
| Ops / environment | 6 | operator_shift, skid_id, CIP_time_min, buffer_lot_id |
- Size: ~62,000 runs (unit operations aggregated to “run-level”), spanning 4 sites and 3 years
- Targets (multi-objective):
  - yield_pct (continuous, 0–100)
  - purity_pct (continuous, 0–100)
  - hcp_ppm (continuous, heavy-tailed)
  - cycle_time_min (continuous)
- Data issues:
  - Missingness: ~12% missing in online sensor summaries (instrument downtime)
  - Non-stationarity: resin lots and buffer vendors changed over time
  - Leakage risk: some lab assays are only available post-run
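
The leakage and missingness issues above can be made concrete with a small sketch. It tags each column with the stage at which its value first becomes known and keeps only what is available at a chosen cutoff, adding a missingness flag per kept column. The `Stage` enum, the `AVAILABILITY` mapping, and the `features_at` helper are hypothetical names, and the stage assigned to each example column is an assumption for illustration, not something the dataset description states.

```python
from enum import IntEnum


class Stage(IntEnum):
    """When a value first becomes known, relative to the run."""
    PRE_RUN = 0   # known before setpoints are chosen
    IN_RUN = 1    # measured while the run executes
    POST_RUN = 2  # lab results reported after the run


# Hypothetical tags for a few of the example columns; a real mapping
# would come from feature lineage records, not from this sketch.
AVAILABILITY = {
    "harvest_titer_gL": Stage.PRE_RUN,
    "feed_pH": Stage.PRE_RUN,
    "cycles_used": Stage.PRE_RUN,
    "wash_pH": Stage.PRE_RUN,           # setpoint, fixed before the run
    "uv280_peak_area": Stage.IN_RUN,    # assumed to be an in-run signal
    "pressure_max_bar": Stage.IN_RUN,   # assumed to be an in-run signal
    "hcp_ppm": Stage.POST_RUN,          # target, assayed after the run
}


def features_at(row, cutoff, availability=AVAILABILITY):
    """Keep only values known by `cutoff`, plus a missingness flag for each.

    Columns without an availability tag are dropped (fail closed).
    """
    out = {}
    for name, value in row.items():
        stage = availability.get(name)
        if stage is None or stage > cutoff:
            continue
        out[name] = value
        out[name + "_missing"] = value is None
    return out


run = {
    "harvest_titer_gL": 4.2,
    "feed_pH": None,            # a missing reading
    "wash_pH": 7.0,
    "uv280_peak_area": 1350.0,
    "hcp_ppm": 38.0,
    "untagged_column": 1,
}
pre_run = features_at(run, Stage.PRE_RUN)
```

Here `pre_run` keeps only `harvest_titer_gL`, `feed_pH`, and `wash_pH` with their flags. Dropping untagged columns by default is a deliberate choice in this sketch: a new column cannot reach a decision-time model until someone has recorded when it becomes available.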
Success Criteria
You must deliver a model that supports setpoint optimization for future runs:
- Prediction quality: On a held-out, time-based test set, achieve:
  - RMSE(yield_pct) ≤ 3.5
  - RMSE(purity_pct) ≤ 1.2
  - RMSLE(hcp_ppm) ≤ 0.35
- Constraint satisfaction (recommendations): For recommended setpoints, the estimated probability of meeting both purity ≥ 98.5% and HCP ≤ 50 ppm should be ≥ 95%.
- Business impact: Demonstrate (offline) a policy that improves median yield by ≥ 1.0 percentage point without increasing cycle time by > 2%.
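
As a minimal sketch of how the prediction and constraint criteria could be computed, the functions below implement RMSE, RMSLE, and a sample-based estimate of the joint probability of meeting the purity and HCP limits. The brief does not define RMSLE, so the common `log1p` form is assumed here. `prob_in_spec` is a hypothetical helper that assumes the model can emit paired predictive samples (for example from an ensemble); the numbers in the usage lines are made up.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error over paired values."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def rmsle(y_true, y_pred):
    """RMSE on log1p-transformed values (assumed definition; inputs must be > -1)."""
    return rmse([math.log1p(t) for t in y_true],
                [math.log1p(p) for p in y_pred])


def prob_in_spec(samples, min_purity=98.5, max_hcp=50.0):
    """Fraction of paired (purity_pct, hcp_ppm) samples that meet both limits."""
    hits = sum(1 for purity, hcp in samples
               if purity >= min_purity and hcp <= max_hcp)
    return hits / len(samples)


# Illustrative values only, not plant data.
yield_rmse = rmse([82.0, 90.0], [85.0, 86.0])
hcp_rmsle = rmsle([30.0, 45.0], [30.0, 45.0])
p_ok = prob_in_spec([(99.0, 40.0), (98.0, 40.0), (99.0, 60.0), (99.5, 10.0)])
```

Counting samples that satisfy both limits at once estimates the joint probability directly, which matters because purity and HCP errors are likely correlated and multiplying two marginal probabilities would ignore that.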
Constraints
- Interpretability: Process engineers require understandable drivers (e.g., SHAP + monotonic expectations for certain variables).
- Optimization safety: Recommendations must stay within validated ranges (hard bounds) and avoid high-pressure excursions.
- Deployment: Batch scoring within 10 minutes for ~500 candidate setpoint configurations per run; retraining monthly.
- Data governance: Must support audit trails (model version, training data window, feature lineage).
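
One way to approach the audit-trail requirement is to store an immutable record alongside every scored batch. The sketch below is an assumption about what such a record might hold; the `AuditRecord` class, its field names, and the example values are hypothetical. It uses only the standard library to derive a deterministic SHA-256 fingerprint from the model version, training window, and feature lineage.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class AuditRecord:
    """Hypothetical audit entry attached to each batch of recommendations."""
    model_version: str
    train_window_start: str   # ISO date of the first training run
    train_window_end: str     # ISO date of the last training run
    feature_lineage: tuple    # (feature_name, source_system) pairs

    def fingerprint(self):
        """Deterministic SHA-256 over the record's canonical JSON form."""
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Example values are invented for illustration.
record = AuditRecord(
    model_version="2024.06-a",
    train_window_start="2021-06-01",
    train_window_end="2024-05-31",
    feature_lineage=(("feed_pH", "LIMS"), ("wash_pH", "PI historian")),
)
retrained = replace(record, model_version="2024.07-a")
```

Because the fingerprint depends on every field, a monthly retrain or a change in feature sources yields a different value, so a stored recommendation can be traced back to the exact model configuration that produced it.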
Deliverables
- Define the ML problem formulation (what you predict vs what you optimize) and justify it.
- Propose a modeling approach for multi-target prediction and uncertainty.
- Describe feature engineering and leakage prevention (what features are allowed at decision time).
- Specify train/validation/test splitting and cross-validation strategy for non-stationary manufacturing data.
- Provide an optimization strategy to recommend setpoints under constraints (not just predict outcomes).
- Provide an evaluation plan including offline metrics and a safe online rollout plan (shadow mode / human-in-the-loop).
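
Two of the deliverables lend themselves to a short sketch: forward-chaining validation splits for time-ordered runs, and a constrained search over candidate setpoints. This is one possible shape for an answer, not a reference solution. `forward_chaining_splits`, `recommend`, the `predict` callable and its output keys, the pressure limit, and the validated ranges are all hypothetical, and `toy_predict` is a made-up stand-in rather than a process model.

```python
def forward_chaining_splits(n_runs, n_folds, min_train):
    """Yield (train, valid) index ranges over runs sorted by time.

    Every validation block lies strictly after its training block.
    """
    fold = (n_runs - min_train) // n_folds
    for k in range(n_folds):
        end_train = min_train + k * fold
        yield range(0, end_train), range(end_train, end_train + fold)


def recommend(candidates, bounds, predict, min_prob=0.95, max_pressure_bar=3.0):
    """Return the (setpoints, prediction) pair with the highest predicted yield
    among candidates that pass every safety check, or None if none pass."""
    best = None
    for setpoints in candidates:
        # Hard bounds: reject anything outside the validated ranges.
        if not all(lo <= setpoints[name] <= hi
                   for name, (lo, hi) in bounds.items()):
            continue
        pred = predict(setpoints)
        # Chance constraint on quality, plus a pressure ceiling.
        if pred["p_in_spec"] < min_prob:
            continue
        if pred["pressure_max_bar"] > max_pressure_bar:
            continue
        if best is None or pred["yield_pct"] > best[1]["yield_pct"]:
            best = (setpoints, pred)
    return best


def toy_predict(sp):
    """Made-up surrogate for illustration only; not a process model."""
    flow = sp["load_flow_cmhr"]
    return {
        "yield_pct": 80.0 + 0.02 * flow,
        "p_in_spec": 0.99 - 0.0005 * flow,
        "pressure_max_bar": 1.0 + 0.01 * flow,
    }


# Hypothetical validated ranges and candidates.
bounds = {"load_flow_cmhr": (50.0, 300.0), "elution_pH": (3.2, 3.8)}
candidates = [
    {"load_flow_cmhr": 100.0, "elution_pH": 3.5},  # fails the 95% check
    {"load_flow_cmhr": 60.0, "elution_pH": 3.5},   # feasible
    {"load_flow_cmhr": 150.0, "elution_pH": 3.5},  # fails the 95% check
    {"load_flow_cmhr": 70.0, "elution_pH": 4.0},   # outside validated pH range
]
best = recommend(candidates, bounds, toy_predict)
splits = list(forward_chaining_splits(n_runs=10, n_folds=3, min_train=4))
```

Scoring an explicit candidate list and filtering before ranking keeps the safety logic separate from the model, which suits a budget of roughly 500 candidates per run. A gradient-free or Bayesian optimizer could generate the candidates instead, as long as the same filters are applied before anything is shown to an engineer.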