Business Context
You’re interviewing for an ML engineering role at VoltGrid, a logistics-and-energy operator that runs 18,000 industrial pumps and compressors across 420 facilities (ports, warehouses, and fuel depots). Unplanned failures can halt operations and trigger environmental compliance incidents. A single pump failure costs $30K–$120K in downtime and emergency repair, but unnecessary preventive maintenance also has a real cost: each truck roll is ~$900 plus lost utilization.
VoltGrid has instrumented assets with high-frequency sensors and maintains a CMMS (maintenance ticketing) system. The VP of Reliability wants a model that can predict whether an asset will fail in the next 7 days, so planners can schedule maintenance during low-demand windows.
The core challenge: failures are extremely rare and labels are noisy (some failures are never logged; some tickets are mislabeled as “failure”). You must propose a robust approach that works in production.
Dataset
You are given a feature table built from raw telemetry and maintenance logs. Each row is an asset-day snapshot.
| Feature Group | Count | Examples | Notes |
|---|---|---|---|
| Sensor aggregates | 28 | mean_vibration_1h, rms_vibration_24h, temp_max_24h, pressure_std_6h | Aggregated from 1 Hz telemetry into rolling windows |
| Trend / change features | 14 | vib_slope_7d, temp_delta_24h, pressure_drift_3d | Computed per asset; sensitive to missingness |
| Operating context | 9 | load_pct, rpm_avg_24h, start_stop_count_24h, ambient_temp | Context shifts across facilities |
| Asset metadata | 8 | asset_type, manufacturer, install_age_days, facility_id | High-cardinality facility_id |
| Maintenance history | 11 | days_since_last_service, last_service_type, tickets_90d, parts_replaced_180d | Derived from CMMS |
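For concreteness, the sensor aggregates above can be sketched as per-asset rolling windows in pandas; the column names, window length, and toy values here are illustrative, not the production schema:

```python
import pandas as pd

# Toy telemetry for one asset; real data is 1 Hz, aggregated upstream.
telemetry = pd.DataFrame({
    "asset_id": ["A1"] * 6,
    "ts": pd.date_range("2024-01-01", periods=6, freq="h"),
    "vibration": [0.10, 0.12, 0.11, 0.40, 0.42, 0.45],
})

# Per-asset rolling mean over the trailing 3 readings; min_periods=1
# still yields a value for short histories (e.g., after sensor dropouts).
telemetry["mean_vibration_3h"] = (
    telemetry.groupby("asset_id")["vibration"]
    .transform(lambda s: s.rolling(window=3, min_periods=1).mean())
)
```

The same pattern extends to std/max aggregates and to the trend features (e.g., a rolling slope), which is why those features are sensitive to missing rows.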
Additional details:
- Size: ~6.5M asset-days spanning 24 months
- Target:
fail_7d = 1 if a confirmed failure occurs within the next 7 days, else 0
- Class balance: ~0.06% positive (≈ 3,900 positives total)
- Missing data:
- 8–12% missing in some sensors due to dropouts
- Entire days missing for some assets (connectivity outages)
- Maintenance fields missing for newly onboarded facilities
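One leakage-safe way to materialize the target is to label each snapshot by looking only forward into a confirmed-failure event table. A minimal sketch follows; the window convention used here (failure strictly after the snapshot day, up to 7 days out) is an assumption to confirm with the team:

```python
from datetime import date, timedelta

def label_fail_7d(snapshot_day, failure_days, horizon_days=7):
    """1 iff a confirmed failure falls in (snapshot_day, snapshot_day + 7d].

    Assumed convention: the snapshot day itself does not count, so the
    label depends only on events after the features are frozen.
    """
    window = {snapshot_day + timedelta(days=k) for k in range(1, horizon_days + 1)}
    return int(bool(window & set(failure_days)))

# Hypothetical asset with one confirmed failure on Jan 8.
failures = [date(2024, 1, 8)]
labels = [label_fail_7d(d, failures)
          for d in (date(2024, 1, 1), date(2024, 1, 5), date(2024, 1, 20))]
# labels == [1, 1, 0]
```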
Success Criteria
Your model will be used to generate a daily risk list.
- Catch failures: Achieve ≥ 70% recall on failures (7-day horizon).
- Control false alarms: At the operating threshold, keep precision ≥ 10% (roughly ≤ 9 false alarms per true positive), because the maintenance team can only action ~120 work orders/day across the fleet.
- Rank quality: The top 1% highest-risk asset-days should contain ≥ 25% of all failures (high lift for triage).
- Reliability: Risk scores should be reasonably calibrated so planners can interpret “0.3” vs “0.05”.
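The operating-point checks above can be computed directly from scored rows. A minimal sketch, with the function name and top-fraction parameter chosen for illustration:

```python
def metrics_at_threshold(y_true, scores, threshold, top_frac=0.01):
    """Recall and precision at a score threshold, plus the share of all
    failures captured in the top_frac highest-risk rows (triage lift)."""
    alerts = [t for t, s in zip(y_true, scores) if s >= threshold]
    tp = sum(alerts)
    total_pos = sum(y_true)
    recall = tp / total_pos
    precision = tp / len(alerts) if alerts else 0.0
    k = max(1, int(len(scores) * top_frac))
    top_k = sorted(zip(scores, y_true), reverse=True)[:k]
    capture = sum(t for _, t in top_k) / total_pos
    return recall, precision, capture
```

In practice these would be swept over thresholds (a precision–recall curve) and paired with a calibration check such as a reliability diagram.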
Constraints
- Leakage risk: Maintenance tickets and some sensor flags may be recorded after a failure begins. You must prevent label leakage.
- Split strategy: Assets appear across time; a random row-level split puts the same asset (and near-duplicate adjacent days) in both train and test, overestimating performance.
- Latency: Daily batch scoring must finish in < 30 minutes on a modest Spark/CPU cluster; per-row inference should be < 5 ms in Python for integration tests.
- Interpretability: Reliability engineers require top drivers per alert (e.g., “vibration slope up 3× baseline”).
- Drift: Seasonality and facility operating regimes change; the model should be retrained at least monthly.
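One way to address the split-strategy constraint is a time-based split with a gap at least as long as the label horizon, so that no training label window overlaps validation outcomes. A sketch under that assumption (names and record layout are illustrative):

```python
from datetime import date, timedelta

def time_gap_split(rows, train_end, gap_days=7):
    """rows: iterable of (asset_id, day) asset-day records.
    Train on days <= train_end; validate only after a gap of gap_days,
    so the 7-day label windows of late training rows cannot reach into
    the validation period."""
    valid_start = train_end + timedelta(days=gap_days)
    train = [r for r in rows if r[1] <= train_end]
    valid = [r for r in rows if r[1] > valid_start]
    return train, valid

rows = [("A1", date(2024, 1, d)) for d in range(1, 21)]
train, valid = time_gap_split(rows, train_end=date(2024, 1, 10))
```

This can be combined with an asset-level holdout (entire assets reserved for test) to measure generalization to unseen equipment.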
Deliverables (what you must produce in the interview)
- A modeling approach for extreme class imbalance (including loss weighting / sampling and why).
- A leakage-safe labeling and feature windowing plan (what data is allowed at prediction time).
- A train/validation/test strategy that reflects production (time split, asset split, or both).
- A thresholding and evaluation plan using PR-AUC, recall/precision at an operating point, lift, and calibration.
- A production plan: monitoring, retraining cadence, and how you’d handle noisy/missing labels.
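For the imbalance deliverable, one common option is to downsample negatives during training and then apply a prior correction to predicted probabilities at serving time, which preserves calibration at the original base rate. A sketch; the keep rate is a tunable assumption:

```python
import random

def downsample_negatives(examples, labels, keep_rate, seed=0):
    """Keep all positives and roughly a keep_rate fraction of negatives."""
    rng = random.Random(seed)
    return [(x, y) for x, y in zip(examples, labels)
            if y == 1 or rng.random() < keep_rate]

def correct_prob(p_model, keep_rate):
    """Undo the prior shift: a model trained on negatives downsampled at
    keep_rate overestimates risk; map scores back to the original prior."""
    return p_model / (p_model + (1.0 - p_model) / keep_rate)
```

Loss weighting (e.g., scikit-learn's class_weight or XGBoost's scale_pos_weight) is the main alternative; downsampling additionally shrinks the 6.5M-row training set, which helps meet the batch-latency constraint, at the cost of discarding some negatives.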