Business Context
You’re interviewing for an ML engineering role at VoltGrid, a logistics-and-energy operator that runs 18,000 industrial pumps and compressors across 420 facilities (ports, warehouses, and fuel depots). Unplanned failures can halt operations and trigger environmental compliance incidents. A single pump failure costs $30K–$120K in downtime and emergency repair, but unnecessary preventive maintenance also has a real cost: each truck roll is ~$900 plus lost utilization.
VoltGrid has instrumented assets with high-frequency sensors and maintains a CMMS (maintenance ticketing) system. The VP of Reliability wants a model that can predict whether an asset will fail in the next 7 days, so planners can schedule maintenance during low-demand windows.
The core challenge: failures are extremely rare and labels are noisy (some failures are never logged; some tickets are mislabeled as “failure”). You must propose a robust approach that works in production.
Dataset
You are given a feature table built from raw telemetry and maintenance logs. Each row is an asset-day snapshot.
| Feature Group | Count | Examples | Notes |
|---|---|---|---|
| Sensor aggregates | 28 | mean_vibration_1h, rms_vibration_24h, temp_max_24h, pressure_std_6h | Aggregated from 1 Hz telemetry into rolling windows |
| Trend / change features | 14 | vib_slope_7d, temp_delta_24h, pressure_drift_3d | Computed per asset; sensitive to missingness |
| Operating context | 9 | load_pct, rpm_avg_24h, start_stop_count_24h, ambient_temp | Context shifts across facilities |
| Asset metadata | 8 | asset_type, manufacturer, install_age_days, facility_id | High-cardinality facility_id |
| Maintenance history | 11 | days_since_last_service, last_service_type, tickets_90d, parts_replaced_180d | Derived from CMMS |
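For concreteness, the sensor aggregates above can be sketched as per-asset rolling windows in pandas; the column names, window length, and toy values here are illustrative, not the production schema:

```python
import pandas as pd

# Toy telemetry for one asset; real data is 1 Hz, aggregated upstream.
telemetry = pd.DataFrame({
    "asset_id": ["A1"] * 6,
    "ts": pd.date_range("2024-01-01", periods=6, freq="h"),
    "vibration": [0.10, 0.12, 0.11, 0.40, 0.42, 0.45],
})

# Per-asset rolling mean over the trailing 3 readings; min_periods=1
# still yields a value for short histories (e.g., after sensor dropouts).
telemetry["mean_vibration_3h"] = (
    telemetry.groupby("asset_id")["vibration"]
    .transform(lambda s: s.rolling(window=3, min_periods=1).mean())
)
```

The same pattern extends to std/max aggregates and to the trend features (e.g., a rolling slope), which is why those features are sensitive to missing rows.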
Additional details:
- Size: ~6.5M asset-days spanning 24 months
- Target:
fail_7d = 1 if a confirmed failure occurs within the next 7 days, else 0
- Class balance: ~0.06% positive (≈ 3,900 positives total)
- Missing data:
- 8–12% missing in some sensors due to dropouts
- Entire days missing for some assets (connectivity outages)
- Maintenance fields missing for newly onboarded facilities
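One leakage-safe way to materialize the target is to label each snapshot by looking only forward into a confirmed-failure event table. A minimal sketch follows; the window convention used here (failure strictly after the snapshot day, up to 7 days out) is an assumption to confirm with the team:

```python
from datetime import date, timedelta

def label_fail_7d(snapshot_day, failure_days, horizon_days=7):
    """1 iff a confirmed failure falls in (snapshot_day, snapshot_day + 7d].

    Assumed convention: the snapshot day itself does not count, so the
    label depends only on events after the features are frozen.
    """
    window = {snapshot_day + timedelta(days=k) for k in range(1, horizon_days + 1)}
    return int(bool(window & set(failure_days)))

# Hypothetical asset with one confirmed failure on Jan 8.
failures = [date(2024, 1, 8)]
labels = [label_fail_7d(d, failures)
          for d in (date(2024, 1, 1), date(2024, 1, 5), date(2024, 1, 20))]
# labels == [1, 1, 0]
```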
Success Criteria
Your model will be used to generate a daily risk list.
- Catch failures: Achieve ≥ 70% recall on failures (7-day horizon).
- Control false alarms: At the operating threshold, keep precision ≥ 10% (roughly ≤ 9 false alarms per true positive), because the maintenance team can only action ~120 work orders/day across the fleet.
- Rank quality: The top 1% highest-risk asset-days should contain ≥ 25% of all failures (high lift for triage).
- Reliability: Risk scores should be reasonably calibrated so planners can interpret “0.3” vs “0.05”.
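The operating-point checks above can be computed directly from scored rows. A minimal sketch, with the function name and top-fraction parameter chosen for illustration:

```python
def metrics_at_threshold(y_true, scores, threshold, top_frac=0.01):
    """Recall and precision at a score threshold, plus the share of all
    failures captured in the top_frac highest-risk rows (triage lift)."""
    alerts = [t for t, s in zip(y_true, scores) if s >= threshold]
    tp = sum(alerts)
    total_pos = sum(y_true)
    recall = tp / total_pos
    precision = tp / len(alerts) if alerts else 0.0
    k = max(1, int(len(scores) * top_frac))
    top_k = sorted(zip(scores, y_true), reverse=True)[:k]
    capture = sum(t for _, t in top_k) / total_pos
    return recall, precision, capture
```

In practice these would be swept over thresholds (a precision–recall curve) and paired with a calibration check such as a reliability diagram.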
Constraints
- Leakage risk: Maintenance tickets and some sensor flags may be recorded after a failure begins. You must prevent label leakage.
- Split strategy: Assets appear across time; a random row-level split puts the same asset (and near-duplicate adjacent days) in both train and test, overestimating performance.
- Latency: Daily batch scoring must finish in < 30 minutes on a modest Spark/CPU cluster; per-row inference should be < 5 ms in Python for integration tests.
- Interpretability: Reliability engineers require top drivers per alert (e.g., “vibration slope up 3× baseline”).
- Drift: Seasonality and facility operating regimes change; the model should be retrained at least monthly.
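One way to address the split-strategy constraint is a time-based split with a gap at least as long as the label horizon, so that no training label window overlaps validation outcomes. A sketch under that assumption (names and record layout are illustrative):

```python
from datetime import date, timedelta

def time_gap_split(rows, train_end, gap_days=7):
    """rows: iterable of (asset_id, day) asset-day records.
    Train on days <= train_end; validate only after a gap of gap_days,
    so the 7-day label windows of late training rows cannot reach into
    the validation period."""
    valid_start = train_end + timedelta(days=gap_days)
    train = [r for r in rows if r[1] <= train_end]
    valid = [r for r in rows if r[1] > valid_start]
    return train, valid

rows = [("A1", date(2024, 1, d)) for d in range(1, 21)]
train, valid = time_gap_split(rows, train_end=date(2024, 1, 10))
```

This can be combined with an asset-level holdout (entire assets reserved for test) to measure generalization to unseen equipment.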
Deliverables (what you must produce in the interview)
- A modeling approach for extreme class imbalance (including loss weighting / sampling and why).
- A leakage-safe labeling and feature windowing plan (what data is allowed at prediction time).
- A train/validation/test strategy that reflects production (time split, asset split, or both).
- A thresholding and evaluation plan using PR-AUC, recall/precision at an operating point, lift, and calibration.
- A production plan: monitoring, retraining cadence, and how you’d handle noisy/missing labels.
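For the imbalance deliverable, one common option is to downsample negatives during training and then apply a prior correction to predicted probabilities at serving time, which preserves calibration at the original base rate. A sketch; the keep rate is a tunable assumption:

```python
import random

def downsample_negatives(examples, labels, keep_rate, seed=0):
    """Keep all positives and roughly a keep_rate fraction of negatives."""
    rng = random.Random(seed)
    return [(x, y) for x, y in zip(examples, labels)
            if y == 1 or rng.random() < keep_rate]

def correct_prob(p_model, keep_rate):
    """Undo the prior shift: a model trained on negatives downsampled at
    keep_rate overestimates risk; map scores back to the original prior."""
    return p_model / (p_model + (1.0 - p_model) / keep_rate)
```

Loss weighting (e.g., scikit-learn's class_weight or XGBoost's scale_pos_weight) is the main alternative; downsampling additionally shrinks the 6.5M-row training set, which helps meet the batch-latency constraint, at the cost of discarding some negatives.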