Context
Agero’s roadside assistance and dispatch teams rely on ML models for ETA prediction and service-provider matching. The current training pipeline was built incrementally in Amazon MWAA (Airflow) from Python scripts, ad hoc Spark jobs, and direct writes into Snowflake. Rebuilding daily features now takes 9 to 11 hours, backfills are difficult, and the pipeline frequently breaks when schemas change.
You are asked to redesign this pipeline to reduce technical debt and improve scalability without disrupting downstream model training or analytics consumers in Agero’s data platform.
Scale Requirements
- Sources: dispatch events, provider status updates, GPS pings, claims/service outcomes, and third-party traffic feeds
- Volume: 180M GPS/location events/day, 25M dispatch lifecycle events/day, 4 TB raw data/day
- Latency: training features available by 6:00 AM ET; selected operational features refreshed every 15 minutes
- Retention: 24 months raw, 12 months curated feature history
- Backfill: must support replaying 90 days of feature generation in under 36 hours
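The figures above imply some baseline throughput targets. The following is a rough back-of-envelope sketch, not a sizing recommendation. It assumes that load is spread evenly over the day (real traffic will peak well above the average), that a backfill rescans all raw data for the replayed window, and that retention is counted as 365 days per year with no compression. The variable names are illustrative.

```python
# Back-of-envelope sizing from the stated scale requirements.
# Assumptions (not from the brief): uniform load across the day, backfill
# rescans all raw data, retention uses 365-day years with no compression.
SECONDS_PER_DAY = 24 * 60 * 60

gps_events_per_day = 180_000_000
dispatch_events_per_day = 25_000_000
raw_tb_per_day = 4

backfill_days = 90
backfill_budget_hours = 36
raw_retention_years = 2  # 24 months


def per_second(events_per_day: int) -> float:
    """Average events per second, assuming a uniform daily load."""
    return events_per_day / SECONDS_PER_DAY


gps_eps = per_second(gps_events_per_day)
dispatch_eps = per_second(dispatch_events_per_day)

# Total raw data a 90-day replay would need to read, and the sustained
# read rate needed to finish inside the 36-hour budget.
backfill_tb = raw_tb_per_day * backfill_days
backfill_tb_per_hour = backfill_tb / backfill_budget_hours

# How much faster than real time the replay must run.
replay_speedup = backfill_days * 24 / backfill_budget_hours

# Approximate raw storage footprint at full retention.
raw_retention_tb = raw_tb_per_day * 365 * raw_retention_years

print(f"GPS events/s (avg):       {gps_eps:,.0f}")
print(f"Dispatch events/s (avg):  {dispatch_eps:,.0f}")
print(f"Backfill volume:          {backfill_tb} TB")
print(f"Backfill read rate:       {backfill_tb_per_hour:.1f} TB/hour")
print(f"Replay speedup required:  {replay_speedup:.0f}x real time")
print(f"Raw retention footprint:  {raw_retention_tb:,} TB")
```

Under these assumptions the backfill target works out to about 10 TB/hour, or 60x real time. A design that backfills from curated intermediate data instead of raw data would need to read less.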
Requirements
- Design a refactored pipeline for ingesting raw operational data into a reliable bronze/silver/gold layout.
- Separate reusable feature computation from model-specific logic so multiple ETA and matching models can share the same feature sets.
- Support both daily batch training datasets and near-real-time (15-minute) feature refreshes for operational reporting.
- Ensure idempotent reruns, deterministic backfills, and schema evolution handling.
- Add data quality checks for null spikes, duplicate event IDs, late-arriving records, and feature freshness.
- Orchestrate dependencies across ingestion, transformation, validation, and model dataset publication.
- Preserve compatibility for existing Snowflake consumers during migration.
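To make the data quality requirement concrete, here is a minimal sketch of the four named checks, written in plain Python over a list of row dictionaries. The field names (event_id, event_time, ingested_at, provider_id), the thresholds, and the baseline null rate are all hypothetical and chosen only for illustration. A production design would more likely run equivalent checks inside the transformation engine or a validation framework.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone


def null_rate(rows, column):
    """Fraction of rows where `column` is missing or None."""
    if not rows:
        return 0.0
    return sum(1 for r in rows if r.get(column) is None) / len(rows)


def has_null_spike(rows, column, baseline_rate, max_increase=0.05):
    """True if the null rate exceeds the baseline by more than max_increase."""
    return null_rate(rows, column) - baseline_rate > max_increase


def duplicate_event_ids(rows):
    """Sorted list of event IDs that appear more than once."""
    counts = Counter(r["event_id"] for r in rows)
    return sorted(eid for eid, n in counts.items() if n > 1)


def late_records(rows, max_lag=timedelta(hours=1)):
    """Rows ingested more than max_lag after their event time."""
    return [r for r in rows if r["ingested_at"] - r["event_time"] > max_lag]


def is_fresh(last_updated, now, max_age=timedelta(minutes=15)):
    """True if a feature was refreshed within max_age of `now`."""
    return now - last_updated <= max_age


# Hypothetical sample batch: one null provider_id, one duplicated event ID,
# and one record that arrived three hours after its event time.
t0 = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
rows = [
    {"event_id": "a", "event_time": t0,
     "ingested_at": t0 + timedelta(minutes=2), "provider_id": "p1"},
    {"event_id": "b", "event_time": t0,
     "ingested_at": t0 + timedelta(hours=3), "provider_id": None},
    {"event_id": "a", "event_time": t0,
     "ingested_at": t0 + timedelta(minutes=5), "provider_id": "p2"},
    {"event_id": "c", "event_time": t0,
     "ingested_at": t0 + timedelta(minutes=1), "provider_id": "p3"},
]

spike = has_null_spike(rows, "provider_id", baseline_rate=0.01)
dupes = duplicate_event_ids(rows)
late = late_records(rows)
fresh = is_fresh(last_updated=t0, now=t0 + timedelta(minutes=10))
stale = not is_fresh(last_updated=t0, now=t0 + timedelta(minutes=20))

print(spike, dupes, [r["event_id"] for r in late], fresh, stale)
```

Each check returns a plain value instead of raising, so an orchestration step can decide whether a failure should block publication or only raise an alert.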
Constraints
- AWS is the primary cloud; prefer services already common in Agero’s stack.
- Team size: 5 data engineers, 2 ML engineers; minimize bespoke infrastructure.
- Budget target: <20% increase over current monthly pipeline spend.
- PII and location data require auditability, access controls, and reproducible lineage for incident review.