Context
FleetFlow, a mid-size logistics platform, stores driver profiles, shipment orders, GPS pings, and delivery status updates across a transactional PostgreSQL database and partner carrier APIs. The current setup relies on ad hoc nightly exports, causing stale reporting, inconsistent driver-shipment joins, and poor visibility into late deliveries and compliance metrics.
You need to design a production-grade data pipeline and storage model that supports both operational reporting and analytics for driver utilization, shipment lifecycle tracking, and SLA monitoring.
Scale Requirements
- Drivers: 250K active drivers
- Shipments: 8M new shipments/day
- Tracking events: 35K events/sec peak, 8K events/sec average
- GPS pings: 120M records/day
- Latency target: shipment status visible in analytics within 3 minutes; batch reconciliations completed by 6 AM UTC
- Retention: raw events for 180 days, curated warehouse tables for 3 years
- Storage: ~12 TB/month compressed raw data
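To put the daily volumes above on the same per-second footing as the tracking-event rate, here is a rough back-of-envelope sketch. It uses only the figures listed above, plus one assumption that is not in the brief: a month is treated as 30 days when converting the 180-day raw retention window into storage.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Figures taken from the scale requirements above.
SHIPMENTS_PER_DAY = 8_000_000
GPS_PINGS_PER_DAY = 120_000_000
TRACKING_PEAK_PER_SEC = 35_000
TRACKING_AVG_PER_SEC = 8_000
RAW_TB_PER_MONTH = 12
RAW_RETENTION_DAYS = 180

# Assumption (not in the brief): a month is treated as 30 days.
DAYS_PER_MONTH = 30


def per_second(daily_count: int) -> float:
    """Average events per second implied by a daily volume."""
    return daily_count / SECONDS_PER_DAY


shipments_per_sec = per_second(SHIPMENTS_PER_DAY)
gps_pings_per_sec = per_second(GPS_PINGS_PER_DAY)

# How much headroom the streaming path needs over its average load.
peak_to_avg_ratio = TRACKING_PEAK_PER_SEC / TRACKING_AVG_PER_SEC

# Tracking events per day if the average rate held all day.
tracking_events_per_day = TRACKING_AVG_PER_SEC * SECONDS_PER_DAY

# Compressed raw data held at steady state under the 180-day retention rule.
raw_retained_tb = RAW_TB_PER_MONTH * RAW_RETENTION_DAYS / DAYS_PER_MONTH

if __name__ == "__main__":
    print(f"shipments/sec (avg):      {shipments_per_sec:,.0f}")
    print(f"GPS pings/sec (avg):      {gps_pings_per_sec:,.0f}")
    print(f"tracking peak/avg ratio:  {peak_to_avg_ratio:.3f}")
    print(f"tracking events/day:      {tracking_events_per_day:,}")
    print(f"raw data retained (TB):   {raw_retained_tb:.0f}")
```

Under these assumptions the steady-state raw footprint is about 72 TB compressed, and the streaming path has to absorb bursts of roughly 4.4 times its average rate.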
Requirements
- Design a data model for core entities: drivers, shipments, shipment_events, driver_assignments, and gps_locations.
- Build ingestion for both batch sources (PostgreSQL CDC snapshots, partner SFTP files) and streaming sources (status events, GPS telemetry).
- Support idempotent processing for duplicate shipment updates and out-of-order driver assignment events.
- Create curated warehouse tables for shipment SLA, driver utilization, on-time delivery, and exception reporting.
- Define orchestration, backfill strategy, and schema evolution handling for partner feeds.
- Include data quality checks for missing driver IDs, invalid status transitions, duplicate shipment IDs, and late-arriving events.
- Provide monitoring, alerting, and failure recovery for ingestion, transformation, and warehouse loading.
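The idempotency and data quality requirements above can be illustrated with a minimal in-memory sketch. It is not a prescribed design: the `ShipmentEvent` and `ShipmentState` classes, the `event_id` deduplication key, and the status graph in `ALLOWED_TRANSITIONS` are all hypothetical, since the brief does not define the event schema or the valid statuses. The idea is that duplicates are dropped by event ID, late arrivals are slotted in by event time rather than arrival order, and invalid status transitions are flagged rather than silently discarded, so the status history stays auditable.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import List, Optional, Tuple

# Hypothetical status graph; the brief does not define the real one.
ALLOWED_TRANSITIONS = {
    None: {"CREATED"},
    "CREATED": {"ASSIGNED"},
    "ASSIGNED": {"PICKED_UP"},
    "PICKED_UP": {"IN_TRANSIT"},
    "IN_TRANSIT": {"DELIVERED", "EXCEPTION"},
    "EXCEPTION": {"IN_TRANSIT", "DELIVERED"},
    "DELIVERED": set(),
}


@dataclass(frozen=True)
class ShipmentEvent:
    event_id: str        # assumed to be unique per logical update
    shipment_id: str
    status: str
    event_time: datetime  # when the event happened, not when it arrived


@dataclass
class ShipmentState:
    seen_event_ids: set = field(default_factory=set)
    history: list = field(default_factory=list)  # kept sorted by event_time

    def apply(self, event: ShipmentEvent) -> bool:
        """Apply an event once; return False if it was a duplicate."""
        if event.event_id in self.seen_event_ids:
            return False
        self.seen_event_ids.add(event.event_id)
        self.history.append(event)
        # Re-sort so late arrivals land in event-time order; event_id breaks ties.
        self.history.sort(key=lambda e: (e.event_time, e.event_id))
        return True

    @property
    def current_status(self) -> Optional[str]:
        return self.history[-1].status if self.history else None

    def invalid_transitions(self) -> List[Tuple[Optional[str], str]]:
        """Return (previous, next) status pairs that violate the graph."""
        bad = []
        prev = None
        for e in self.history:
            if e.status not in ALLOWED_TRANSITIONS.get(prev, set()):
                bad.append((prev, e.status))
            prev = e.status
        return bad


# Usage: two late events and one duplicate, delivered out of order.
t0 = datetime(2024, 1, 1, 8, 0, tzinfo=timezone.utc)
events = [
    ShipmentEvent("e1", "s1", "CREATED", t0),
    ShipmentEvent("e4", "s1", "IN_TRANSIT", t0 + timedelta(minutes=30)),
    ShipmentEvent("e3", "s1", "PICKED_UP", t0 + timedelta(minutes=20)),   # late
    ShipmentEvent("e2", "s1", "ASSIGNED", t0 + timedelta(minutes=10)),    # late
    ShipmentEvent("e4", "s1", "IN_TRANSIT", t0 + timedelta(minutes=30)),  # duplicate
]
state = ShipmentState()
applied = [state.apply(e) for e in events]
```

In a warehouse setting the same logic would more likely be expressed as a keyed merge on the event ID followed by a window function over event time; the in-memory version is only meant to make the semantics concrete. Replaying the whole batch leaves the state unchanged, which is the property a backfill relies on.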
Constraints
- Infrastructure must stay on AWS.
- The team has strong SQL and Airflow skills but limited Flink experience.
- The incremental monthly budget is capped at $30K.
- Must support auditability for shipment status history and PII minimization for driver data.
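One possible reading of the PII minimization constraint is sketched below: direct identifiers are dropped before driver records reach curated tables, and the natural driver ID is replaced with a keyed hash so joins across tables still work. The field names in `DIRECT_IDENTIFIERS`, the `driver_key` column, and the helper functions are hypothetical, because the brief does not list the driver profile fields. The secret key is assumed to come from a managed secrets store rather than from code.

```python
import hashlib
import hmac

# Hypothetical column names; the brief does not list the driver profile fields.
DIRECT_IDENTIFIERS = {
    "full_name",
    "phone",
    "email",
    "license_number",
    "home_address",
}


def pseudonymize(driver_id: str, secret_key: bytes) -> str:
    """Keyed SHA-256 hash of the driver ID; stable for a given key."""
    return hmac.new(secret_key, driver_id.encode("utf-8"), hashlib.sha256).hexdigest()


def minimize_driver_record(record: dict, secret_key: bytes) -> dict:
    """Drop direct identifiers and swap driver_id for a pseudonymous key."""
    curated = {
        name: value
        for name, value in record.items()
        if name not in DIRECT_IDENTIFIERS and name != "driver_id"
    }
    curated["driver_key"] = pseudonymize(str(record["driver_id"]), secret_key)
    return curated


# Usage with a made-up record and a placeholder key (never hard-code a real key).
example_key = b"placeholder-key-for-illustration-only"
raw_record = {
    "driver_id": 1042,
    "full_name": "Example Driver",
    "email": "driver@example.com",
    "home_region": "us-east",
    "vehicle_class": "van",
}
curated_record = minimize_driver_record(raw_record, example_key)
```

A keyed hash is used here instead of a plain hash because driver IDs are low-entropy and an unkeyed digest could be reversed by enumeration. Whether pseudonymization alone is sufficient depends on the compliance regime, which the brief leaves open.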