Context
NorthGrid Logistics operates 120 fulfillment centers and currently builds shift plans manually from exported WMS and labor-management reports. Each site receives inbound, picking, packing, and shipping forecasts, but the reports arrive in inconsistent CSV/API formats and are often stale or incomplete.
You need to design a production data pipeline that ingests warehouse activity and labor reports, standardizes them, and produces an hourly shift-planning dataset used by operations managers before each shift.
Scale Requirements
- Warehouses: 120 active sites across North America
- Sources: 3 WMS vendors, 2 labor-management systems, plus HR roster data
- Volume: ~18M report rows/day, ~120 GB raw compressed data/day
- Refresh cadence: every 15 minutes for intraday replanning; full daily rebuild at 02:00 local site time
- Latency target: source arrival to planning table available in < 30 minutes
- Retention: raw data 1 year, curated planning tables 3 years
Requirements
- Ingest batch files (CSV, XLSX, JSON) from SFTP and REST APIs for WMS workload, backlog, productivity, attendance, and labor standards.
- Normalize all sources into a canonical warehouse-operations model with entities such as
site, department, task_type, interval_start, planned_hours, actual_hours, units, and productivity_rate.
- Build transformations that convert workload forecasts and current backlog into required labor hours by department and 1-hour interval.
- Join labor availability data (scheduled associates, absenteeism, overtime rules, cross-trained workers) to generate staffing gaps and recommended shift actions.
- Support idempotent reruns, late-arriving files, and backfills for at least 30 days.
- Publish outputs to Snowflake tables and a BI layer used by site managers by 05:00 local time and every 15 minutes thereafter.
- Implement data quality checks for schema drift, duplicate files, missing sites, negative units, and outlier productivity assumptions.
Constraints
- AWS is the required cloud; existing stack includes S3, Snowflake, and Airflow.
- Incremental monthly budget increase is capped at $35K.
- No source-system changes are allowed; some vendors only provide daily file drops.
- Labor recommendations must be auditable because they affect overtime and staffing decisions.