Context
ShopWave, a mobile commerce company, runs a multi-step product analytics workflow to build daily dashboards for product, growth, and finance teams. The current process relies on loosely scheduled SQL jobs and Python scripts, so downstream jobs can start before their upstream dependencies finish, which produces partial metrics and inconsistent reports.
You need to redesign the workflow so dependencies are explicit, failures are isolated, and reruns do not corrupt analytics tables.
Scale Requirements
- Sources: app events, backend orders, user profiles, and product catalog updates
- Volume: 250M events/day, 40M order records/day, 15 TB historical warehouse data
- Workflow cadence: hourly for near-real-time aggregates, daily for finalized reporting
- Latency target: hourly metrics available within 20 minutes of the hour; daily workflow complete by 6:00 AM UTC
- Concurrency: 20-30 DAG tasks per run, up to 10 overlapping backfills
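As a rough sizing aid, the daily volumes above can be converted into average hourly and per-second rates. This is a back-of-the-envelope sketch that assumes traffic arrives uniformly across the day, which real mobile traffic will not do, so peak rates should be expected to be noticeably higher. The helper names are illustrative only.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Volumes taken from the scale requirements above.
EVENTS_PER_DAY = 250_000_000
ORDERS_PER_DAY = 40_000_000


def per_hour(daily_volume: int) -> float:
    """Average records per hourly partition, assuming uniform arrival."""
    return daily_volume / 24


def per_second(daily_volume: int) -> float:
    """Average records per second, assuming uniform arrival."""
    return daily_volume / SECONDS_PER_DAY


if __name__ == "__main__":
    for name, volume in [("events", EVENTS_PER_DAY), ("orders", ORDERS_PER_DAY)]:
        print(f"{name}: {per_hour(volume):,.0f}/hour, {per_second(volume):,.0f}/second")
```

Under the uniform-arrival assumption, each hourly run handles on the order of ten million events, which is the amount of data that must be ingested, validated, and aggregated inside the 20-minute hourly latency target.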
Requirements
- Design a dependency-aware workflow for stages such as raw ingestion, validation, sessionization, attribution, metric aggregation, and dashboard publishing.
- Ensure downstream tasks run only when upstream datasets are complete and validated.
- Support idempotent reruns for a single task, a full DAG run, or a historical backfill without duplicating data.
- Define how task state, retries, SLAs, and late-arriving data should be handled.
- Include data quality checks at critical boundaries before publishing analytics tables.
- Explain how you would model dependencies between hourly and daily jobs, especially when daily jobs depend on all hourly partitions being complete.
- Describe monitoring, alerting, and operational ownership for failed or delayed dependencies.
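One way to reason about the hourly-to-daily dependency is as a completeness gate: the daily job for a given UTC date is eligible to run only when all 24 hourly partitions for that date have been marked validated. The sketch below is a minimal, orchestrator-agnostic illustration. It assumes partitions are keyed by their UTC hour start and that some upstream process records which partitions passed validation; the function names and the `validated` set are hypothetical, not part of any specific scheduler's API.

```python
from datetime import date, datetime, timedelta, timezone


def expected_hourly_partitions(day: date) -> list[datetime]:
    """Return the 24 UTC hour-start timestamps that make up one reporting day."""
    start = datetime(day.year, day.month, day.day, tzinfo=timezone.utc)
    return [start + timedelta(hours=h) for h in range(24)]


def missing_partitions(day: date, validated: set[datetime]) -> list[datetime]:
    """List the hourly partitions that have not yet been marked validated."""
    return [p for p in expected_hourly_partitions(day) if p not in validated]


def daily_ready(day: date, validated: set[datetime]) -> bool:
    """The daily job may start only when no hourly partition is missing."""
    return not missing_partitions(day, validated)


if __name__ == "__main__":
    day = date(2024, 1, 1)
    validated = set(expected_hourly_partitions(day))
    validated.discard(datetime(2024, 1, 1, 13, tzinfo=timezone.utc))
    print(daily_ready(day, validated))         # False: hour 13 is not validated
    print(missing_partitions(day, validated))  # the single missing partition
```

Returning the list of missing partitions, rather than only a boolean, gives alerting something specific to report when the daily run is blocked or at risk of missing its SLA.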
Constraints
- Existing stack is AWS-based and must continue using Snowflake as the warehouse.
- Team size is 3 data engineers; operational complexity should stay moderate.
- Budget allows managed services but not a large always-on Spark cluster.
- Finance metrics are SOX-sensitive, so published tables must be auditable and reproducible.
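The idempotent-rerun and auditability requirements can be illustrated together with a partition-overwrite pattern: each run replaces the whole partition it owns instead of appending to it, and records a content fingerprint so a later rerun can be shown to have reproduced the same output. The sketch below uses in-memory structures purely for illustration. The `table` dict, `audit_log` list, and function names are hypothetical stand-ins; in the warehouse this would likely map to a transactional delete-then-insert or a merge scoped to the partition key, which is an assumption and not something the requirements specify.

```python
import hashlib
import json


def fingerprint(rows: list[dict]) -> str:
    """Order-independent SHA-256 over the rows, for audit comparison."""
    serialized = sorted(json.dumps(row, sort_keys=True) for row in rows)
    return hashlib.sha256("\n".join(serialized).encode("utf-8")).hexdigest()


def publish_partition(
    table: dict[str, list[dict]],
    audit_log: list[dict],
    partition: str,
    rows: list[dict],
    run_id: str,
) -> None:
    """Overwrite one partition and append an audit record for the run."""
    # Replacing, not appending, is what makes a rerun safe to repeat.
    table[partition] = list(rows)
    audit_log.append(
        {
            "partition": partition,
            "run_id": run_id,
            "row_count": len(rows),
            "sha256": fingerprint(rows),
        }
    )


if __name__ == "__main__":
    table: dict[str, list[dict]] = {}
    audit_log: list[dict] = []
    rows = [{"order_id": 1, "revenue": 20}, {"order_id": 2, "revenue": 35}]

    publish_partition(table, audit_log, "2024-01-01", rows, run_id="run-a")
    publish_partition(table, audit_log, "2024-01-01", rows, run_id="run-b")  # rerun

    print(len(table["2024-01-01"]))                            # row count after rerun
    print(audit_log[0]["sha256"] == audit_log[1]["sha256"])    # same inputs, same hash
```

Keeping one audit record per run, instead of overwriting the previous record, preserves the history of who published what and when, while the matching fingerprints give evidence that a rerun or backfill reproduced the earlier result.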