Context
ShopPulse, a mid-market retail SaaS platform, collects application events from its web app, backend services, and payment system. Today, teams rely on ad hoc cron jobs and direct warehouse loads, which create inconsistent metrics, duplicate records, and 8-12 hour reporting delays.
You need to design a production-grade data flow for ShopPulse's application event pipeline that supports both operational monitoring and analytics. Assume the company wants a unified architecture for batch and streaming ingestion into a cloud warehouse, with clear ownership, observability, and recovery procedures.
Scale Requirements
- Sources: Web/mobile events, PostgreSQL OLTP CDC, third-party payment API exports
- Throughput: 120K events/sec peak, 25K events/sec average
- Volume: ~2.5 TB/day raw JSON/CSV, 18B records/month
- Latency: operational dashboards < 2 minutes; curated warehouse models < 15 minutes
- Retention: raw zone 180 days, curated warehouse 3 years
- Availability target: 99.9% pipeline uptime
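The figures above imply a few derived numbers that are useful when sizing storage and alerting. The following is a minimal sketch of that arithmetic, not part of the brief itself. It assumes a 30-day month and decimal terabytes (1 TB = 10^12 bytes); the function and key names are illustrative. The brief does not say how events/sec map to records/month, so the sketch reports the record rate implied by the monthly figure rather than trying to reconcile the two.

```python
# Inputs copied from the scale requirements.
PEAK_EPS = 120_000                   # events/sec at peak
AVG_EPS = 25_000                     # events/sec on average
DAILY_RAW_TB = 2.5                   # raw JSON/CSV per day
RECORDS_PER_MONTH = 18_000_000_000
RAW_RETENTION_DAYS = 180
UPTIME_TARGET = 0.999

# Assumptions (not stated in the brief).
DAYS_PER_MONTH = 30
BYTES_PER_TB = 1e12


def capacity_summary():
    """Derive sizing numbers from the stated scale requirements."""
    monthly_bytes = DAILY_RAW_TB * BYTES_PER_TB * DAYS_PER_MONTH
    seconds_per_month = DAYS_PER_MONTH * 24 * 60 * 60
    minutes_per_month = DAYS_PER_MONTH * 24 * 60
    return {
        # Headroom the streaming tier needs over its average load.
        "peak_to_avg_ratio": PEAK_EPS / AVG_EPS,
        # Steady-state size of the raw zone before compression.
        "raw_zone_tb": DAILY_RAW_TB * RAW_RETENTION_DAYS,
        # Average raw size of one record.
        "avg_record_bytes": monthly_bytes / RECORDS_PER_MONTH,
        # Record rate implied by the monthly record count alone.
        "implied_records_per_sec": RECORDS_PER_MONTH / seconds_per_month,
        # Downtime budget allowed by the availability target.
        "allowed_downtime_min_per_month": (1 - UPTIME_TARGET) * minutes_per_month,
    }


if __name__ == "__main__":
    for name, value in capacity_summary().items():
        print(f"{name}: {value:,.2f}")
```

Under these assumptions the raw zone holds about 450 TB at steady state, and the 99.9% target leaves roughly 43 minutes of downtime per month.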
Requirements
- Design the end-to-end flow from source systems to raw, cleaned, and curated datasets.
- Support both streaming ingestion for product events and batch/CDC ingestion for transactional data.
- Enforce schema validation, deduplication, idempotent loads, and replay capability.
- Build transformations for sessionized events, order facts, and customer dimension tables.
- Orchestrate dependencies between ingestion, quality checks, warehouse loads, and downstream dbt models.
- Define monitoring for freshness, lag, failed loads, and data quality regressions.
- Explain how backfills and late-arriving data are handled without corrupting aggregates.
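One common way to meet the deduplication, idempotency, and late-data requirements together is to deduplicate on a stable event key and then rebuild whole event-time partitions instead of appending to them. The sketch below illustrates that idea in plain Python under stated assumptions: the field names `event_id`, `event_time`, and `ingested_at` are hypothetical, and a real pipeline would apply the same logic in the warehouse or stream processor rather than in memory.

```python
from datetime import datetime, timedelta, timezone


def dedupe(events):
    """Keep one record per event_id: the copy with the latest ingested_at."""
    latest = {}
    for event in events:
        current = latest.get(event["event_id"])
        if current is None or event["ingested_at"] > current["ingested_at"]:
            latest[event["event_id"]] = event
    return list(latest.values())


def partitions_to_rebuild(events):
    """Return the event-time dates touched by a batch, in ascending order.

    Replacing each of these daily partitions in full (delete, then insert)
    means a retried or replayed batch produces the same result, and a
    late-arriving event corrects the aggregate for the day it belongs to
    instead of inflating the day it arrived on.
    """
    return sorted({event["event_time"].date() for event in events})


if __name__ == "__main__":
    arrived = datetime(2024, 1, 2, 0, 5, tzinfo=timezone.utc)
    batch = [
        # Late event: happened on Jan 1, arrived on Jan 2.
        {"event_id": "a", "event_time": datetime(2024, 1, 1, 23, 59, tzinfo=timezone.utc),
         "ingested_at": arrived},
        # Duplicate delivery of the same event one minute later.
        {"event_id": "a", "event_time": datetime(2024, 1, 1, 23, 59, tzinfo=timezone.utc),
         "ingested_at": arrived + timedelta(minutes=1)},
        {"event_id": "b", "event_time": datetime(2024, 1, 2, 0, 1, tzinfo=timezone.utc),
         "ingested_at": arrived},
    ]
    clean = dedupe(batch)
    print(len(clean), partitions_to_rebuild(clean))
```

Partitioning by event time rather than arrival time is the design choice that keeps backfills safe: running the same batch twice rebuilds the same partitions with the same rows.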
Constraints
- AWS is the required cloud platform.
- Team size is 3 data engineers and 1 analytics engineer; operational complexity should be moderate.
- Incremental infrastructure budget is capped at $18K/month.
- PII must be encrypted at rest, and deletion requests must be fulfilled within 72 hours.
- The existing warehouse is Snowflake, and the existing orchestration standard is Apache Airflow 2.x.
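The 72-hour deletion constraint is easier to monitor if each request carries an explicit deadline. The following is a minimal sketch of such a check, assuming each request records a timezone-aware `requested_at` timestamp and, once finished, a `completed_at` timestamp. Both names and the status labels are hypothetical; in practice this logic might run as a scheduled Airflow task that alerts on overdue requests.

```python
from datetime import datetime, timedelta, timezone

# Taken from the constraint above.
DELETION_SLA = timedelta(hours=72)


def deletion_status(requested_at, completed_at=None, now=None):
    """Classify a PII deletion request against the 72-hour deadline.

    Returns "met" or "missed" for completed requests, and "pending" or
    "overdue" for requests that are still open.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    deadline = requested_at + DELETION_SLA
    if completed_at is not None:
        return "met" if completed_at <= deadline else "missed"
    return "pending" if now <= deadline else "overdue"


if __name__ == "__main__":
    requested = datetime(2024, 3, 1, 9, 0, tzinfo=timezone.utc)
    # Open request checked 80 hours after submission.
    print(deletion_status(requested, now=requested + timedelta(hours=80)))
```

Because raw, cleaned, and curated zones can all hold PII, a request should only be marked complete once every zone has been purged.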