Design Customer Event Data Flow

Context

ShopPulse, a mid-market retail SaaS platform, collects application events from its web app, backend services, and payment system. Today, teams rely on ad hoc cron jobs and direct warehouse loads, which create inconsistent metrics, duplicate records, and 8-12 hour reporting delays.

You need to design a production-grade data flow for a typical application pipeline that supports both operational monitoring and analytics. Assume the company wants a unified architecture for batch and streaming ingestion into a cloud warehouse, with clear ownership, observability, and recovery procedures.

Scale Requirements

Sources: Web/mobile events, PostgreSQL OLTP CDC, third-party payment API exports
Throughput: 120K events/sec peak, 25K avg
Daily volume: ~2.5 TB raw JSON/CSV, 18B records/month
Latency: operational dashboards < 2 minutes; curated warehouse models < 15 minutes
Retention: raw zone 180 days, curated warehouse 3 years
Availability target: 99.9% pipeline uptime

Requirements

Design the end-to-end flow from source systems to raw, cleaned, and curated datasets.
Support both streaming ingestion for product events and batch/CDC ingestion for transactional data.
Enforce schema validation, deduplication, idempotent loads, and replay capability.
Build transformations for sessionized events, order facts, and customer dimension tables.
Orchestrate dependencies between ingestion, quality checks, warehouse loads, and downstream dbt models.
Define monitoring for freshness, lag, failed loads, and data quality regressions.
Explain how backfills and late-arriving data are handled without corrupting aggregates.

Constraints

AWS is the required cloud platform.
Team size is 3 data engineers and 1 analytics engineer; operational complexity should be moderate.
Incremental infrastructure budget is capped at $18K/month.
PII must be encrypted at rest and support deletion requests within 72 hours.
Existing warehouse is Snowflake, and existing orchestration standard is Apache Airflow 2.x.

Problem

Context

Scale Requirements

Sources: Web/mobile events, PostgreSQL OLTP CDC, third-party payment API exports
Throughput: 120K events/sec peak, 25K avg
Daily volume: ~2.5 TB raw JSON/CSV, 18B records/month
Latency: operational dashboards < 2 minutes; curated warehouse models < 15 minutes
Retention: raw zone 180 days, curated warehouse 3 years
Availability target: 99.9% pipeline uptime

Requirements

Design the end-to-end flow from source systems to raw, cleaned, and curated datasets.
Support both streaming ingestion for product events and batch/CDC ingestion for transactional data.
Enforce schema validation, deduplication, idempotent loads, and replay capability.
Build transformations for sessionized events, order facts, and customer dimension tables.
Orchestrate dependencies between ingestion, quality checks, warehouse loads, and downstream dbt models.
Define monitoring for freshness, lag, failed loads, and data quality regressions.
Explain how backfills and late-arriving data are handled without corrupting aggregates.

Constraints

AWS is the required cloud platform.
Team size is 3 data engineers and 1 analytics engineer; operational complexity should be moderate.
Incremental infrastructure budget is capped at $18K/month.
PII must be encrypted at rest and support deletion requests within 72 hours.
Existing warehouse is Snowflake, and existing orchestration standard is Apache Airflow 2.x.

Problem

Context

Scale Requirements

Sources: Web/mobile events, PostgreSQL OLTP CDC, third-party payment API exports
Throughput: 120K events/sec peak, 25K avg
Daily volume: ~2.5 TB raw JSON/CSV, 18B records/month
Latency: operational dashboards < 2 minutes; curated warehouse models < 15 minutes
Retention: raw zone 180 days, curated warehouse 3 years
Availability target: 99.9% pipeline uptime

Requirements

Design the end-to-end flow from source systems to raw, cleaned, and curated datasets.
Support both streaming ingestion for product events and batch/CDC ingestion for transactional data.
Enforce schema validation, deduplication, idempotent loads, and replay capability.
Build transformations for sessionized events, order facts, and customer dimension tables.
Orchestrate dependencies between ingestion, quality checks, warehouse loads, and downstream dbt models.
Define monitoring for freshness, lag, failed loads, and data quality regressions.
Explain how backfills and late-arriving data are handled without corrupting aggregates.

Constraints

AWS is the required cloud platform.
Team size is 3 data engineers and 1 analytics engineer; operational complexity should be moderate.
Incremental infrastructure budget is capped at $18K/month.
PII must be encrypted at rest and support deletion requests within 72 hours.
Existing warehouse is Snowflake, and existing orchestration standard is Apache Airflow 2.x.

Problem

Context

Scale Requirements

Sources: Web/mobile events, PostgreSQL OLTP CDC, third-party payment API exports
Throughput: 120K events/sec peak, 25K avg
Daily volume: ~2.5 TB raw JSON/CSV, 18B records/month
Latency: operational dashboards < 2 minutes; curated warehouse models < 15 minutes
Retention: raw zone 180 days, curated warehouse 3 years
Availability target: 99.9% pipeline uptime

Requirements

Design the end-to-end flow from source systems to raw, cleaned, and curated datasets.
Support both streaming ingestion for product events and batch/CDC ingestion for transactional data.
Enforce schema validation, deduplication, idempotent loads, and replay capability.
Build transformations for sessionized events, order facts, and customer dimension tables.
Orchestrate dependencies between ingestion, quality checks, warehouse loads, and downstream dbt models.
Define monitoring for freshness, lag, failed loads, and data quality regressions.
Explain how backfills and late-arriving data are handled without corrupting aggregates.

Constraints

AWS is the required cloud platform.
Team size is 3 data engineers and 1 analytics engineer; operational complexity should be moderate.
Incremental infrastructure budget is capped at $18K/month.
PII must be encrypted at rest and support deletion requests within 72 hours.
Existing warehouse is Snowflake, and existing orchestration standard is Apache Airflow 2.x.

Interview Guides

Problem

Context

Scale Requirements

Requirements

Constraints

Problem

Context

Scale Requirements

Requirements

Constraints

Design Customer Event Data Flow

Problem

Context

Scale Requirements

Requirements

Constraints

Problem

Context

Scale Requirements

Requirements

Constraints