Context
PulseApp, a consumer fintech platform with web and mobile clients, currently lands user activity, profile updates, and account events into Amazon S3 through hourly batch exports. This architecture supports reporting, but product, fraud, and lifecycle teams now need user data available in near real time for dashboards, alerts, and downstream services.
You are asked to design a scalable pipeline that ingests user data continuously, validates and enriches it, stores raw and curated datasets, and serves both analytics and operational consumers without breaking existing batch workflows.
Scale Requirements
- Sources: mobile SDK events, web app events, backend service events, CDC from PostgreSQL user tables
- Throughput: 180K events/sec average, 400K events/sec peak
- Event size: 1-3 KB JSON/Avro
- Latency target: P95 end-to-end under 2 minutes
- Daily volume: ~12 TB raw, ~4 TB curated (see the sizing sketch after this list)
- Retention: 180 days raw, 2 years curated
- Availability: 99.9% pipeline uptime
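As a quick sanity check on these figures, a back-of-envelope sizing sketch in Python (assuming a ~1.5 KB average event and roughly 2:1 compression in the raw zone; both are assumptions, not stated requirements) shows that ~12 TB/day raw is consistent with the stated throughput:

```python
# Back-of-envelope sizing for the stated throughput and volume targets.
# AVG_EVENT_KB and COMPRESSION_RATIO are assumptions, not given requirements.

AVG_EVENT_KB = 1.5          # midpoint of the 1-3 KB event size range
COMPRESSION_RATIO = 2.0     # assumed columnar/compressed landing format
SECONDS_PER_DAY = 86_400

avg_eps, peak_eps = 180_000, 400_000

sustained_mb_s = avg_eps * AVG_EVENT_KB / 1_000
peak_mb_s = peak_eps * AVG_EVENT_KB / 1_000

raw_tb_day = avg_eps * SECONDS_PER_DAY * AVG_EVENT_KB / 1e9
raw_tb_day_compressed = raw_tb_day / COMPRESSION_RATIO

print(f"sustained ingest      : {sustained_mb_s:,.0f} MB/s")        # ~270 MB/s
print(f"peak ingest           : {peak_mb_s:,.0f} MB/s")             # ~600 MB/s
print(f"raw/day, uncompressed : {raw_tb_day:,.1f} TB")              # ~23 TB
print(f"raw/day, compressed   : {raw_tb_day_compressed:,.1f} TB")   # ~12 TB
```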
Requirements
- Design ingestion for both event streams and database change data capture.
- Ensure schema validation, deduplication, and idempotent processing for retries and replays (illustrated in the sketch after this list).
- Support real-time transformations such as user identity resolution, event enrichment, and sessionization.
- Load raw and curated data into a warehouse for analytics and expose low-latency aggregates for operational use cases.
- Support backfills and replay of the last 30 days without corrupting downstream tables.
- Define orchestration, monitoring, alerting, and failure recovery.
- Explain partitioning, storage layout, and data modeling choices.
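To make the deduplication and idempotency expectation concrete, here is a minimal PySpark sketch; the broker address, topic, schema, and S3 paths are illustrative assumptions, not part of the requirements. It drops duplicate event IDs within a watermark window before writing date-partitioned output to the curated zone. Replays older than the watermark would additionally need an upsert keyed on event_id (for example, a MERGE into the warehouse table) so re-runs do not duplicate rows.

```python
# Minimal PySpark sketch: streaming dedup plus partitioned, checkpointed writes.
# Broker, topic, schema, and S3 paths below are assumptions for illustration only.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("pulse-user-events-curated").getOrCreate()

event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("user_id", StringType()),
    StructField("event_type", StringType()),
    StructField("event_ts", TimestampType()),
    StructField("payload", StringType()),
])

events = (
    spark.readStream
    .format("kafka")                                     # or Kinesis via its Spark connector
    .option("kafka.bootstrap.servers", "msk-broker:9092")
    .option("subscribe", "pulse.user_events")
    .load()
    .select(F.from_json(F.col("value").cast("string"), event_schema).alias("e"))
    .select("e.*")
)

deduped = (
    events
    .withWatermark("event_ts", "30 minutes")             # bound dedup state for late duplicates
    .dropDuplicates(["event_id", "event_ts"])            # Spark 3.5+: dropDuplicatesWithinWatermark(["event_id"])
)

(
    deduped
    .withColumn("event_date", F.to_date("event_ts"))
    .writeStream
    .format("parquet")
    .partitionBy("event_date")                           # date partitions scope backfills and pruning
    .option("path", "s3://pulse-curated/user_events/")   # illustrative bucket
    .option("checkpointLocation", "s3://pulse-checkpoints/user_events/")
    .trigger(processingTime="1 minute")                  # micro-batch cadence chosen with the 2-minute P95 target in mind
    .start()
)
```

Date partitioning here is what keeps the 30-day replay requirement tractable: a backfill only rewrites the affected date partitions rather than the whole dataset.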
Constraints
- Existing cloud footprint is AWS; prefer managed services where reasonable.
- Team has strong SQL/Airflow skills, moderate Spark experience, limited Kafka operations experience.
- Must comply with GDPR/CCPA deletion requirements within 72 hours (see the deletion sketch after this list).
- Incremental infrastructure budget is capped at $35K/month.
- Existing batch reports in Snowflake must continue to run during migration.
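One way to meet the 72-hour deletion constraint is a scheduled sweep over the lake tables. The sketch below assumes the raw and curated datasets are stored in a table format with row-level deletes (for example, Apache Iceberg or Delta Lake); all table and column names are hypothetical.

```python
# Hedged sketch: a scheduled (e.g., Airflow-triggered) job that applies pending
# GDPR/CCPA deletion requests. It assumes tables in a format supporting row-level
# DELETE/UPDATE (Iceberg or Delta Lake); table and column names are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("privacy-deletion-sweep").getOrCreate()

for table in ["pulse_raw.user_events", "pulse_curated.user_events"]:
    # Row-level delete is handled by the table format; only affected files are rewritten.
    spark.sql(f"""
        DELETE FROM {table}
        WHERE user_id IN (
            SELECT user_id
            FROM pulse_compliance.deletion_requests
            WHERE status = 'pending'
        )
    """)

# Mark processed requests so the next run is idempotent.
spark.sql("""
    UPDATE pulse_compliance.deletion_requests
    SET status = 'completed'
    WHERE status = 'pending'
""")
```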
Provide the target architecture and key trade-offs, and explain how you would guarantee correctness, scalability, and observability at this scale.