Context
You support a consumer-facing application on Amazon Web Services that emits clickstream events from web, mobile, and backend services. The current pipeline relies on hourly file drops to Amazon S3 and scheduled batch processing, which is too slow for real-time analytics, anomaly detection, and operational dashboards.
You are the engineering manager responsible for designing a new AWS-native pipeline that ingests, validates, enriches, and serves clickstream data in near real time while preserving a durable raw history for replay and backfills.
Scale Requirements
- Users: 80M monthly active users, 12M daily active users
- Peak throughput: 1M events/second during launches and promotions; 250K events/second average
- Event size: 1.5-2.5 KB JSON
- Daily volume: ~20-25 TB of compressed raw data
- Latency target: P95 event-to-queryable latency under 2 minutes
- Retention: 180 days raw in Amazon S3, 2 years curated aggregates in Amazon Redshift
- Availability target: 99.95% ingestion availability across 3 AWS Availability Zones
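The peak figures above translate fairly directly into broker capacity. A rough sizing sketch, assuming Amazon Kinesis Data Streams standard per-shard write limits (1 MiB/s and 1,000 records/s) and the midpoint of the stated event-size range; the headroom factor is an illustrative assumption, not a requirement:

```python
# Rough Kinesis shard sizing from the peak numbers above.
PEAK_EVENTS_PER_SEC = 1_000_000   # launch/promotion peak
AVG_EVENT_BYTES = 2_000           # midpoint of the 1.5-2.5 KB range
SHARD_BYTES_PER_SEC = 1_048_576   # 1 MiB/s write limit per shard
SHARD_RECORDS_PER_SEC = 1_000     # 1,000 records/s write limit per shard

# Ceiling division: shards required by each of the two limits.
shards_by_bytes = -(-PEAK_EVENTS_PER_SEC * AVG_EVENT_BYTES // SHARD_BYTES_PER_SEC)
shards_by_records = -(-PEAK_EVENTS_PER_SEC // SHARD_RECORDS_PER_SEC)

# The byte limit dominates at ~2 KB events: ~1,908 shards at peak.
shards_needed = max(shards_by_bytes, shards_by_records)
print(shards_needed)  # 1908
```

At ~2 KB per event the throughput limit, not the record-count limit, is binding, which is why candidates should size by bytes and then add burst headroom (or use on-demand capacity mode) rather than dividing events/s by 1,000.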
Requirements
- Design a streaming ingestion layer on AWS that can absorb bursty traffic without dropping events.
- Validate schemas, deduplicate retries, and quarantine malformed records without blocking healthy traffic.
- Enrich events with sessionization, device metadata, geo lookup, and user identity joins.
- Store immutable raw data in Amazon S3 and load analytics-ready tables into Amazon Redshift with near-real-time freshness.
- Support replay, backfills, and idempotent reprocessing for a full day of historical traffic.
- Define orchestration, observability, and on-call strategies for lag, data quality regressions, and downstream failures.
- Explain partitioning, checkpointing, and scaling decisions at the broker, compute, and warehouse layers.
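The validate/dedupe/quarantine requirement can be sketched as a single routing step. This is a hypothetical illustration: the field names (`event_id`, `ts`, `user_id`) are assumptions, and a real pipeline would keep dedupe state in a TTL'd keyed store (e.g. Flink state), not an in-memory set:

```python
import json

# Assumed minimal event contract for this sketch.
REQUIRED_FIELDS = {"event_id", "ts", "user_id"}

def route(raw_records):
    """Split a batch into clean events and quarantined records,
    dropping duplicate retries, without blocking healthy traffic."""
    seen, clean, quarantine = set(), [], []
    for raw in raw_records:
        try:
            event = json.loads(raw)
        except json.JSONDecodeError:
            quarantine.append(raw)   # malformed JSON: park it, keep going
            continue
        if not REQUIRED_FIELDS <= event.keys():
            quarantine.append(raw)   # schema violation: park it
            continue
        if event["event_id"] in seen:
            continue                 # duplicate retry: drop silently
        seen.add(event["event_id"])
        clean.append(event)
    return clean, quarantine
```

The key property to probe for in a design discussion is that quarantined records go to a side output (dead-letter stream or S3 prefix) for inspection and replay, rather than failing the whole batch.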
Constraints
- Prefer managed AWS services where possible: e.g., Amazon Kinesis Data Streams, AWS Glue Streaming, Amazon Managed Service for Apache Flink, AWS Step Functions, Amazon Redshift, and Amazon CloudWatch.
- Budget target is under $60K/month incremental spend outside existing S3 and Redshift commitments.
- Must support GDPR/CCPA deletion workflows within 72 hours.
- Downstream consumers include BI dashboards, product analytics, and ML feature generation, so schema evolution and data contracts must be explicit.
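One way to make the "explicit data contracts" constraint concrete is a backward-compatibility check run in CI before a producer ships a new schema version. The schema shape below (`{field: {"type": ..., "required": ...}}`) is an assumption for illustration; in practice this role is often filled by a schema registry's compatibility modes:

```python
def is_backward_compatible(old, new):
    """A new schema is backward compatible with the old one if it keeps
    every existing field's type and drops no required field; it may
    freely add new optional fields."""
    for field, spec in old.items():
        if spec["required"] and field not in new:
            return False   # dropped a required field: breaks consumers
        if field in new and new[field]["type"] != spec["type"]:
            return False   # changed a field's type: breaks consumers
    return True

# Hypothetical v1 -> v2 evolution: adding an optional field is safe.
v1 = {"event_id": {"type": "string", "required": True},
      "ts":       {"type": "long",   "required": True}}
v2 = dict(v1, device={"type": "string", "required": False})
print(is_backward_compatible(v1, v2))  # True
```

Gating deploys on a check like this keeps BI dashboards, product analytics, and ML feature jobs insulated from breaking producer changes.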