Context
You support a consumer-facing application on Amazon Web Services that emits clickstream events from web, mobile, and backend services. The current pipeline relies on hourly file drops to Amazon S3 and scheduled batch processing, which is too slow for real-time analytics, anomaly detection, and operational dashboards.
You are the engineering manager responsible for designing a new AWS-native pipeline that ingests, validates, enriches, and serves clickstream data in near real time while preserving a durable raw history for replay and backfills.
Scale Requirements
- Users: 80M monthly active users, 12M daily active users
- Peak throughput: 1M events/second during launches and promotions; 250K events/second average
- Event size: 1.5-2.5 KB JSON
- Daily volume: ~20-25 TB of compressed raw data
- Latency target: P95 event-to-queryable latency under 2 minutes
- Retention: 180 days raw in Amazon S3, 2 years curated aggregates in Amazon Redshift
- Availability target: 99.95% ingestion availability across 3 AWS Availability Zones
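The peak figures above translate fairly directly into broker capacity. A rough sizing sketch, assuming Amazon Kinesis Data Streams standard per-shard write limits (1 MiB/s and 1,000 records/s) and the midpoint of the stated event-size range; the headroom factor is an illustrative assumption, not a requirement:

```python
# Rough Kinesis shard sizing from the peak numbers above.
PEAK_EVENTS_PER_SEC = 1_000_000   # launch/promotion peak
AVG_EVENT_BYTES = 2_000           # midpoint of the 1.5-2.5 KB range
SHARD_BYTES_PER_SEC = 1_048_576   # 1 MiB/s write limit per shard
SHARD_RECORDS_PER_SEC = 1_000     # 1,000 records/s write limit per shard

# Ceiling division: shards required by each of the two limits.
shards_by_bytes = -(-PEAK_EVENTS_PER_SEC * AVG_EVENT_BYTES // SHARD_BYTES_PER_SEC)
shards_by_records = -(-PEAK_EVENTS_PER_SEC // SHARD_RECORDS_PER_SEC)

# The byte limit dominates at ~2 KB events: ~1,908 shards at peak.
shards_needed = max(shards_by_bytes, shards_by_records)
print(shards_needed)  # 1908
```

At ~2 KB per event the throughput limit, not the record-count limit, is binding, which is why candidates should size by bytes and then add burst headroom (or use on-demand capacity mode) rather than dividing events/s by 1,000.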
Requirements
- Design a streaming ingestion layer on AWS that can absorb bursty traffic without dropping events.
- Validate schemas, deduplicate retries, and quarantine malformed records without blocking healthy traffic.
- Enrich events with sessionization, device metadata, geo lookup, and user identity joins.
- Store immutable raw data in Amazon S3 and load analytics-ready tables into Amazon Redshift with near-real-time freshness.
- Support replay, backfills, and idempotent reprocessing for a full day of historical traffic.
- Define orchestration, observability, and on-call strategies for lag, data quality regressions, and downstream failures.
- Explain partitioning, checkpointing, and scaling decisions at the broker, compute, and warehouse layers.
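The validate/dedupe/quarantine requirement can be sketched as a single routing step. This is a hypothetical illustration: the field names (`event_id`, `ts`, `user_id`) are assumptions, and a real pipeline would keep dedupe state in a TTL'd keyed store (e.g. Flink state), not an in-memory set:

```python
import json

# Assumed minimal event contract for this sketch.
REQUIRED_FIELDS = {"event_id", "ts", "user_id"}

def route(raw_records):
    """Split a batch into clean events and quarantined records,
    dropping duplicate retries, without blocking healthy traffic."""
    seen, clean, quarantine = set(), [], []
    for raw in raw_records:
        try:
            event = json.loads(raw)
        except json.JSONDecodeError:
            quarantine.append(raw)   # malformed JSON: park it, keep going
            continue
        if not REQUIRED_FIELDS <= event.keys():
            quarantine.append(raw)   # schema violation: park it
            continue
        if event["event_id"] in seen:
            continue                 # duplicate retry: drop silently
        seen.add(event["event_id"])
        clean.append(event)
    return clean, quarantine
```

The key property to probe for in a design discussion is that quarantined records go to a side output (dead-letter stream or S3 prefix) for inspection and replay, rather than failing the whole batch.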
Constraints
- Prefer managed AWS services where possible: e.g., Amazon Kinesis Data Streams, AWS Glue Streaming, Amazon Managed Service for Apache Flink, AWS Step Functions, Amazon Redshift, and Amazon CloudWatch.
- Budget target is under $60K/month incremental spend outside existing S3 and Redshift commitments.
- Must support GDPR/CCPA deletion workflows within 72 hours.
- Downstream consumers include BI dashboards, product analytics, and ML feature generation, so schema evolution and data contracts must be explicit.
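One way to make the "explicit data contracts" constraint concrete is a backward-compatibility check run in CI before a producer ships a new schema version. The schema shape below (`{field: {"type": ..., "required": ...}}`) is an assumption for illustration; in practice this role is often filled by a schema registry's compatibility modes:

```python
def is_backward_compatible(old, new):
    """A new schema is backward compatible with the old one if it keeps
    every existing field's type and drops no required field; it may
    freely add new optional fields."""
    for field, spec in old.items():
        if spec["required"] and field not in new:
            return False   # dropped a required field: breaks consumers
        if field in new and new[field]["type"] != spec["type"]:
            return False   # changed a field's type: breaks consumers
    return True

# Hypothetical v1 -> v2 evolution: adding an optional field is safe.
v1 = {"event_id": {"type": "string", "required": True},
      "ts":       {"type": "long",   "required": True}}
v2 = dict(v1, device={"type": "string", "required": False})
print(is_backward_compatible(v1, v2))  # True
```

Gating deploys on a check like this keeps BI dashboards, product analytics, and ML feature jobs insulated from breaking producer changes.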