Context
PulseApp, a consumer fintech platform with web and mobile clients, currently lands user activity, profile updates, and account events into Amazon S3 through hourly batch exports. This architecture supports reporting, but product, fraud, and lifecycle teams now need user data available in near real time for dashboards, alerts, and downstream services.
You are asked to design a scalable pipeline that ingests user data continuously, validates and enriches it, stores raw and curated datasets, and serves both analytics and operational consumers without breaking existing batch workflows.
Scale Requirements
- Sources: mobile SDK events, web app events, backend service events, CDC from PostgreSQL user tables
- Throughput: 180K events/sec average, 400K events/sec peak
- Event size: 1-3 KB JSON/Avro
- Latency target: P95 end-to-end under 2 minutes
- Daily volume: ~12 TB raw, ~4 TB curated (see the sizing sketch after this list)
- Retention: 180 days raw, 2 years curated
- Availability: 99.9% pipeline uptime
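As a quick sanity check on these figures, a back-of-envelope sizing sketch in Python (assuming a ~1.5 KB average event and roughly 2:1 compression in the raw zone; both are assumptions, not stated requirements) shows that ~12 TB/day raw is consistent with the stated throughput:

```python
# Back-of-envelope sizing for the stated throughput and volume targets.
# AVG_EVENT_KB and COMPRESSION_RATIO are assumptions, not given requirements.

AVG_EVENT_KB = 1.5          # midpoint of the 1-3 KB event size range
COMPRESSION_RATIO = 2.0     # assumed columnar/compressed landing format
SECONDS_PER_DAY = 86_400

avg_eps, peak_eps = 180_000, 400_000

sustained_mb_s = avg_eps * AVG_EVENT_KB / 1_000
peak_mb_s = peak_eps * AVG_EVENT_KB / 1_000

raw_tb_day = avg_eps * SECONDS_PER_DAY * AVG_EVENT_KB / 1e9
raw_tb_day_compressed = raw_tb_day / COMPRESSION_RATIO

print(f"sustained ingest      : {sustained_mb_s:,.0f} MB/s")        # ~270 MB/s
print(f"peak ingest           : {peak_mb_s:,.0f} MB/s")             # ~600 MB/s
print(f"raw/day, uncompressed : {raw_tb_day:,.1f} TB")              # ~23 TB
print(f"raw/day, compressed   : {raw_tb_day_compressed:,.1f} TB")   # ~12 TB
```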
Requirements
- Design ingestion for both event streams and database change data capture.
- Ensure schema validation, deduplication, and idempotent processing for retries and replays (illustrated in the sketch after this list).
- Support real-time transformations such as user identity resolution, event enrichment, and sessionization.
- Load raw and curated data into a warehouse for analytics and expose low-latency aggregates for operational use cases.
- Support backfills and replay of the last 30 days without corrupting downstream tables.
- Define orchestration, monitoring, alerting, and failure recovery.
- Explain partitioning, storage layout, and data modeling choices.
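To make the deduplication and idempotency expectation concrete, here is a minimal PySpark sketch; the broker address, topic, schema, and S3 paths are illustrative assumptions, not part of the requirements. It drops duplicate event IDs within a watermark window before writing date-partitioned output to the curated zone. Replays older than the watermark would additionally need an upsert keyed on event_id (for example, a MERGE into the warehouse table) so re-runs do not duplicate rows.

```python
# Minimal PySpark sketch: streaming dedup plus partitioned, checkpointed writes.
# Broker, topic, schema, and S3 paths below are assumptions for illustration only.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("pulse-user-events-curated").getOrCreate()

event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("user_id", StringType()),
    StructField("event_type", StringType()),
    StructField("event_ts", TimestampType()),
    StructField("payload", StringType()),
])

events = (
    spark.readStream
    .format("kafka")                                     # or Kinesis via its Spark connector
    .option("kafka.bootstrap.servers", "msk-broker:9092")
    .option("subscribe", "pulse.user_events")
    .load()
    .select(F.from_json(F.col("value").cast("string"), event_schema).alias("e"))
    .select("e.*")
)

deduped = (
    events
    .withWatermark("event_ts", "30 minutes")             # bound dedup state for late duplicates
    .dropDuplicates(["event_id", "event_ts"])            # Spark 3.5+: dropDuplicatesWithinWatermark(["event_id"])
)

(
    deduped
    .withColumn("event_date", F.to_date("event_ts"))
    .writeStream
    .format("parquet")
    .partitionBy("event_date")                           # date partitions scope backfills and pruning
    .option("path", "s3://pulse-curated/user_events/")   # illustrative bucket
    .option("checkpointLocation", "s3://pulse-checkpoints/user_events/")
    .trigger(processingTime="1 minute")                  # micro-batch cadence chosen with the 2-minute P95 target in mind
    .start()
)
```

Date partitioning here is what keeps the 30-day replay requirement tractable: a backfill only rewrites the affected date partitions rather than the whole dataset.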
Constraints
- Existing cloud footprint is AWS; prefer managed services where reasonable.
- Team has strong SQL/Airflow skills, moderate Spark experience, limited Kafka operations experience.
- Must comply with GDPR/CCPA deletion requirements within 72 hours (see the deletion sketch after this list).
- Incremental infrastructure budget is capped at $35K/month.
- Existing batch reports in Snowflake must continue to run during migration.
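One way to meet the 72-hour deletion constraint is a scheduled sweep over the lake tables. The sketch below assumes the raw and curated datasets are stored in a table format with row-level deletes (for example, Apache Iceberg or Delta Lake); all table and column names are hypothetical.

```python
# Hedged sketch: a scheduled (e.g., Airflow-triggered) job that applies pending
# GDPR/CCPA deletion requests. It assumes tables in a format supporting row-level
# DELETE/UPDATE (Iceberg or Delta Lake); table and column names are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("privacy-deletion-sweep").getOrCreate()

for table in ["pulse_raw.user_events", "pulse_curated.user_events"]:
    # Row-level delete is handled by the table format; only affected files are rewritten.
    spark.sql(f"""
        DELETE FROM {table}
        WHERE user_id IN (
            SELECT user_id
            FROM pulse_compliance.deletion_requests
            WHERE status = 'pending'
        )
    """)

# Mark processed requests so the next run is idempotent.
spark.sql("""
    UPDATE pulse_compliance.deletion_requests
    SET status = 'completed'
    WHERE status = 'pending'
""")
```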
Provide the target architecture and key trade-offs, and explain how you would guarantee correctness, scalability, and observability at this scale.