You are designing a real-time data pipeline for a creator analytics platform that wants to power in-product alerts, trending signals, and recommendation features from fresh engagement data. Today, most metrics are computed in batch, so end users see delays of several hours between a new video event and updated insights. Product and leadership have escalated because creators expect near-real-time feedback on views, watch velocity, and subscriber changes. You need a pipeline that can process streaming events reliably while still supporting replay, backfills, and downstream warehouse analytics.
| Component | Status / Technology |
|---|---|
| Event Sources | Web app events, backend APIs, third-party platform webhooks |
| Ingestion | REST collectors writing JSON to Kafka |
| Processing | Nightly Spark batch jobs only |
| Storage | S3 data lake and Snowflake warehouse |
| Orchestration | Apache Airflow 2.x |
| Serving | Internal APIs and dashboards read from Snowflake |
Scale: ~120K events/sec peak, ~25K avg, 1.5-3 KB/event, 2.5B events/day, freshness target <2 minutes for feature updates, 30-day replay capability, 99.9% pipeline availability.
How would you design this real-time pipeline end to end so that fresh engagement features are available within two minutes while preserving data quality, replayability, and operational simplicity? Explain the architecture and the trade-offs you would make between streaming, batch recovery, and warehouse serving.