Design Real-Time Feature Pipeline

Scenario

You are designing a real-time data pipeline for a creator analytics platform that wants to power in-product alerts, trending signals, and recommendation features from fresh engagement data. Today, most metrics are computed in batch, so end users see delays of several hours between a new video event and updated insights. Product and leadership have escalated because creators expect near-real-time feedback on views, watch velocity, and subscriber changes. You need a pipeline that can process streaming events reliably while still supporting replay, backfills, and downstream warehouse analytics.

Current State

Component	Status / Technology
Event Sources	Web app events, backend APIs, third-party platform webhooks
Ingestion	REST collectors writing JSON to Kafka
Processing	Nightly Spark batch jobs only
Storage	S3 data lake and Snowflake warehouse
Orchestration	Apache Airflow 2.x
Serving	Internal APIs and dashboards read from Snowflake

Scale: ~120K events/sec peak, ~25K avg, 1.5-3 KB/event, 2.5B events/day, freshness target <2 minutes for feature updates, 30-day replay capability, 99.9% pipeline availability.

Question

How would you design this real-time pipeline end to end so that fresh engagement features are available within two minutes while preserving data quality, replayability, and operational simplicity? Explain the architecture and the trade-offs you would make between streaming, batch recovery, and warehouse serving.

Scenario

Current State

Component	Status / Technology
Event Sources	Web app events, backend APIs, third-party platform webhooks
Ingestion	REST collectors writing JSON to Kafka
Processing	Nightly Spark batch jobs only
Storage	S3 data lake and Snowflake warehouse
Orchestration	Apache Airflow 2.x
Serving	Internal APIs and dashboards read from Snowflake

Scale: ~120K events/sec peak, ~25K avg, 1.5-3 KB/event, 2.5B events/day, freshness target <2 minutes for feature updates, 30-day replay capability, 99.9% pipeline availability.

Question

Scenario

Current State

Component	Status / Technology
Event Sources	Web app events, backend APIs, third-party platform webhooks
Ingestion	REST collectors writing JSON to Kafka
Processing	Nightly Spark batch jobs only
Storage	S3 data lake and Snowflake warehouse
Orchestration	Apache Airflow 2.x
Serving	Internal APIs and dashboards read from Snowflake

Scale: ~120K events/sec peak, ~25K avg, 1.5-3 KB/event, 2.5B events/day, freshness target <2 minutes for feature updates, 30-day replay capability, 99.9% pipeline availability.

Question

Scenario

Current State

Component	Status / Technology
Event Sources	Web app events, backend APIs, third-party platform webhooks
Ingestion	REST collectors writing JSON to Kafka
Processing	Nightly Spark batch jobs only
Storage	S3 data lake and Snowflake warehouse
Orchestration	Apache Airflow 2.x
Serving	Internal APIs and dashboards read from Snowflake

Scale: ~120K events/sec peak, ~25K avg, 1.5-3 KB/event, 2.5B events/day, freshness target <2 minutes for feature updates, 30-day replay capability, 99.9% pipeline availability.

Interview Guides

Scenario

Current State

Question

Design Real-Time Feature Pipeline

Scenario

Current State

Question

Your Answer

Design Real-Time Feature Pipeline

Scenario

Current State

Question

Design Real-Time Feature Pipeline

Scenario

Current State

Question

Your Answer