Context
You’re interviewing with the Retention & Growth Engineering team at StreamWave, a global subscription video streaming company (think Netflix-scale). StreamWave has 20M paying subscribers, 6M DAU, and runs churn prevention campaigns (in-app offers, email, push notifications) that are most effective when triggered within minutes of a user showing “churn intent” (e.g., repeated playback failures, cancellation page views, payment retries).
Today, the churn model is retrained weekly on batch features in Snowflake (dbt models on top of event tables). Scoring runs in a daily batch, and the CRM team receives a list of “at-risk users” the next morning. This is too slow: the business wants real-time scoring so that a user who experiences multiple playback failures in the last 10 minutes can be targeted immediately. The ML team already has a model that can run online (a lightweight gradient-boosted model served behind an internal API), but they lack a production-grade feature pipeline that guarantees feature freshness, handles late-arriving events, and maintains online/offline consistency.
You are asked to design the end-to-end data pipeline to generate and serve churn features for real-time inference and also land the same features for training/analytics.
Scale Requirements
- Event throughput: avg 120K events/sec, peak 350K events/sec during prime time
- Event size: 0.5–2KB JSON
- Daily volume: ~8–15TB raw
- Freshness SLO (online features):
  - P50 < 30 seconds, P95 < 2 minutes from event time to feature availability
- Late data:
  - 2–3% of events arrive 5–30 minutes late (mobile offline)
  - rare tail up to 24 hours (TV devices buffering + retries)
- Serving QPS: churn scoring service calls feature store at 20K QPS peak
- Retention:
  - raw events in data lake: 90 days
  - training features in warehouse: 2 years
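As a sanity check, the stated throughput and event-size numbers are consistent with the stated daily raw volume. A quick back-of-envelope calculation (assuming decimal KB/TB):

```python
# Do avg 120K events/sec at 0.5-2KB per event imply ~8-15 TB/day raw?
AVG_EPS = 120_000                      # average events/sec
SECONDS_PER_DAY = 86_400
EVENT_KB_LOW, EVENT_KB_HIGH = 0.5, 2.0

events_per_day = AVG_EPS * SECONDS_PER_DAY        # ~10.4 billion events/day
tb_low = events_per_day * EVENT_KB_LOW / 1e9      # KB -> TB (decimal units)
tb_high = events_per_day * EVENT_KB_HIGH / 1e9

print(f"{events_per_day / 1e9:.1f}B events/day, "
      f"{tb_low:.1f}-{tb_high:.1f} TB/day")
```

The implied 5–21 TB/day range brackets the stated 8–15 TB, with ~1KB as the effective average event size.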
Data Characteristics
Key sources
- Client events (web/mobile/TV): playback_start, playback_error, search, add_to_watchlist, cancel_flow_view, plan_change_view
- Billing events (internal payments system): payment_failed, payment_retried, payment_succeeded
- Customer support (Zendesk-like): ticket_created, ticket_resolved, sentiment_score
Example event schema (client events)
| field | type | notes |
|---|---|---|
| event_id | string | UUID; not always present on older TV clients |
| user_id | string | stable subscriber id |
| device_id | string | may rotate |
| event_type | string | enum |
| event_ts | timestamp | device timestamp (can skew) |
| ingest_ts | timestamp | server receive time |
| attributes | variant/json | error_code, content_id, app_version, etc. |
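Because event_id is not always present on older TV clients, deduplication needs a fallback key. One common approach (a minimal sketch; the `dedup_key` helper and its field choices are illustrative, not part of the spec) is a deterministic surrogate derived from fields that are identical across client retries:

```python
import hashlib
import json

def dedup_key(event: dict) -> str:
    """Stable deduplication key for a client event.

    Uses event_id when present; otherwise derives a deterministic
    surrogate from fields that repeat identically on retry
    (user_id, device_id, event_type, event_ts).
    """
    if event.get("event_id"):
        return event["event_id"]
    basis = json.dumps(
        [event["user_id"], event.get("device_id"),
         event["event_type"], event["event_ts"]],
        separators=(",", ":"),
    )
    return "surrogate:" + hashlib.sha256(basis.encode()).hexdigest()

e = {"user_id": "u1", "device_id": "tv-9",
     "event_type": "playback_error", "event_ts": "2024-05-01T20:15:03Z"}
assert dedup_key(e) == dedup_key(dict(e))  # deterministic across retries
```

The trade-off: two genuinely distinct events with identical fields and the same device timestamp would collapse into one, which is usually acceptable for counting-style churn features.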
Feature examples needed for churn model
- errors_last_10m, errors_last_1h
- cancel_page_views_last_30m
- payment_failures_last_7d
- watch_time_last_24h (incremental)
- distinct_titles_last_7d
- support_tickets_last_30d
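Sliding-window counters like errors_last_10m don't require storing raw events per user. A minimal sketch of one common technique (per-minute buckets; the `MinuteBucketCounter` class is illustrative, not a prescribed design):

```python
class MinuteBucketCounter:
    """Sliding-window count (e.g. errors_last_10m) via per-minute buckets.

    State per user is O(window minutes), not O(events). Late events
    whose minute is still inside the window are applied normally;
    older ones are rejected so the caller can apply the too-late policy.
    """

    def __init__(self, window_minutes: int):
        self.window = window_minutes
        self.buckets = {}  # epoch_minute -> count

    def add(self, event_epoch_sec: int, now_epoch_sec: int) -> bool:
        minute = event_epoch_sec // 60
        if minute <= (now_epoch_sec // 60) - self.window:
            return False  # too late for this window
        self.buckets[minute] = self.buckets.get(minute, 0) + 1
        return True

    def value(self, now_epoch_sec: int) -> int:
        cutoff = (now_epoch_sec // 60) - self.window
        # evict expired minute buckets, then sum the remainder
        self.buckets = {m: c for m, c in self.buckets.items() if m > cutoff}
        return sum(self.buckets.values())
```

Bucket granularity trades precision for state size; minute buckets give at most one minute of error on a 10-minute window, well within the freshness SLO above.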
Requirements
Functional
- Streaming ingestion of all event sources with ordering guarantees per user where feasible.
- Real-time feature computation (sliding windows + incremental aggregates) and online serving to a model scoring service.
- Feature freshness management: define and enforce freshness SLOs; expose feature timestamps to the model.
- Late-arriving data handling with deterministic updates to windowed features and clear policies for “too-late” events.
- Online/offline parity: the same feature definitions should be used for training (Snowflake) and serving (online store).
- Backfills and reprocessing: ability to recompute features for a historical period (e.g., model retraining, bug fix) without corrupting online state.
- Data quality: schema validation, deduplication, anomaly detection (spikes/drops), and auditability.
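The late-data policy called for above can be made deterministic by classifying every event against the stream's watermark. A minimal sketch (the thresholds mirror the stated late-data profile: most late events within ~30 minutes, a rare tail up to 24 hours; the names and tiers are assumptions, not the spec):

```python
from enum import Enum

class LatePolicy(Enum):
    APPLY = "apply"              # inside allowed lateness: update window online
    CORRECT_OFFLINE = "offline"  # too late for online state; fix offline features only
    DROP = "drop"                # beyond max lateness: count in a metric and discard

def classify_lateness(event_ts: float, watermark_ts: float,
                      allowed_lateness_sec: float = 30 * 60,
                      max_lateness_sec: float = 24 * 3600) -> LatePolicy:
    """Deterministic routing for late events relative to the watermark."""
    lag = watermark_ts - event_ts
    if lag <= allowed_lateness_sec:
        return LatePolicy.APPLY
    if lag <= max_lateness_sec:
        return LatePolicy.CORRECT_OFFLINE
    return LatePolicy.DROP
```

The middle tier keeps online state simple (no retractions older than 30 minutes) while still letting offline training features in Snowflake absorb the 24-hour tail, which is one way to satisfy online/offline parity without unbounded online state.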
Non-functional
- Exactly-once or effectively-once semantics for feature updates (no double counting).
- High availability: tolerate AZ failure; RPO near-zero for committed events.
- Security & compliance: GDPR/CCPA deletion within 72 hours across online store, lake, and warehouse.
- Cost constraints: incremental platform cost target <$60K/month.
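The "effectively-once" requirement is usually met by making feature updates idempotent on top of an at-least-once stream. A minimal in-memory sketch of the idea (the `IdempotentCounter` class is illustrative; production state would live in the stream processor's checkpointed state or the online store):

```python
import time

class IdempotentCounter:
    """Effectively-once counting: each event_id is applied at most once,
    so redeliveries from an at-least-once consumer do not double-count.
    Seen ids expire after a TTL so dedup state stays bounded."""

    def __init__(self, ttl_sec=3600.0):
        self.ttl = ttl_sec
        self.count = 0
        self.seen = {}  # event_id -> time it was applied

    def apply(self, event_id, now=None) -> bool:
        now = time.time() if now is None else now
        # evict expired ids so memory stays bounded
        self.seen = {k: t for k, t in self.seen.items() if now - t < self.ttl}
        if event_id in self.seen:
            return False  # duplicate delivery; already counted
        self.seen[event_id] = now
        self.count += 1
        return True
```

The TTL is the honest caveat: duplicates arriving after it has elapsed are not caught, so the TTL should exceed the realistic redelivery window (consumer restarts plus the 30-minute late-data bulk).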
Constraints
- Cloud: AWS. Existing: Kafka (MSK) for clickstream, S3 data lake, Snowflake warehouse, Airflow 2.x, dbt.
- Team: 5 data engineers, 2 ML engineers. Strong Spark skills; moderate Kafka; limited Flink experience.
- Scoring service is deployed on Kubernetes and can call Redis/DynamoDB/Snowflake, but must keep p95 latency < 50ms for feature fetch.
What You Should Design / Explain
- The streaming feature pipeline (topics, partitioning strategy, windowing/watermarking, state management).
- The feature store design: online store schema, keys, TTLs, and how you store feature timestamps.
- How you ensure and measure feature freshness end-to-end (event time vs ingest time, device clock skew, freshness SLIs/SLOs).
- How you handle late events and retractions/updates to previously computed windows.
- How you land offline training features in Snowflake with the same definitions and how you validate parity.
- Orchestration and operational plan: deployment, backfills, rollbacks, monitoring, and on-call runbooks.
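For the freshness question, the end-to-end SLI is naturally defined as the lag from event time to online-feature write time, measured as percentiles and compared against the SLO (P50 < 30s, P95 < 2m). A minimal sketch of the computation (the `freshness_percentiles` helper is illustrative; in production the samples would come from timestamps stored alongside each feature):

```python
import statistics

def freshness_percentiles(samples):
    """Freshness SLI: lag from event time to online-feature availability.

    samples: (event_ts, feature_written_ts) pairs in epoch seconds.
    Returns (p50, p95) lag in seconds, to compare against the SLO of
    p50 < 30s and p95 < 120s. Note: using device event_ts makes clock
    skew part of the measurement; clamping negative lags to zero is a
    common mitigation.
    """
    lags = sorted(max(0.0, written - ev) for ev, written in samples)
    q = statistics.quantiles(lags, n=20)  # 19 cut points at 5% steps
    return q[9], q[18]                    # p50, p95
```

Storing the feature-computation timestamp next to each feature value (as the feature store section requires) is what makes this SLI measurable at serving time as well, and lets the model treat stale features explicitly.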