Context
You’re interviewing with the Retention & Growth Engineering team at StreamWave, a global subscription video streaming company (think Netflix-scale). StreamWave has 20M paying subscribers, 6M DAU, and runs churn prevention campaigns (in-app offers, email, push notifications) that are most effective when triggered within minutes of a user showing “churn intent” (e.g., repeated playback failures, cancellation page views, payment retries).
Today, the churn model is retrained weekly on batch features in Snowflake (dbt models on top of event tables). Scoring runs in a daily batch, and the CRM team receives a list of “at-risk users” the next morning. This is too slow: the business wants real-time scoring so that a user who experiences multiple playback failures in the last 10 minutes can be targeted immediately. The ML team already has a model that can run online (a lightweight gradient-boosted model served behind an internal API), but they lack a production-grade feature pipeline that guarantees feature freshness, handles late-arriving events, and maintains online/offline consistency.
You are asked to design the end-to-end data pipeline to generate and serve churn features for real-time inference and also land the same features for training/analytics.
Scale Requirements
- Event throughput: avg 120K events/sec, peak 350K events/sec during prime time
- Event size: 0.5–2KB JSON
- Daily volume: ~8–15TB raw
- Freshness SLO (online features):
  - P50 < 30 seconds, P95 < 2 minutes from event time to feature availability
- Late data:
  - 2–3% of events arrive 5–30 minutes late (mobile offline)
  - rare tail up to 24 hours (TV devices buffering + retries)
- Serving QPS: churn scoring service calls feature store at 20K QPS peak
- Retention:
  - raw events in data lake: 90 days
  - training features in warehouse: 2 years
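As a sanity check, the stated throughput and event-size numbers are consistent with the stated daily raw volume. A quick back-of-envelope calculation (assuming decimal KB/TB):

```python
# Do avg 120K events/sec at 0.5-2KB per event imply ~8-15 TB/day raw?
AVG_EPS = 120_000                      # average events/sec
SECONDS_PER_DAY = 86_400
EVENT_KB_LOW, EVENT_KB_HIGH = 0.5, 2.0

events_per_day = AVG_EPS * SECONDS_PER_DAY        # ~10.4 billion events/day
tb_low = events_per_day * EVENT_KB_LOW / 1e9      # KB -> TB (decimal units)
tb_high = events_per_day * EVENT_KB_HIGH / 1e9

print(f"{events_per_day / 1e9:.1f}B events/day, "
      f"{tb_low:.1f}-{tb_high:.1f} TB/day")
```

The implied 5–21 TB/day range brackets the stated 8–15 TB, with ~1KB as the effective average event size.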
Data Characteristics
Key sources
- Client events (web/mobile/TV): playback_start, playback_error, search, add_to_watchlist, cancel_flow_view, plan_change_view
- Billing events (internal payments system): payment_failed, payment_retried, payment_succeeded
- Customer support (Zendesk-like): ticket_created, ticket_resolved, sentiment_score
Example event schema (client events)
| field | type | notes |
|---|---|---|
| event_id | string | UUID; not always present on older TV clients |
| user_id | string | stable subscriber id |
| device_id | string | may rotate |
| event_type | string | enum |
| event_ts | timestamp | device timestamp (can skew) |
| ingest_ts | timestamp | server receive time |
| attributes | variant/json | error_code, content_id, app_version, etc. |
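Because event_id is not always present on older TV clients, deduplication needs a fallback key. One common approach (a minimal sketch; the `dedup_key` helper and its field choices are illustrative, not part of the spec) is a deterministic surrogate derived from fields that are identical across client retries:

```python
import hashlib
import json

def dedup_key(event: dict) -> str:
    """Stable deduplication key for a client event.

    Uses event_id when present; otherwise derives a deterministic
    surrogate from fields that repeat identically on retry
    (user_id, device_id, event_type, event_ts).
    """
    if event.get("event_id"):
        return event["event_id"]
    basis = json.dumps(
        [event["user_id"], event.get("device_id"),
         event["event_type"], event["event_ts"]],
        separators=(",", ":"),
    )
    return "surrogate:" + hashlib.sha256(basis.encode()).hexdigest()

e = {"user_id": "u1", "device_id": "tv-9",
     "event_type": "playback_error", "event_ts": "2024-05-01T20:15:03Z"}
assert dedup_key(e) == dedup_key(dict(e))  # deterministic across retries
```

The trade-off: two genuinely distinct events with identical fields and the same device timestamp would collapse into one, which is usually acceptable for counting-style churn features.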
Feature examples needed for churn model
- errors_last_10m, errors_last_1h
- cancel_page_views_last_30m
- payment_failures_last_7d
- watch_time_last_24h (incremental)
- distinct_titles_last_7d
- support_tickets_last_30d
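Sliding-window counters like errors_last_10m don't require storing raw events per user. A minimal sketch of one common technique (per-minute buckets; the `MinuteBucketCounter` class is illustrative, not a prescribed design):

```python
class MinuteBucketCounter:
    """Sliding-window count (e.g. errors_last_10m) via per-minute buckets.

    State per user is O(window minutes), not O(events). Late events
    whose minute is still inside the window are applied normally;
    older ones are rejected so the caller can apply the too-late policy.
    """

    def __init__(self, window_minutes: int):
        self.window = window_minutes
        self.buckets = {}  # epoch_minute -> count

    def add(self, event_epoch_sec: int, now_epoch_sec: int) -> bool:
        minute = event_epoch_sec // 60
        if minute <= (now_epoch_sec // 60) - self.window:
            return False  # too late for this window
        self.buckets[minute] = self.buckets.get(minute, 0) + 1
        return True

    def value(self, now_epoch_sec: int) -> int:
        cutoff = (now_epoch_sec // 60) - self.window
        # evict expired minute buckets, then sum the remainder
        self.buckets = {m: c for m, c in self.buckets.items() if m > cutoff}
        return sum(self.buckets.values())
```

Bucket granularity trades precision for state size; minute buckets give at most one minute of error on a 10-minute window, well within the freshness SLO above.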
Requirements
Functional
- Streaming ingestion of all event sources with ordering guarantees per user where feasible.
- Real-time feature computation (sliding windows + incremental aggregates) and online serving to a model scoring service.
- Feature freshness management: define and enforce freshness SLOs; expose feature timestamps to the model.
- Late-arriving data handling with deterministic updates to windowed features and clear policies for “too-late” events.
- Online/offline parity: the same feature definitions should be used for training (Snowflake) and serving (online store).
- Backfills and reprocessing: ability to recompute features for a historical period (e.g., model retraining, bug fix) without corrupting online state.
- Data quality: schema validation, deduplication, anomaly detection (spikes/drops), and auditability.
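The late-data policy called for above can be made deterministic by classifying every event against the stream's watermark. A minimal sketch (the thresholds mirror the stated late-data profile: most late events within ~30 minutes, a rare tail up to 24 hours; the names and tiers are assumptions, not the spec):

```python
from enum import Enum

class LatePolicy(Enum):
    APPLY = "apply"              # inside allowed lateness: update window online
    CORRECT_OFFLINE = "offline"  # too late for online state; fix offline features only
    DROP = "drop"                # beyond max lateness: count in a metric and discard

def classify_lateness(event_ts: float, watermark_ts: float,
                      allowed_lateness_sec: float = 30 * 60,
                      max_lateness_sec: float = 24 * 3600) -> LatePolicy:
    """Deterministic routing for late events relative to the watermark."""
    lag = watermark_ts - event_ts
    if lag <= allowed_lateness_sec:
        return LatePolicy.APPLY
    if lag <= max_lateness_sec:
        return LatePolicy.CORRECT_OFFLINE
    return LatePolicy.DROP
```

The middle tier keeps online state simple (no retractions older than 30 minutes) while still letting offline training features in Snowflake absorb the 24-hour tail, which is one way to satisfy online/offline parity without unbounded online state.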
Non-functional
- Exactly-once or effectively-once semantics for feature updates (no double counting).
- High availability: tolerate AZ failure; RPO near-zero for committed events.
- Security & compliance: GDPR/CCPA deletion within 72 hours across online store, lake, and warehouse.
- Cost constraints: incremental platform cost target <$60K/month.
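The "effectively-once" requirement is usually met by making feature updates idempotent on top of an at-least-once stream. A minimal in-memory sketch of the idea (the `IdempotentCounter` class is illustrative; production state would live in the stream processor's checkpointed state or the online store):

```python
import time

class IdempotentCounter:
    """Effectively-once counting: each event_id is applied at most once,
    so redeliveries from an at-least-once consumer do not double-count.
    Seen ids expire after a TTL so dedup state stays bounded."""

    def __init__(self, ttl_sec=3600.0):
        self.ttl = ttl_sec
        self.count = 0
        self.seen = {}  # event_id -> time it was applied

    def apply(self, event_id, now=None) -> bool:
        now = time.time() if now is None else now
        # evict expired ids so memory stays bounded
        self.seen = {k: t for k, t in self.seen.items() if now - t < self.ttl}
        if event_id in self.seen:
            return False  # duplicate delivery; already counted
        self.seen[event_id] = now
        self.count += 1
        return True
```

The TTL is the honest caveat: duplicates arriving after it has elapsed are not caught, so the TTL should exceed the realistic redelivery window (consumer restarts plus the 30-minute late-data bulk).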
Constraints
- Cloud: AWS. Existing: Kafka (MSK) for clickstream, S3 data lake, Snowflake warehouse, Airflow 2.x, dbt.
- Team: 5 data engineers, 2 ML engineers. Strong Spark skills; moderate Kafka; limited Flink experience.
- Scoring service is deployed on Kubernetes and can call Redis/DynamoDB/Snowflake, but must keep p95 latency < 50ms for feature fetch.
What You Should Design / Explain
- The streaming feature pipeline (topics, partitioning strategy, windowing/watermarking, state management).
- The feature store design: online store schema, keys, TTLs, and how you store feature timestamps.
- How you ensure and measure feature freshness end-to-end (event time vs ingest time, device clock skew, freshness SLIs/SLOs).
- How you handle late events and retractions/updates to previously computed windows.
- How you land offline training features in Snowflake with the same definitions and how you validate parity.
- Orchestration and operational plan: deployment, backfills, rollbacks, monitoring, and on-call runbooks.
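For the freshness question, the end-to-end SLI is naturally defined as the lag from event time to online-feature write time, measured as percentiles and compared against the SLO (P50 < 30s, P95 < 2m). A minimal sketch of the computation (the `freshness_percentiles` helper is illustrative; in production the samples would come from timestamps stored alongside each feature):

```python
import statistics

def freshness_percentiles(samples):
    """Freshness SLI: lag from event time to online-feature availability.

    samples: (event_ts, feature_written_ts) pairs in epoch seconds.
    Returns (p50, p95) lag in seconds, to compare against the SLO of
    p50 < 30s and p95 < 120s. Note: using device event_ts makes clock
    skew part of the measurement; clamping negative lags to zero is a
    common mitigation.
    """
    lags = sorted(max(0.0, written - ev) for ev, written in samples)
    q = statistics.quantiles(lags, n=20)  # 19 cut points at 5% steps
    return q[9], q[18]                    # p50, p95
```

Storing the feature-computation timestamp next to each feature value (as the feature store section requires) is what makes this SLI measurable at serving time as well, and lets the model treat stale features explicitly.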