Context
You’re interviewing with the Fraud & Risk org at PayWave, a global fintech processing card-not-present payments for ~120K merchants. PayWave’s fraud model scores every authorization in real time; a 20–30 bps lift in fraud detection translates to $40M+/year in prevented chargebacks and reduced manual review.
Today, the team has a mix of ad-hoc feature jobs: some features are computed in daily Spark batch and landed in Snowflake; others are computed in a Kafka consumer and stored in Redis. The result is inconsistent definitions, frequent training/serving skew, and recurring incidents where features silently drop to null after upstream schema changes. Leadership wants a single, governed feature pipeline that produces both offline training datasets and online low-latency features with strong correctness guarantees.
Scale Requirements
- Event throughput: average 600K events/sec, peak 2.5M events/sec during regional sales
- Daily volume: 50B events/day across auths, device signals, merchant events, and user behavior (30–60 TB/day compressed)
- Latency targets:
- Online features: P95 < 10 minutes from event time to availability in online store
- Offline features: hourly partitions available within 30 minutes of hour close
- Late/out-of-order data: 95% within 5 minutes, long tail up to 48 hours (mobile device telemetry and partner feeds)
- Retention:
- Raw immutable events: 180 days
- Offline feature tables: 2 years
- Online feature store: 30 days (rolling)
Data Characteristics
Key sources
| Source | Transport | Example events | Common issues |
|---|---|---|---|
| Authorization events | Kafka | auth_created, auth_decision | duplicates on retries, occasional missing fields |
| Device telemetry | Kafka | device_fingerprint, IP changes | late arrival, high cardinality |
| Merchant catalog & risk config | CDC (Debezium) | merchant_tier, MCC, risk_rules | schema evolution, backfills |
| Chargeback outcomes | Batch S3 drop | chargeback_opened/closed | very late labels (weeks), partial files |
Example feature families you must support
- Velocity features: txn_count_5m, txn_amount_sum_1h per user_id / card_hash / device_id
- Entity graph features: distinct_cards_per_device_24h, distinct_devices_per_merchant_7d
- Aggregations enriched via joins: merchant tier, MCC risk, user segment
- Label joins for training: chargeback outcomes joined to past auths with point-in-time correctness
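To make the point-in-time requirement concrete, here is a minimal pure-Python sketch of the as-of join semantics (names like `as_of_join` and the entity/timestamp shapes are illustrative, not a prescribed implementation): each auth sees only the latest feature value whose effective timestamp is at or before the auth's event time, so no future information leaks into training rows.

```python
from bisect import bisect_right

def as_of_join(feature_history, auth_events):
    """Join each auth to the latest feature value whose effective
    timestamp is <= the auth's event time (point-in-time correctness).

    feature_history: {entity_id: sorted list of (effective_ts, value)}
    auth_events: list of (auth_id, entity_id, event_ts)
    Returns {auth_id: value or None}.
    """
    out = {}
    for auth_id, entity_id, event_ts in auth_events:
        history = feature_history.get(entity_id, [])
        # Bisect on timestamps; take the last snapshot at or before event_ts.
        idx = bisect_right([ts for ts, _ in history], event_ts)
        out[auth_id] = history[idx - 1][1] if idx > 0 else None
    return out

# Example: a user's rolling txn-amount feature changes over time; each
# auth must see only the value that was current when it happened.
history = {"user_1": [(100, 50.0), (200, 75.0), (300, 120.0)]}
auths = [("a1", "user_1", 150), ("a2", "user_1", 300), ("a3", "user_1", 50)]
print(as_of_join(history, auths))  # a1 -> 50.0 (not 75.0), a2 -> 120.0, a3 -> None
```

At scale this becomes a range-partitioned as-of join in Spark/Snowflake, but the correctness rule is exactly the one above.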
Your Task
Design a complete feature engineering pipeline for big data that produces:
- Online features for real-time scoring (low latency, consistent keys)
- Offline feature tables for training and backtesting (point-in-time correct)
- A governed feature registry (definitions, owners, SLAs, lineage)
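One way to picture a registry entry (the field names here are an assumption for illustration, not a mandated schema): a versioned definition that carries the owner, SLA, lineage, and the single piece of logic both offline and online jobs derive from.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FeatureDefinition:
    """One governed registry entry; all field names are illustrative."""
    name: str             # e.g. "txn_count_5m"
    version: int          # bump on any logic change; old versions stay queryable
    entity_key: str       # join key: user_id, card_hash, device_id, ...
    owner: str            # accountable team
    freshness_slo_s: int  # max allowed event-time-to-availability lag
    sql: str              # single definition from which offline and online jobs derive
    upstream: tuple = ()  # lineage: source topics/tables this feature reads

registry = {
    ("txn_count_5m", 1): FeatureDefinition(
        name="txn_count_5m", version=1, entity_key="card_hash",
        owner="fraud-features", freshness_slo_s=600,
        sql="SELECT card_hash, COUNT(*) FROM auths GROUP BY card_hash",
        upstream=("kafka://auths",),
    )
}
```

Keying the registry on (name, version) lets a backfilled or redefined feature coexist with the version currently served online.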
Functional requirements
- Compute streaming aggregations (windowed + stateful) for velocity and distinct-count features.
- Handle late-arriving events up to 48 hours without corrupting aggregates; define what happens beyond the allowed lateness.
- Guarantee idempotency and deduplication for retried events (e.g., auth retries, Kafka replays).
- Support feature backfills (e.g., recompute last 90 days after a bug fix) without breaking online serving.
- Provide point-in-time correctness for offline training datasets (no label leakage; features as-of event time).
- Implement data quality checks: schema validation, null/volume anomalies, distribution drift, and join coverage.
- Provide a single feature definition so the same logic runs offline and online (or explain any controlled divergence).
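The idempotency/deduplication requirement can be sketched in a few lines of pure Python (class and method names are illustrative): dedup on a stable event id, with seen-id state bounded by the same 48-hour lateness horizon, since anything older falls behind the watermark anyway.

```python
import time

class Deduplicator:
    """Drop replayed/retried events by stable event_id.

    Keeps ids only for the allowed-lateness horizon (48h here), so
    state stays bounded; events older than that are handled by the
    late-data path, not the live aggregates.
    """
    def __init__(self, ttl_s=48 * 3600):
        self.ttl_s = ttl_s
        self.seen = {}  # event_id -> first-seen processing time

    def admit(self, event_id, now=None):
        now = time.time() if now is None else now
        # Evict ids past the lateness horizon (a real state store
        # would evict via timers/TTL, not a full rebuild).
        self.seen = {k: t for k, t in self.seen.items() if now - t < self.ttl_s}
        if event_id in self.seen:
            return False  # duplicate: Kafka replay or auth retry
        self.seen[event_id] = now
        return True

dedup = Deduplicator()
assert dedup.admit("auth-123", now=0) is True
assert dedup.admit("auth-123", now=10) is False           # retried auth dropped
assert dedup.admit("auth-123", now=48 * 3600 + 1) is True  # past horizon, state evicted
```

In Spark Structured Streaming the equivalent is `dropDuplicates` on the event id with a watermark bounding the dedup state.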
Non-functional requirements
- Reliability: 99.9% pipeline availability; automatic recovery from transient failures.
- Observability: per-feature freshness, completeness, and error budgets.
- Cost: incremental platform spend capped at $120K/month.
- Compliance: PCI/PII constraints—card PAN never stored; only tokenized hashes; GDPR deletion within 72 hours.
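The "tokenized hashes only" constraint is commonly met with a keyed hash: deterministic, so velocity features can still aggregate per card, but not reversible without the key. A minimal sketch (key management via a KMS is assumed, not shown):

```python
import hashlib
import hmac

def card_hash(pan: str, secret_key: bytes) -> str:
    """Keyed, deterministic token for a PAN: the same card always maps
    to the same hash (usable as a join/aggregation key), but the PAN
    itself is never stored and the hash is not reversible without the key."""
    return hmac.new(secret_key, pan.encode(), hashlib.sha256).hexdigest()

key = b"example-key-from-kms"  # in production, fetched from a KMS, never hardcoded
t1 = card_hash("4111111111111111", key)
t2 = card_hash("4111111111111111", key)
assert t1 == t2      # deterministic across all pipeline stages
assert len(t1) == 64  # only this token ever lands in Kafka, S3, or the stores
```

GDPR deletion then reduces to deleting rows keyed by the token (and, for crypto-shredding designs, rotating per-user keys).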
Constraints
- Cloud: AWS. Existing investments: Kafka (MSK), S3, Snowflake, Airflow, and Spark on EMR.
- Team: 6 data engineers, 2 ML engineers. Strong Spark/SQL; moderate Kafka; limited Flink experience.
- You may introduce dbt and a schema registry, but avoid adopting a brand-new large platform unless justified.
What we’re evaluating
Explain the techniques you'd use for feature engineering on big data, specifically in a pipeline context:
- How you design streaming window/state and choose watermarks
- How you model feature tables (keys, time, versioning)
- How you ensure training/serving parity and point-in-time joins
- How you operationalize quality, backfills, and schema evolution
- How you monitor and handle failures at production scale
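As a reference point for the window/watermark discussion, here is a toy event-time tumbling-window counter in pure Python (class name and the side-output shape are illustrative): the watermark trails the max observed event time by the allowed lateness, and anything behind it is diverted to a side output rather than mutating closed windows.

```python
class TumblingWindowCounter:
    """Event-time tumbling-window count with a watermark.

    Events more than `allowed_lateness_s` behind the max seen event
    time go to a side output instead of corrupting closed windows.
    """
    def __init__(self, window_s, allowed_lateness_s):
        self.window_s = window_s
        self.allowed_lateness_s = allowed_lateness_s
        self.max_event_ts = 0
        self.counts = {}       # window_start -> count
        self.late_events = []  # side output for post-watermark arrivals

    def process(self, event_ts):
        self.max_event_ts = max(self.max_event_ts, event_ts)
        watermark = self.max_event_ts - self.allowed_lateness_s
        if event_ts < watermark:
            self.late_events.append(event_ts)  # beyond allowed lateness
            return
        window_start = (event_ts // self.window_s) * self.window_s
        self.counts[window_start] = self.counts.get(window_start, 0) + 1

agg = TumblingWindowCounter(window_s=300, allowed_lateness_s=600)
for ts in [100, 350, 400, 1500]:
    agg.process(ts)
agg.process(200)  # 1300s behind max(1500): past the watermark, side output
```

The production analogue is Spark Structured Streaming's `withWatermark` plus windowed aggregation, with post-watermark events reconciled by an offline correction job rather than discarded.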