Context
You’re interviewing with the Fraud & Risk org at PayWave, a global fintech processing card-not-present payments for ~120K merchants. PayWave’s fraud model scores every authorization in real time; a 20–30 bps lift in fraud detection translates to $40M+/year in prevented chargebacks and reduced manual review.
Today, the team has a mix of ad-hoc feature jobs: some features are computed in daily Spark batch and landed in Snowflake; others are computed in a Kafka consumer and stored in Redis. The result is inconsistent definitions, frequent training/serving skew, and recurring incidents where features silently drop to null after upstream schema changes. Leadership wants a single, governed feature pipeline that produces both offline training datasets and online low-latency features with strong correctness guarantees.
Scale Requirements
- Event throughput: average 600K events/sec, peak 2.5M events/sec during regional sales
- Daily volume: 50B events/day across auths, device signals, merchant events, and user behavior (30–60 TB/day compressed)
- Latency targets:
- Online features: P95 < 10 minutes from event time to availability in online store
- Offline features: hourly partitions available within 30 minutes of hour close
- Late/out-of-order data: 95% within 5 minutes, long tail up to 48 hours (mobile device telemetry and partner feeds)
- Retention:
- Raw immutable events: 180 days
- Offline feature tables: 2 years
- Online feature store: 30 days (rolling)
Data Characteristics
Key sources
| Source | Transport | Example events | Common issues |
|---|---|---|---|
| Authorization events | Kafka | auth_created, auth_decision | duplicates on retries, occasional missing fields |
| Device telemetry | Kafka | device_fingerprint, IP changes | late arrival, high cardinality |
| Merchant catalog & risk config | CDC (Debezium) | merchant_tier, MCC, risk_rules | schema evolution, backfills |
| Chargeback outcomes | Batch S3 drop | chargeback_opened/closed | very late labels (weeks), partial files |
Example feature families you must support
- Velocity features: txn_count_5m, txn_amount_sum_1h per user_id / card_hash / device_id
- Entity graph features: distinct_cards_per_device_24h, distinct_devices_per_merchant_7d
- Aggregations enriched via joins: merchant tier, MCC risk, user segment
- Label joins for training: chargeback outcomes joined to past auths with point-in-time correctness
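To make the point-in-time requirement concrete, here is a minimal pure-Python sketch of the as-of join semantics (names like `as_of_join` and the entity/timestamp shapes are illustrative, not a prescribed implementation): each auth sees only the latest feature value whose effective timestamp is at or before the auth's event time, so no future information leaks into training rows.

```python
from bisect import bisect_right

def as_of_join(feature_history, auth_events):
    """Join each auth to the latest feature value whose effective
    timestamp is <= the auth's event time (point-in-time correctness).

    feature_history: {entity_id: sorted list of (effective_ts, value)}
    auth_events: list of (auth_id, entity_id, event_ts)
    Returns {auth_id: value or None}.
    """
    out = {}
    for auth_id, entity_id, event_ts in auth_events:
        history = feature_history.get(entity_id, [])
        # Bisect on timestamps; take the last snapshot at or before event_ts.
        idx = bisect_right([ts for ts, _ in history], event_ts)
        out[auth_id] = history[idx - 1][1] if idx > 0 else None
    return out

# Example: a user's rolling txn-amount feature changes over time; each
# auth must see only the value that was current when it happened.
history = {"user_1": [(100, 50.0), (200, 75.0), (300, 120.0)]}
auths = [("a1", "user_1", 150), ("a2", "user_1", 300), ("a3", "user_1", 50)]
print(as_of_join(history, auths))  # a1 -> 50.0 (not 75.0), a2 -> 120.0, a3 -> None
```

At scale this becomes a range-partitioned as-of join in Spark/Snowflake, but the correctness rule is exactly the one above.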
Your Task
Design a complete feature engineering pipeline for big data that produces:
- Online features for real-time scoring (low latency, consistent keys)
- Offline feature tables for training and backtesting (point-in-time correct)
- A governed feature registry (definitions, owners, SLAs, lineage)
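One way to picture a registry entry (the field names here are an assumption for illustration, not a mandated schema): a versioned definition that carries the owner, SLA, lineage, and the single piece of logic both offline and online jobs derive from.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FeatureDefinition:
    """One governed registry entry; all field names are illustrative."""
    name: str             # e.g. "txn_count_5m"
    version: int          # bump on any logic change; old versions stay queryable
    entity_key: str       # join key: user_id, card_hash, device_id, ...
    owner: str            # accountable team
    freshness_slo_s: int  # max allowed event-time-to-availability lag
    sql: str              # single definition from which offline and online jobs derive
    upstream: tuple = ()  # lineage: source topics/tables this feature reads

registry = {
    ("txn_count_5m", 1): FeatureDefinition(
        name="txn_count_5m", version=1, entity_key="card_hash",
        owner="fraud-features", freshness_slo_s=600,
        sql="SELECT card_hash, COUNT(*) FROM auths GROUP BY card_hash",
        upstream=("kafka://auths",),
    )
}
```

Keying the registry on (name, version) lets a backfilled or redefined feature coexist with the version currently served online.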
Functional requirements
- Compute streaming aggregations (windowed + stateful) for velocity and distinct-count features.
- Handle late-arriving events up to 48 hours without corrupting aggregates; define what happens beyond the allowed lateness.
- Guarantee idempotency and deduplication for retried events (e.g., auth retries, Kafka replays).
- Support feature backfills (e.g., recompute last 90 days after a bug fix) without breaking online serving.
- Provide point-in-time correctness for offline training datasets (no label leakage; features as-of event time).
- Implement data quality checks: schema validation, null/volume anomalies, distribution drift, and join coverage.
- Provide a single feature definition so the same logic runs offline and online (or explain any controlled divergence).
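The idempotency/deduplication requirement can be sketched in a few lines of pure Python (class and method names are illustrative): dedup on a stable event id, with seen-id state bounded by the same 48-hour lateness horizon, since anything older falls behind the watermark anyway.

```python
import time

class Deduplicator:
    """Drop replayed/retried events by stable event_id.

    Keeps ids only for the allowed-lateness horizon (48h here), so
    state stays bounded; events older than that are handled by the
    late-data path, not the live aggregates.
    """
    def __init__(self, ttl_s=48 * 3600):
        self.ttl_s = ttl_s
        self.seen = {}  # event_id -> first-seen processing time

    def admit(self, event_id, now=None):
        now = time.time() if now is None else now
        # Evict ids past the lateness horizon (a real state store
        # would evict via timers/TTL, not a full rebuild).
        self.seen = {k: t for k, t in self.seen.items() if now - t < self.ttl_s}
        if event_id in self.seen:
            return False  # duplicate: Kafka replay or auth retry
        self.seen[event_id] = now
        return True

dedup = Deduplicator()
assert dedup.admit("auth-123", now=0) is True
assert dedup.admit("auth-123", now=10) is False           # retried auth dropped
assert dedup.admit("auth-123", now=48 * 3600 + 1) is True  # past horizon, state evicted
```

In Spark Structured Streaming the equivalent is `dropDuplicates` on the event id with a watermark bounding the dedup state.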
Non-functional requirements
- Reliability: 99.9% pipeline availability; automatic recovery from transient failures.
- Observability: per-feature freshness, completeness, and error budgets.
- Cost: incremental platform spend capped at $120K/month.
- Compliance: PCI/PII constraints—card PAN never stored; only tokenized hashes; GDPR deletion within 72 hours.
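The "tokenized hashes only" constraint is commonly met with a keyed hash: deterministic, so velocity features can still aggregate per card, but not reversible without the key. A minimal sketch (key management via a KMS is assumed, not shown):

```python
import hashlib
import hmac

def card_hash(pan: str, secret_key: bytes) -> str:
    """Keyed, deterministic token for a PAN: the same card always maps
    to the same hash (usable as a join/aggregation key), but the PAN
    itself is never stored and the hash is not reversible without the key."""
    return hmac.new(secret_key, pan.encode(), hashlib.sha256).hexdigest()

key = b"example-key-from-kms"  # in production, fetched from a KMS, never hardcoded
t1 = card_hash("4111111111111111", key)
t2 = card_hash("4111111111111111", key)
assert t1 == t2      # deterministic across all pipeline stages
assert len(t1) == 64  # only this token ever lands in Kafka, S3, or the stores
```

GDPR deletion then reduces to deleting rows keyed by the token (and, for crypto-shredding designs, rotating per-user keys).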
Constraints
- Cloud: AWS. Existing investments: Kafka (MSK), S3, Snowflake, Airflow, and Spark on EMR.
- Team: 6 data engineers, 2 ML engineers. Strong Spark/SQL; moderate Kafka; limited Flink experience.
- You may introduce dbt and a schema registry, but avoid adopting a brand-new large platform unless justified.
What we’re evaluating
Explain the techniques you'd use for feature engineering on big data, specifically in a pipeline context:
- How you design streaming window/state and choose watermarks
- How you model feature tables (keys, time, versioning)
- How you ensure training/serving parity and point-in-time joins
- How you operationalize quality, backfills, and schema evolution
- How you monitor and handle failures at production scale
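As a reference point for the window/watermark discussion, here is a toy event-time tumbling-window counter in pure Python (class name and the side-output shape are illustrative): the watermark trails the max observed event time by the allowed lateness, and anything behind it is diverted to a side output rather than mutating closed windows.

```python
class TumblingWindowCounter:
    """Event-time tumbling-window count with a watermark.

    Events more than `allowed_lateness_s` behind the max seen event
    time go to a side output instead of corrupting closed windows.
    """
    def __init__(self, window_s, allowed_lateness_s):
        self.window_s = window_s
        self.allowed_lateness_s = allowed_lateness_s
        self.max_event_ts = 0
        self.counts = {}       # window_start -> count
        self.late_events = []  # side output for post-watermark arrivals

    def process(self, event_ts):
        self.max_event_ts = max(self.max_event_ts, event_ts)
        watermark = self.max_event_ts - self.allowed_lateness_s
        if event_ts < watermark:
            self.late_events.append(event_ts)  # beyond allowed lateness
            return
        window_start = (event_ts // self.window_s) * self.window_s
        self.counts[window_start] = self.counts.get(window_start, 0) + 1

agg = TumblingWindowCounter(window_s=300, allowed_lateness_s=600)
for ts in [100, 350, 400, 1500]:
    agg.process(ts)
agg.process(200)  # 1300s behind max(1500): past the watermark, side output
```

The production analogue is Spark Structured Streaming's `withWatermark` plus windowed aggregation, with post-watermark events reconciled by an offline correction job rather than discarded.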