Design AI Training Data Pipeline

Context

NovaMind builds recommendation and ranking models for a consumer marketplace. Today, feature generation and training data assembly run as separate daily batch jobs on S3, which causes stale features, inconsistent labels, and frequent train/serve skew when models are retrained.

You need to design a production-grade data pipeline that supports both batch retraining and near-real-time feature updates for AI models, while preserving lineage, quality checks, and reproducibility.

Scale Requirements

Sources: application PostgreSQL CDC, mobile/web event streams, and third-party catalog feeds
Throughput: 180K events/sec peak, 35K events/sec average
Daily volume: ~6 TB raw JSON/Avro per day
Latency: online features available in < 2 minutes; batch training dataset ready by 6 AM UTC
Retention: raw data 180 days, curated features 2 years, training snapshots 1 year
Model cadence: 12 model retrains/day across ranking, fraud, and personalization use cases

Requirements

Ingest CDC and event data with ordered, replayable delivery for downstream feature computation.
Build a unified pipeline for batch and streaming transformations so feature definitions are consistent across training and serving.
Implement schema validation, deduplication, late-arrival handling, and data quality checks before data reaches feature tables.
Produce versioned training datasets with point-in-time correct joins to avoid label leakage.
Orchestrate retraining dependencies across ingestion, transformation, validation, and dataset publication.
Provide monitoring for freshness, completeness, cost, and pipeline failures.
Support backfills for 90 days without corrupting current production tables.

Constraints

Infrastructure must stay on AWS and use the existing S3 data lake.
Team has strong SQL/Airflow experience but limited Flink expertise.
Incremental platform budget is capped at $40K/month.
PII must be encrypted at rest and deleted within 30 days of user deletion requests.
Downstream ML teams require reproducible datasets and immutable snapshot references for audits.

Context

Scale Requirements

Sources: application PostgreSQL CDC, mobile/web event streams, and third-party catalog feeds
Throughput: 180K events/sec peak, 35K events/sec average
Daily volume: ~6 TB raw JSON/Avro per day
Latency: online features available in < 2 minutes; batch training dataset ready by 6 AM UTC
Retention: raw data 180 days, curated features 2 years, training snapshots 1 year
Model cadence: 12 model retrains/day across ranking, fraud, and personalization use cases

Requirements

Ingest CDC and event data with ordered, replayable delivery for downstream feature computation.
Build a unified pipeline for batch and streaming transformations so feature definitions are consistent across training and serving.
Implement schema validation, deduplication, late-arrival handling, and data quality checks before data reaches feature tables.
Produce versioned training datasets with point-in-time correct joins to avoid label leakage.
Orchestrate retraining dependencies across ingestion, transformation, validation, and dataset publication.
Provide monitoring for freshness, completeness, cost, and pipeline failures.
Support backfills for 90 days without corrupting current production tables.

Constraints

Infrastructure must stay on AWS and use the existing S3 data lake.
Team has strong SQL/Airflow experience but limited Flink expertise.
Incremental platform budget is capped at $40K/month.
PII must be encrypted at rest and deleted within 30 days of user deletion requests.
Downstream ML teams require reproducible datasets and immutable snapshot references for audits.

Context

Scale Requirements

Sources: application PostgreSQL CDC, mobile/web event streams, and third-party catalog feeds
Throughput: 180K events/sec peak, 35K events/sec average
Daily volume: ~6 TB raw JSON/Avro per day
Latency: online features available in < 2 minutes; batch training dataset ready by 6 AM UTC
Retention: raw data 180 days, curated features 2 years, training snapshots 1 year
Model cadence: 12 model retrains/day across ranking, fraud, and personalization use cases

Requirements

Ingest CDC and event data with ordered, replayable delivery for downstream feature computation.
Build a unified pipeline for batch and streaming transformations so feature definitions are consistent across training and serving.
Implement schema validation, deduplication, late-arrival handling, and data quality checks before data reaches feature tables.
Produce versioned training datasets with point-in-time correct joins to avoid label leakage.
Orchestrate retraining dependencies across ingestion, transformation, validation, and dataset publication.
Provide monitoring for freshness, completeness, cost, and pipeline failures.
Support backfills for 90 days without corrupting current production tables.

Constraints

Infrastructure must stay on AWS and use the existing S3 data lake.
Team has strong SQL/Airflow experience but limited Flink expertise.
Incremental platform budget is capped at $40K/month.
PII must be encrypted at rest and deleted within 30 days of user deletion requests.
Downstream ML teams require reproducible datasets and immutable snapshot references for audits.

Context

Scale Requirements

Sources: application PostgreSQL CDC, mobile/web event streams, and third-party catalog feeds
Throughput: 180K events/sec peak, 35K events/sec average
Daily volume: ~6 TB raw JSON/Avro per day
Latency: online features available in < 2 minutes; batch training dataset ready by 6 AM UTC
Retention: raw data 180 days, curated features 2 years, training snapshots 1 year
Model cadence: 12 model retrains/day across ranking, fraud, and personalization use cases

Requirements

Ingest CDC and event data with ordered, replayable delivery for downstream feature computation.
Build a unified pipeline for batch and streaming transformations so feature definitions are consistent across training and serving.
Implement schema validation, deduplication, late-arrival handling, and data quality checks before data reaches feature tables.
Produce versioned training datasets with point-in-time correct joins to avoid label leakage.
Orchestrate retraining dependencies across ingestion, transformation, validation, and dataset publication.
Provide monitoring for freshness, completeness, cost, and pipeline failures.
Support backfills for 90 days without corrupting current production tables.

Constraints

Infrastructure must stay on AWS and use the existing S3 data lake.
Team has strong SQL/Airflow experience but limited Flink expertise.
Incremental platform budget is capped at $40K/month.
PII must be encrypted at rest and deleted within 30 days of user deletion requests.
Downstream ML teams require reproducible datasets and immutable snapshot references for audits.

Interview Guides

Context

Scale Requirements

Requirements

Constraints

Design AI Training Data Pipeline

Context

Scale Requirements

Requirements

Constraints

Your Answer

Design AI Training Data Pipeline

Context

Scale Requirements

Requirements

Constraints

Design AI Training Data Pipeline

Context

Scale Requirements

Requirements

Constraints

Your Answer