Context
NovaMind builds recommendation and ranking models for a consumer marketplace. Today, feature generation and training data assembly run as separate daily batch jobs on S3, which causes stale features, inconsistent labels, and frequent train/serve skew when models are retrained.
You need to design a production-grade data pipeline that supports both batch retraining and near-real-time feature updates for AI models, while preserving lineage, quality checks, and reproducibility.
Scale Requirements
- Sources: application PostgreSQL CDC, mobile/web event streams, and third-party catalog feeds
- Throughput: 180K events/sec peak, 35K events/sec average
- Daily volume: ~6 TB of raw JSON/Avro
- Latency: online features available in < 2 minutes; batch training dataset ready by 6 AM UTC
- Retention: raw data 180 days, curated features 2 years, training snapshots 1 year
- Model cadence: 12 model retrains/day across ranking, fraud, and personalization use cases
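The scale figures above imply a few derived numbers that are useful when sizing the ingestion tier and the raw S3 zone. The arithmetic below is a rough sketch, not a capacity plan. It assumes decimal units (1 TB = 10**12 bytes), assumes the ~6 TB/day figure corresponds to the 35K events/sec average, and ignores compression and replication; none of these assumptions are stated in the brief.

```python
# Back-of-envelope sizing derived from the stated scale requirements.
# Assumptions (not from the brief): decimal units, the daily volume matches
# the average event rate, and no compression or replication is applied.

SECONDS_PER_DAY = 86_400
TB = 10**12  # bytes

avg_events_per_sec = 35_000
peak_events_per_sec = 180_000
raw_bytes_per_day = 6 * TB
raw_retention_days = 180

# Total events per day at the average rate.
events_per_day = avg_events_per_sec * SECONDS_PER_DAY

# Implied average size of one raw event.
avg_event_bytes = raw_bytes_per_day / events_per_day

# Ingest bandwidth needed to absorb the peak rate at that event size.
peak_bytes_per_sec = peak_events_per_sec * avg_event_bytes

# Raw data held at steady state under the 180-day retention rule.
raw_retained_tb = raw_bytes_per_day * raw_retention_days / TB

# Headroom factor the streaming tier must absorb over the average.
peak_to_avg_ratio = peak_events_per_sec / avg_events_per_sec

print(f"events/day:        {events_per_day:,}")
print(f"avg event size:    {avg_event_bytes:,.0f} bytes")
print(f"peak ingest:       {peak_bytes_per_sec / 1e6:,.0f} MB/s")
print(f"raw retained:      {raw_retained_tb:,.0f} TB")
print(f"peak/avg ratio:    {peak_to_avg_ratio:.1f}x")
```

Under these assumptions the figures work out to about 3.02 billion events per day, roughly 2 KB per event, about 357 MB/s of ingest at peak, and about 1,080 TB of raw data at full retention.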
Requirements
- Ingest CDC and event data with ordered, replayable delivery for downstream feature computation.
- Build a unified pipeline for batch and streaming transformations so feature definitions are consistent across training and serving.
- Implement schema validation, deduplication, late-arrival handling, and data quality checks before data reaches feature tables.
- Produce versioned training datasets with point-in-time correct joins to avoid label leakage.
- Orchestrate retraining dependencies across ingestion, transformation, validation, and dataset publication.
- Provide monitoring for freshness, completeness, cost, and pipeline failures.
- Support 90-day backfills without corrupting current production tables.
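The point-in-time requirement above can be illustrated with a small in-memory sketch: for each label event, pick the latest feature value that was already effective at the label's timestamp, and never a later one. The record types, field names, and sample data are hypothetical and chosen only for illustration; a production pipeline would express the same rule in its SQL or streaming engine rather than in plain Python.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class FeatureVersion:  # hypothetical record shape
    entity_id: str
    effective_ts: int  # epoch seconds at which the value became available
    value: float


@dataclass(frozen=True)
class LabelEvent:  # hypothetical record shape
    entity_id: str
    event_ts: int  # epoch seconds at which the labelled event happened
    label: int


def build_index(versions: List[FeatureVersion]) -> Dict[str, Tuple[List[int], List[float]]]:
    """Group feature versions by entity, sorted by effective timestamp."""
    index: Dict[str, Tuple[List[int], List[float]]] = {}
    for v in sorted(versions, key=lambda v: v.effective_ts):
        ts_list, val_list = index.setdefault(v.entity_id, ([], []))
        ts_list.append(v.effective_ts)
        val_list.append(v.value)
    return index


def point_in_time_join(
    labels: List[LabelEvent], versions: List[FeatureVersion]
) -> List[Tuple[str, int, Optional[float], int]]:
    """Attach to each label the latest feature value effective at or before it."""
    index = build_index(versions)
    rows = []
    for lab in labels:
        ts_list, val_list = index.get(lab.entity_id, ([], []))
        # bisect_right returns the count of versions with effective_ts <= event_ts,
        # so pos - 1 is the most recent version that does not look into the future.
        pos = bisect_right(ts_list, lab.event_ts)
        value = val_list[pos - 1] if pos > 0 else None
        rows.append((lab.entity_id, lab.event_ts, value, lab.label))
    return rows


versions = [
    FeatureVersion("u1", 100, 0.2),
    FeatureVersion("u1", 200, 0.9),
    FeatureVersion("u2", 150, 0.5),
]
labels = [
    LabelEvent("u1", 150, 1),  # sees the value from ts=100, not ts=200
    LabelEvent("u1", 250, 0),  # sees the value from ts=200
    LabelEvent("u2", 100, 1),  # no feature existed yet, so the value is None
]
rows = point_in_time_join(labels, versions)
for row in rows:
    print(row)
```

A naive "latest value per entity" join would hand the first label the 0.9 value written at ts=200, which is exactly the leakage this requirement rules out. If a feature written at the same timestamp as the label should also be excluded, `bisect_left` gives the strict variant.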
Constraints
- Infrastructure must stay on AWS and use the existing S3 data lake.
- Team has strong SQL/Airflow experience but limited Flink expertise.
- Incremental platform budget is capped at $40K/month.
- PII must be encrypted at rest and deleted within 30 days of a user deletion request.
- Downstream ML teams require reproducible datasets and immutable snapshot references for audits.
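One possible way to obtain an immutable snapshot reference, offered as a sketch and not as a prescribed design, is to derive the snapshot identifier from a hash of a manifest describing exactly what went into the dataset. Any change to the inputs then yields a different identifier, and the same inputs always yield the same one. The manifest fields, bucket name, and paths below are hypothetical; the brief does not specify a manifest format.

```python
import hashlib
import json


def snapshot_id(manifest: dict) -> str:
    """Return a deterministic SHA-256 identifier for a snapshot manifest."""
    # Sorting keys and fixing separators makes the serialization canonical,
    # so logically equal manifests always hash to the same identifier.
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical manifest: every field name and path here is illustrative.
manifest = {
    "dataset": "ranking_training",
    "as_of": "2024-01-15T06:00:00Z",
    "feature_definitions_version": "v12",
    "input_partitions": [
        "s3://example-bucket/curated/features/dt=2024-01-14/",
        "s3://example-bucket/curated/labels/dt=2024-01-14/",
    ],
}

# Same content with keys in a different order.
reordered = {key: manifest[key] for key in reversed(list(manifest))}

# Same dataset with one input partition changed.
changed = dict(manifest, input_partitions=manifest["input_partitions"][:1])

ref = snapshot_id(manifest)
print(ref)
print(snapshot_id(reordered) == ref)  # key order does not affect the identifier
print(snapshot_id(changed) == ref)    # changing an input changes the identifier
```

Storing the manifest next to the data and recording the identifier with each trained model would let an audit trace a model back to its exact inputs, provided the referenced partitions are themselves never overwritten.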