Context
AcmeRisk, a fintech company, trains fraud detection models weekly but deploys them manually through ad hoc scripts and ticket-based handoffs. The result is inconsistent model artifacts, missing lineage, and slow rollbacks when a bad model reaches production.
You need to design a production-grade data pipeline that moves trained models from the ML training environment into the broader software ecosystem: batch scoring jobs, a low-latency online inference service, monitoring tables, and downstream analytics. The company already uses AWS, Airflow, S3, Docker, Kubernetes, and Snowflake.
Scale Requirements
- Training output: 20 model candidates/day, each artifact 200 MB–1.5 GB
- Online traffic: 8K predictions/sec average, 25K/sec peak
- Batch scoring: 120M transactions/day, SLA < 2 hours (see the sizing sketch after this list)
- Deployment latency: approved model available for online serving in < 15 minutes
- Retention: model artifacts and metadata retained for 1 year
- Availability target: 99.9% for inference endpoints
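A quick back-of-the-envelope pass over these numbers pins down most of the capacity math. The per-pod QPS figure and the 30% headroom factor below are assumptions to be replaced with load-tested values, not measurements:

```python
# Back-of-the-envelope sizing from the scale requirements above.
# ASSUMPTION: ~1,000 predictions/sec per inference pod and 30% headroom;
# load-test the real model before committing to a node count.

BATCH_TXNS_PER_DAY = 120_000_000
BATCH_SLA_SECONDS = 2 * 3600
sustained_batch_tps = BATCH_TXNS_PER_DAY / BATCH_SLA_SECONDS
print(f"Batch scoring must sustain ~{sustained_batch_tps:,.0f} txn/sec")  # ~16,667

PEAK_ONLINE_QPS = 25_000
ASSUMED_QPS_PER_POD = 1_000        # assumption -- confirm via load test
HEADROOM = 1.3                     # assumption -- buffer for failover/spikes
peak_pods = -(-int(PEAK_ONLINE_QPS * HEADROOM) // ASSUMED_QPS_PER_POD)  # ceil
print(f"Online serving needs ~{peak_pods} pods at peak (with headroom)")  # ~33

# Worst-case artifact storage over the 1-year retention window.
CANDIDATES_PER_DAY = 20
MAX_ARTIFACT_GB = 1.5
retention_tb = CANDIDATES_PER_DAY * MAX_ARTIFACT_GB * 365 / 1024
print(f"Upper bound on artifact storage: ~{retention_tb:.1f} TB")  # ~10.7 TB
```

Note that the required batch rate (~16.7K txn/sec sustained) is well above even peak online traffic, which argues for set-based scoring inside Snowflake or a parallel batch job rather than routing batch traffic through the online endpoint.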
Requirements
- Design an ETL/ELT-style deployment pipeline that ingests model artifacts, validation reports, and metadata from training jobs (see the DAG sketch after this list).
- Register versioned models and promote them through staging and production with reproducible lineage.
- Support both online serving on Kubernetes and batch scoring pipelines writing results to Snowflake.
- Include automated validation gates: schema checks, feature compatibility, performance thresholds, and canary deployment checks.
- Ensure idempotent deployments, rollback support, and auditability for compliance reviews (see the deploy/rollback sketch after this list).
- Expose deployment status and model health to engineering and data teams.
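To make the ingestion, promotion, and validation-gate requirements concrete, here is a minimal Airflow 2 TaskFlow sketch of the deployment DAG. Every task body, bucket name, version string, and threshold (acmerisk-models, AUC >= 0.92, the 5% canary slice) is an illustrative assumption, not existing AcmeRisk infrastructure:

```python
# deploy_model_dag.py -- skeleton of the model deployment pipeline.
# Names, buckets, and thresholds below are illustrative assumptions.
from datetime import datetime
from airflow.decorators import dag, task

@dag(schedule=None, start_date=datetime(2024, 1, 1), catchup=False,
     tags=["model-deployment"])
def deploy_model():

    @task
    def ingest_artifacts(run_id: str) -> dict:
        # Pull the model artifact, validation report, and metadata the
        # training job dropped in S3; verify checksums for integrity.
        return {"model_uri": f"s3://acmerisk-models/{run_id}/model.tar.gz",
                "report_uri": f"s3://acmerisk-models/{run_id}/report.json"}

    @task
    def validation_gates(candidate: dict) -> dict:
        # Gate 1: output schema matches the serving contract.
        # Gate 2: feature names/dtypes match the dbt-managed definitions.
        # Gate 3: offline metrics clear thresholds (e.g. AUC >= 0.92 -- assumed).
        # Raising here fails the run; Airflow keeps the failure for audit.
        return candidate

    @task
    def register_version(candidate: dict) -> str:
        # Write an immutable, versioned registry record (artifact URI,
        # git SHA, training-data snapshot, validation report) for lineage.
        return "fraud-model:v2024.01.15-a3f9"  # hypothetical version id

    @task
    def canary_deploy(version: str) -> str:
        # Route a small slice of live traffic (e.g. 5% -- assumed) to the
        # new version; compare error rate and score distribution to baseline.
        return version

    @task
    def promote_to_production(version: str) -> None:
        # Flip the production alias to the canary-verified version and
        # emit a deployment event to the monitoring tables.
        ...

    promote_to_production(canary_deploy(register_version(
        validation_gates(ingest_artifacts("{{ run_id }}")))))

deploy_model()
```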
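For the Kubernetes serving side, one way to get both idempotency and fast rollback is to treat a deployment as "point the Deployment's image at an immutable version tag": re-applying the same version is a no-op, and rollback is just re-pointing at the previous tag. A sketch using the official kubernetes Python client; the Deployment name, namespace, container name, and ECR path are assumptions:

```python
# Idempotent deploy / rollback for the online inference service.
# ASSUMPTIONS: a Deployment named "fraud-scorer" (container of the same
# name) in namespace "ml-serving"; images tagged with immutable versions.
from kubernetes import client, config

DEPLOYMENT = "fraud-scorer"
NAMESPACE = "ml-serving"
IMAGE_REPO = "123456789012.dkr.ecr.us-east-1.amazonaws.com/fraud-scorer"

def deploy_version(version: str) -> str:
    """Point the serving Deployment at `version`. Safe to re-run: if the
    Deployment already serves this version, nothing changes. Returns the
    previously deployed version so the caller can write it to the
    append-only audit log and use it for rollback."""
    config.load_kube_config()  # or load_incluster_config() in-cluster
    apps = client.AppsV1Api()
    dep = apps.read_namespaced_deployment(DEPLOYMENT, NAMESPACE)
    current = dep.spec.template.spec.containers[0].image
    target = f"{IMAGE_REPO}:{version}"
    if current == target:
        return version  # idempotent no-op: already serving this version
    patch = {"spec": {"template": {"spec": {"containers": [
        {"name": DEPLOYMENT, "image": target}]}}}}
    apps.patch_namespaced_deployment(DEPLOYMENT, NAMESPACE, patch)
    return current.rsplit(":", 1)[-1]  # previous tag, kept for rollback

def rollback(previous_version: str) -> None:
    """Rollback is just another idempotent deploy of the prior version."""
    deploy_version(previous_version)
```

Because tags are immutable model versions, the returned previous tag plus the Deployment's rollout history gives a replayable audit trail, and the <15-minute availability target is bounded by image pull and pod readiness rather than human handoffs.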
Constraints
- AWS-first environment; no multi-cloud design
- Small platform team: 3 data engineers, 2 ML engineers
- Monthly incremental infrastructure budget: $18K
- Must satisfy SOC 2 audit requirements with immutable deployment logs
- Feature definitions are maintained separately in a dbt/Snowflake analytics stack, so training-serving skew must be detected automatically (a detection sketch follows).
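Since feature definitions live in dbt/Snowflake while the model carries its own view of the features from training time, skew detection can compare a feature manifest frozen with the model artifact against a recent sample from the serving path. A minimal sketch, assuming the training job stores per-feature dtypes and a reference histogram; the PSI threshold of 0.2 is a common rule of thumb, not a tuned value:

```python
# Training-serving skew check: compare the feature manifest frozen with
# the model artifact against a recent sample from the serving path.
# ASSUMPTIONS: per-feature dtype + binned reference histogram are stored
# at training time; 0.2 is a conventional PSI alert threshold.
import math

def psi(expected: list[float], actual: list[float], eps: float = 1e-6) -> float:
    """Population Stability Index between two binned distributions."""
    return sum((a - e) * math.log((a + eps) / (e + eps))
               for e, a in zip(expected, actual))

def check_skew(train_manifest: dict, serving_stats: dict,
               threshold: float = 0.2) -> list[str]:
    """Return a list of skew findings; an empty list means no skew."""
    findings = []
    # Structural skew: features added, dropped, or retyped since training.
    for name in sorted(train_manifest.keys() - serving_stats.keys()):
        findings.append(f"{name}: present at training, missing in serving")
    for name in sorted(serving_stats.keys() - train_manifest.keys()):
        findings.append(f"{name}: computed in serving, unseen at training")
    # Distributional skew on the shared features.
    for name in train_manifest.keys() & serving_stats.keys():
        t, s = train_manifest[name], serving_stats[name]
        if t["dtype"] != s["dtype"]:
            findings.append(f"{name}: dtype {t['dtype']} -> {s['dtype']}")
        elif (score := psi(t["hist"], s["hist"])) > threshold:
            findings.append(f"{name}: PSI {score:.2f} exceeds {threshold}")
    return findings

# Example with hypothetical data: one drifted feature, one retyped.
train = {"txn_amount": {"dtype": "float", "hist": [0.5, 0.3, 0.2]},
         "merchant_risk": {"dtype": "float", "hist": [0.6, 0.3, 0.1]}}
serving = {"txn_amount": {"dtype": "float", "hist": [0.2, 0.3, 0.5]},
           "merchant_risk": {"dtype": "str", "hist": [0.6, 0.3, 0.1]}}
print(check_skew(train, serving))
```

Run as a scheduled Airflow task over a daily sample of logged serving features, a check like this closes the loop without coupling the dbt stack to the training code.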