Context
PixelForge, a collaborative design platform, stores uploaded design files in Amazon S3 and runs downstream processing to generate thumbnails, extract metadata, and build search indexes. Today, these jobs are triggered manually or by ad hoc cron scripts on EC2, causing missed runs, duplicate processing, and poor visibility into failures.
You need to design a recurring batch pipeline that schedules and orchestrates processing jobs for newly uploaded and updated design files while supporting retries, backfills, and operational monitoring.
Scale Requirements
- Input volume: 8M design files total, 250K new or updated files/day (back-of-envelope throughput math follows this list)
- File size: 5 MB average, 200 MB max
- Job frequency: Every 15 minutes for incremental processing; nightly full reconciliation
- Latency target: New files processed and queryable within 30 minutes of upload
- Storage: 40 TB raw files in S3, 2 TB metadata in warehouse
- Availability: 99.9% of scheduled runs complete successfully each month
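As a quick sanity check, the figures above imply roughly the following per-run and per-day volumes. This is a back-of-envelope sketch using only the numbers stated; the derived values are estimates, not additional targets.

```python
# Back-of-envelope estimates derived from the stated scale numbers.
TOTAL_FILES = 8_000_000          # total design files
CHANGED_PER_DAY = 250_000        # new or updated files per day
AVG_FILE_MB = 5                  # average file size in MB
RUNS_PER_DAY = 24 * 60 // 15     # one incremental run every 15 minutes -> 96 runs/day

files_per_run = CHANGED_PER_DAY / RUNS_PER_DAY             # ~2,600 files per 15-minute window
changed_gb_per_day = CHANGED_PER_DAY * AVG_FILE_MB / 1024  # ~1.2 TB of changed bytes per day
raw_storage_tb = TOTAL_FILES * AVG_FILE_MB / 1024 ** 2     # ~38 TB, consistent with the 40 TB figure

print(f"files per incremental run: {files_per_run:,.0f}")
print(f"changed data per day:      {changed_gb_per_day:,.0f} GB")
print(f"raw storage estimate:      {raw_storage_tb:,.1f} TB")
```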
Requirements
- Design a scheduler for recurring incremental jobs that discovers new or changed files since the last successful run (see the watermark and idempotency sketch after this list).
- Orchestrate dependent steps: file discovery, metadata extraction, thumbnail rendering, quality validation, and warehouse load.
- Ensure idempotent re-runs so retries or backfills do not create duplicate metadata or duplicate thumbnails.
- Support backfilling a date range when a downstream system has been unavailable for several hours.
- Track job state, run history, and per-step success/failure for operators.
- Load analytics-ready metadata into Snowflake for downstream reporting.
- Include monitoring, alerting, and failure recovery for delayed or failed schedules (a CloudWatch alarm sketch follows the Constraints list).
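One way to meet the discovery, idempotency, backfill, and warehouse-load requirements together is a watermark-driven incremental run whose outputs are keyed deterministically by file and version, so a retry or backfill overwrites prior results instead of duplicating them. The sketch below is illustrative only: the bucket names, the DESIGN_FILE_METADATA table, and the step helpers referenced in comments are assumptions, and the choice of orchestrator (Step Functions, MWAA, or a plain ECS scheduled task) is deliberately left open.

```python
"""Illustrative sketch: watermark-based discovery with idempotent outputs.

Assumed names (not from the brief): the S3 buckets, the designs/ prefix, and the
Snowflake table DESIGN_FILE_METADATA. Persisting the watermark (e.g. in DynamoDB
or a small S3 object) and the actual step implementations are out of scope here.
"""
from datetime import datetime, timezone
import hashlib

import boto3

s3 = boto3.client("s3")

UPLOAD_BUCKET = "pixelforge-uploads"   # assumed source bucket
DERIVED_BUCKET = "pixelforge-derived"  # assumed bucket for thumbnails and other artifacts


def discover_changed_files(watermark: datetime, until: datetime):
    """Yield (key, etag, last_modified) for objects changed in (watermark, until].

    S3 listing is not indexed by time, so this filters a full prefix listing;
    at ~8M objects an S3 Inventory manifest may be a better source, but the
    windowing logic stays the same.
    """
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=UPLOAD_BUCKET, Prefix="designs/"):
        for obj in page.get("Contents", []):
            if watermark < obj["LastModified"] <= until:
                yield obj["Key"], obj["ETag"].strip('"'), obj["LastModified"]


def thumbnail_key(source_key: str, etag: str) -> str:
    """Deterministic output key: re-processing the same (file, version) overwrites
    the same thumbnail object rather than creating a duplicate."""
    digest = hashlib.sha256(f"{source_key}:{etag}".encode()).hexdigest()[:16]
    return f"thumbnails/{digest}.png"


# Warehouse load keyed on (file_key, etag): a re-run updates the existing row for
# that file version instead of inserting it again.
MERGE_METADATA_SQL = """
MERGE INTO DESIGN_FILE_METADATA t
USING (SELECT %(file_key)s AS file_key, %(etag)s AS etag,
              %(size_bytes)s AS size_bytes, %(processed_at)s AS processed_at) s
ON t.file_key = s.file_key AND t.etag = s.etag
WHEN MATCHED THEN UPDATE SET size_bytes = s.size_bytes, processed_at = s.processed_at
WHEN NOT MATCHED THEN INSERT (file_key, etag, size_bytes, processed_at)
    VALUES (s.file_key, s.etag, s.size_bytes, s.processed_at)
"""


def run_incremental(watermark: datetime) -> datetime:
    """One incremental run; backfill is the same function replayed over older
    (watermark, until) windows."""
    until = datetime.now(timezone.utc)
    for key, etag, _modified in discover_changed_files(watermark, until):
        # Each step must be safe to repeat for the same (key, etag), e.g.:
        #   metadata = extract_metadata(key)                          # hypothetical helper
        #   render_thumbnail(key, DERIVED_BUCKET, thumbnail_key(key, etag))
        #   validate(metadata)
        #   cursor.execute(MERGE_METADATA_SQL, {...})                 # via Snowflake connector
        pass
    # Advance the watermark only after the whole window succeeded, so a failed
    # run is retried over the same window on the next schedule.
    return until
```

Advancing the watermark only after a fully successful run means a failed window is simply retried on the next schedule, and backfill reduces to replaying older windows with explicit bounds.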
Constraints
- Existing stack is AWS-first: S3, ECS, Snowflake, and CloudWatch are already approved.
- Team size is 3 data engineers; avoid overly complex self-managed infrastructure.
- Budget for new orchestration infrastructure is limited to $15K/month.
- Some files contain customer IP and must remain in AWS with audit logs retained for 1 year.
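For the monitoring and alerting requirement, one low-overhead pattern that stays inside the approved CloudWatch stack is to have each run emit a custom success metric and to alarm when runs fail or stop arriving. The namespace, metric, alarm name, and SNS topic below are placeholders, not names from the brief.

```python
"""Minimal CloudWatch monitoring sketch: each run emits a success/failure metric,
and an alarm fires if no successful run is seen within two schedule intervals.
Namespace, metric, alarm name, and SNS topic ARN are assumed placeholders."""
import boto3

cloudwatch = boto3.client("cloudwatch")


def report_run(success: bool) -> None:
    """Emit 1 for a successful incremental run, 0 for a failed one."""
    cloudwatch.put_metric_data(
        Namespace="PixelForge/Pipeline",
        MetricData=[{
            "MetricName": "IncrementalRunSuccess",
            "Value": 1.0 if success else 0.0,
            "Unit": "Count",
        }],
    )


def ensure_missed_run_alarm(sns_topic_arn: str) -> None:
    """Alarm when the success metric is absent or below 1 over two 15-minute
    periods, catching both failed runs and schedules that never fired."""
    cloudwatch.put_metric_alarm(
        AlarmName="pixelforge-incremental-run-missed",
        Namespace="PixelForge/Pipeline",
        MetricName="IncrementalRunSuccess",
        Statistic="Sum",
        Period=900,                      # one 15-minute schedule interval
        EvaluationPeriods=2,
        Threshold=1.0,
        ComparisonOperator="LessThanThreshold",
        TreatMissingData="breaching",    # no data at all also triggers the alarm
        AlarmActions=[sns_topic_arn],
    )
```

Treating missing data as breaching is what catches a scheduler that silently stops firing, which is a different failure mode from a run that starts and then fails.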