Context
FinEdge, a mid-sized fintech company, currently runs on-premises SQL Server Integration Services (SSIS) jobs that move transactional, customer, and billing data into a reporting database once per day. The company is migrating to the cloud and wants a modern pipeline on either AWS or Azure that improves reliability, supports incremental processing, and reduces the current 10-hour reporting delay.
You are asked to design a cloud-native batch ETL platform that ingests data from operational databases, object storage, and SaaS APIs into an analytics warehouse.
Scale Requirements
- Sources: 12 PostgreSQL/MySQL databases, 4 SaaS APIs, and daily CSV drops
- Volume: 2.5 TB/day of raw data; ~18 billion rows of historical backfill
- Batch frequency: Hourly incremental loads for critical tables; daily full loads for low-priority sources
- Latency target: Critical datasets queryable within 30 minutes of source update
- Retention: 1 year raw zone, 5 years curated warehouse data
Requirements
- Design a cloud architecture on AWS or Azure for ingesting relational, file-based, and API data sources.
- Support both historical backfills and incremental loads using CDC timestamps or change tracking.
- Implement transformation layers for raw, cleaned, and curated datasets.
- Orchestrate dependencies between ingestion, validation, transformation, and warehouse loading jobs.
- Define how you would enforce idempotency, schema evolution handling, and data quality checks.
- Provide monitoring for job failures, SLA breaches, row-count anomalies, and cost spikes.
- Explain how analytics users will access trusted tables in the warehouse.
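To illustrate the incremental-load requirement, here is a minimal high-watermark extraction sketch in Python. The table and column names (`transactions`, `updated_at`) are hypothetical, and SQLite stands in for a source database; a real pipeline would persist the watermark in durable pipeline state between runs.

```python
import sqlite3

def extract_incremental(conn, table, watermark):
    """Pull only rows modified since the last successful run (high-watermark CDC).

    Assumes the source table has an indexed `updated_at` timestamp column.
    The watermark should be persisted after each successful load.
    """
    rows = conn.execute(
        f"SELECT id, amount, updated_at FROM {table} "
        "WHERE updated_at > ? ORDER BY updated_at",
        (watermark,),
    ).fetchall()
    # New watermark = max timestamp seen; unchanged if no new rows arrived.
    new_watermark = rows[-1][2] if rows else watermark
    return rows, new_watermark

# --- demo against an in-memory SQLite stand-in for a source database ---
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (id INTEGER, amount REAL, updated_at TEXT)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?)",
    [(1, 10.0, "2024-01-01T00:00:00"),
     (2, 20.0, "2024-01-02T00:00:00"),
     (3, 30.0, "2024-01-03T00:00:00")],
)
rows, wm = extract_incremental(conn, "transactions", "2024-01-01T12:00:00")
print(len(rows), wm)  # only rows newer than the watermark are pulled
```

Re-running the extraction with the returned watermark pulls nothing, which is what makes hourly scheduling cheap relative to full reloads.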
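One common way to satisfy the idempotency requirement is to key every warehouse load on the primary key and upsert, so a replayed batch is a no-op rather than a duplicate. The sketch below uses SQLite's `INSERT ... ON CONFLICT` as a stand-in; warehouse-native equivalents are `MERGE` (Redshift, Synapse) or PostgreSQL's `ON CONFLICT`. The table name is illustrative.

```python
import sqlite3

def idempotent_load(conn, batch):
    """Upsert a batch keyed on the primary key so replays don't duplicate rows."""
    conn.executemany(
        "INSERT INTO curated_transactions (id, amount) VALUES (?, ?) "
        "ON CONFLICT(id) DO UPDATE SET amount = excluded.amount",
        batch,
    )
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE curated_transactions (id INTEGER PRIMARY KEY, amount REAL)")
batch = [(1, 10.0), (2, 20.0)]
idempotent_load(conn, batch)
idempotent_load(conn, batch)  # replaying the same batch leaves the table unchanged
count = conn.execute("SELECT COUNT(*) FROM curated_transactions").fetchone()[0]
print(count)
```

Because the load is a pure upsert, a failed orchestrator run can simply be retried end to end without a manual cleanup step.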
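For the row-count anomaly check named in the monitoring requirement, a simple baseline comparison often suffices before reaching for anything heavier. This sketch flags a load whose row count deviates from the trailing mean by more than a tolerance; the 50% threshold and sample counts are illustrative, not recommendations.

```python
def row_count_anomaly(history, today, tolerance=0.5):
    """Flag a load whose row count deviates more than `tolerance`
    (as a fraction) from the mean of recent daily counts."""
    baseline = sum(history) / len(history)
    deviation = abs(today - baseline) / baseline
    return deviation > tolerance

print(row_count_anomaly([1000, 1100, 950], 1020))  # within normal range
print(row_count_anomaly([1000, 1100, 950], 90))    # suspicious drop: alert
```

In practice the orchestrator would run this check as a validation task after each load and raise the alert through whatever notification channel the platform provides.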
Constraints
- Existing team has stronger SQL/Python skills than Spark expertise.
- Budget target is <$18K/month incremental cloud spend.
- PII data must be encrypted at rest and in transit.
- The design should prefer managed services over self-hosted infrastructure where possible.
- The company may standardize on either AWS or Azure, so your design should justify the platform choice clearly.