Context
FinSight, a B2B payments analytics company, currently runs ad hoc Python ETL jobs on EC2 instances managed separately by the data engineering and DevOps teams. Pipelines frequently fail during deployments, infrastructure changes, and schema updates because ownership boundaries, observability, and recovery procedures are unclear.
You need to design a robust batch-first pipeline platform that data engineers and DevOps engineers can jointly operate. The goal is to standardize ingestion, orchestration, deployment, monitoring, and incident response for finance reporting data flowing from operational PostgreSQL databases and third-party payment APIs into Snowflake.
Scale Requirements
- Sources: 12 PostgreSQL databases, 4 external REST APIs
- Volume: 1.2 TB/day raw data, ~8 billion rows/day
- Batch frequency: Hourly ingestion for operational tables; daily backfills covering up to 2 years of history
- Latency target: Source to analytics-ready tables within 30 minutes for hourly loads
- Reliability target: 99.9% successful DAG runs per month
- Retention: Raw data for 180 days, curated warehouse tables for 7 years
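A quick sanity check on the volume figures above helps size the hourly windows (decimal units assumed; figures are derived from the stated 1.2 TB/day and ~8 billion rows/day):

```python
# Back-of-envelope throughput implied by the scale requirements.
TB = 1_000_000_000_000  # decimal terabyte, in bytes

bytes_per_day = 1.2 * TB
rows_per_day = 8_000_000_000

bytes_per_hour = bytes_per_day / 24           # ~50 GB per hourly window
rows_per_hour = rows_per_day / 24             # ~333 M rows per hourly window
avg_row_bytes = bytes_per_day / rows_per_day  # ~150 bytes per raw row

print(f"{bytes_per_hour / 1e9:.0f} GB/hour, "
      f"{rows_per_hour / 1e6:.0f}M rows/hour, "
      f"{avg_row_bytes:.0f} bytes/row")
```

So each hourly load must land and transform roughly 50 GB (~333M rows) within the 30-minute latency target, which informs warehouse sizing and parallelism choices.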
Requirements
- Design a pipeline architecture that clearly separates responsibilities between data engineering and DevOps while preserving shared operational ownership.
- Ingest data incrementally from PostgreSQL and APIs, with support for schema evolution and replayable backfills.
- Orchestrate dependencies across extract, load, transform, and validation stages using a centralized scheduler.
- Ensure idempotent loads so reruns do not create duplicates or corrupt downstream tables.
- Implement automated data quality checks for freshness, row-count anomalies, null spikes, and referential integrity.
- Define CI/CD, infrastructure-as-code, secret management, and environment promotion across dev, staging, and prod.
- Provide monitoring, alerting, and failure recovery procedures that both teams can use during incidents.
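The idempotency requirement above can be sketched as a merge (upsert) keyed on the source primary key, so replaying an hourly batch cannot duplicate rows. The dict below simulates the target table purely for illustration; table and column names are hypothetical:

```python
# Minimal sketch of an idempotent load: upsert by primary key, so a
# rerun of the same batch is a no-op rather than an append.
def merge_batch(target: dict, batch: list[dict], key: str = "id") -> dict:
    """Upsert each record by primary key; last write for a key wins."""
    for record in batch:
        target[record[key]] = record  # insert or overwrite, never append
    return target

table: dict = {}
batch = [
    {"id": 1, "amount": 100, "loaded_at": "2024-06-01T10:00Z"},
    {"id": 2, "amount": 250, "loaded_at": "2024-06-01T10:00Z"},
]
merge_batch(table, batch)
merge_batch(table, batch)  # replay of the same batch
assert len(table) == 2     # no duplicates after the rerun

# Snowflake equivalent (illustrative names):
#   MERGE INTO curated.payments t
#   USING staging.payments s ON t.id = s.id
#   WHEN MATCHED THEN UPDATE SET t.amount = s.amount, ...
#   WHEN NOT MATCHED THEN INSERT (id, amount, ...) VALUES (s.id, s.amount, ...);
```

The same key-based merge also makes backfills replayable: reprocessing a historical window converges to the same final table state regardless of how many times it runs.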
Constraints
- AWS is the required cloud platform
- Incremental platform budget is capped at $18K/month
- PCI-related payment data must be encrypted in transit and at rest
- Team size: 3 data engineers, 2 DevOps engineers
- Minimize custom infrastructure; prefer managed services where possible