Context
FinSight, a B2B fintech analytics company, currently uses nightly ETL jobs to extract data from PostgreSQL, Salesforce, and Stripe, transform it on Apache Spark, and load curated tables into Snowflake. The platform now supports self-serve analytics for 300 internal users, but the pre-load ETL design slows schema changes, increases pipeline maintenance effort, and delays analyst access to raw source data.
Your task is to design a modernized batch pipeline and explain when ETL versus ELT is the appropriate pattern. The target architecture should preserve existing batch SLAs while shifting most transformations into Snowflake for faster iteration and better lineage.
Scale Requirements
- Sources: PostgreSQL OLTP (~2 TB), Salesforce (~150 GB), Stripe API (~80 GB/month)
- Daily ingest volume: ~450 GB/day incremental, ~12M changed rows/day
- Latency target: raw data available in the warehouse within 15 minutes of extraction; business marts refreshed hourly
- Retention: raw history for 2 years, curated marts for 5 years
- Concurrency: 300 BI users, 40 scheduled dashboard queries, 25 dbt jobs/hour
Requirements
- Design a batch pipeline that lands source data with minimal transformation and supports both ETL and ELT patterns.
- Show how raw, staging, and mart layers are modeled in Snowflake.
- Define how incremental loads, idempotent reruns, and backfills are handled (the layering and MERGE sketch after this list shows one approach).
- Include data quality checks for schema drift, null spikes, duplicate primary keys, and row-count anomalies (see the check sketch after this list).
- Explain which transformations should remain pre-load (ETL) versus post-load in-warehouse (ELT).
- Describe orchestration, monitoring, and failure recovery for hourly and nightly jobs (see the orchestration sketch after this list).
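
To make the layering and incremental-load requirements concrete, here is a minimal sketch assuming a RAW / STAGING / MARTS database layout, a hypothetical ORDERS table keyed by ORDER_ID with an EXTRACTED_AT watermark, and the snowflake-connector-python client. It lands files with COPY INTO and then MERGEs into the raw history table so that re-running a batch (or a backfill window) does not create duplicates; none of the object names are taken from the existing FinSight schema.

```python
"""Sketch of an idempotent incremental load into a layered Snowflake model.
Object names, the S3 stage, and the EXTRACTED_AT watermark are illustrative
assumptions, not the existing FinSight schema."""
import snowflake.connector

# Illustrative layering:
#   RAW.POSTGRES.ORDERS      -- landed as-is, append-only, 2-year retention
#   STAGING.CORE.STG_ORDERS  -- typed, deduplicated dbt model over RAW
#   MARTS.FINANCE.FCT_ORDERS -- curated business mart, refreshed hourly via dbt

# Land one batch of Parquet files into a transient landing table (ELT: no
# pre-load business logic). FORCE = TRUE reloads files on a rerun; the MERGE
# below is what keeps the rerun idempotent.
COPY_TO_LANDING = """
COPY INTO RAW.POSTGRES.ORDERS_LANDING
FROM @RAW.POSTGRES.S3_STAGE/orders/{batch_id}/
FILE_FORMAT = (TYPE = PARQUET)
MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE
FORCE = TRUE
"""

# MERGE keyed on the business key plus extraction timestamp: replaying the
# same batch or a historical backfill window inserts nothing new.
MERGE_INTO_RAW = """
MERGE INTO RAW.POSTGRES.ORDERS AS tgt
USING RAW.POSTGRES.ORDERS_LANDING AS src
  ON tgt.ORDER_ID = src.ORDER_ID AND tgt.EXTRACTED_AT = src.EXTRACTED_AT
WHEN NOT MATCHED THEN
  INSERT (ORDER_ID, PAYLOAD, EXTRACTED_AT)
  VALUES (src.ORDER_ID, src.PAYLOAD, src.EXTRACTED_AT)
"""


def load_batch(conn: snowflake.connector.SnowflakeConnection, batch_id: str) -> None:
    """Run one incremental load; a backfill is a loop over historical batch ids."""
    cur = conn.cursor()
    try:
        cur.execute("TRUNCATE TABLE RAW.POSTGRES.ORDERS_LANDING")
        cur.execute(COPY_TO_LANDING.format(batch_id=batch_id))
        cur.execute(MERGE_INTO_RAW)
    finally:
        cur.close()
```

In this layout the staging and mart layers would be dbt models selecting from RAW, which is where most of the post-load (ELT) logic lives and where schema changes stay cheap to iterate on.

For the data quality requirement, the sketch below reuses the same illustrative ORDERS table and runs lightweight SQL checks after each load: duplicate primary keys, a null-rate spike on a hypothetical customer_id field, schema drift via INFORMATION_SCHEMA, and a row-count anomaly against the trailing 7-day hourly average. Thresholds are placeholders, and in practice these could equally be expressed as dbt tests.

```python
"""Sketch of post-load data quality checks (duplicate keys, null spikes,
schema drift, row-count anomalies). Table names, columns, and thresholds
are illustrative assumptions."""

CHECKS = [
    # (name, SQL returning a single number, maximum allowed value)
    ("duplicate_primary_keys", """
        SELECT COUNT(*) FROM (
            SELECT ORDER_ID, EXTRACTED_AT FROM RAW.POSTGRES.ORDERS
            GROUP BY ORDER_ID, EXTRACTED_AT HAVING COUNT(*) > 1) AS dup
     """, 0),
    ("null_customer_id_pct_last_hour", """
        SELECT 100 * COUNT_IF(PAYLOAD:customer_id IS NULL) / NULLIF(COUNT(*), 0)
        FROM RAW.POSTGRES.ORDERS
        WHERE EXTRACTED_AT >= DATEADD(hour, -1, CURRENT_TIMESTAMP())
     """, 5),
    # Schema drift: any column not in the expected list fails the check.
    ("unexpected_columns", """
        SELECT COUNT(*) FROM RAW.INFORMATION_SCHEMA.COLUMNS
        WHERE TABLE_SCHEMA = 'POSTGRES' AND TABLE_NAME = 'ORDERS'
          AND COLUMN_NAME NOT IN ('ORDER_ID', 'PAYLOAD', 'EXTRACTED_AT')
     """, 0),
    # Row-count anomaly: last hour vs. trailing 7-day hourly average; a
    # mirror-image check with a minimum threshold would catch sudden drops.
    ("rowcount_pct_of_7d_hourly_avg", """
        SELECT 100 * COUNT_IF(EXTRACTED_AT >= DATEADD(hour, -1, CURRENT_TIMESTAMP()))
               / NULLIF(COUNT(*) / (24 * 7), 0)
        FROM RAW.POSTGRES.ORDERS
        WHERE EXTRACTED_AT >= DATEADD(day, -7, CURRENT_TIMESTAMP())
     """, 300),
]


def run_checks(cursor) -> list[str]:
    """Return the names of failed checks; the caller decides whether to halt
    downstream dbt runs or only raise an alert."""
    failures = []
    for name, sql, max_allowed in CHECKS:
        value = cursor.execute(sql).fetchone()[0] or 0
        if value > max_allowed:
            failures.append(f"{name}: {value} > {max_allowed}")
    return failures
```

One low-overhead way to meet the orchestration and recovery requirement with a three-engineer team is a managed Airflow deployment (e.g., MWAA, assuming Airflow 2.4+). The DAG below is a sketch only: the task callables are stubs, retries give automatic recovery from transient failures, and catchup is disabled so backfills are triggered deliberately rather than implicitly. A sibling nightly DAG would follow the same shape for the heavier full-refresh jobs.

```python
"""Minimal Airflow sketch of the hourly ELT cycle: extract -> land in RAW ->
quality checks -> dbt staging/marts. Callables and names are illustrative."""
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def placeholder(**_):
    """Stand-in for the real extract / load / check / dbt steps in this sketch."""


default_args = {
    "owner": "data-eng",
    "retries": 2,                        # automatic recovery from transient failures
    "retry_delay": timedelta(minutes=5),
    "email_on_failure": True,            # alert the on-call engineer
}

with DAG(
    dag_id="hourly_elt_orders",
    schedule="@hourly",
    start_date=datetime(2024, 1, 1),
    catchup=False,                       # backfills are triggered explicitly
    default_args=default_args,
    max_active_runs=1,                   # avoid overlapping loads of the same table
) as dag:
    extract = PythonOperator(task_id="extract_sources", python_callable=placeholder)
    land = PythonOperator(task_id="merge_into_raw", python_callable=placeholder)
    checks = PythonOperator(task_id="run_quality_checks", python_callable=placeholder)
    transform = PythonOperator(task_id="dbt_staging_and_marts", python_callable=placeholder)

    extract >> land >> checks >> transform
```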
Constraints
- Existing stack is AWS-based and Snowflake is already the central warehouse.
- Team size is 3 data engineers; operational simplicity matters more than perfect real-time freshness.
- Budget increase is capped at $15K/month.
- PII must be masked in analytics schemas and deleted within 30 days of verified deletion requests (one masking approach is sketched below).
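
A minimal sketch of one way to satisfy the masking constraint using Snowflake dynamic data masking; the policy, role, and column names are assumptions. The 30-day deletion obligation is a separate flow and would additionally need a scheduled DELETE against the raw layer plus a rebuild of affected marts.

```python
"""Sketch of column-level PII masking in the analytics schemas via Snowflake
masking policies. Policy, role, and column names are illustrative."""

CREATE_POLICY = """
CREATE MASKING POLICY IF NOT EXISTS MARTS.GOVERNANCE.MASK_EMAIL
AS (val STRING) RETURNS STRING ->
  CASE
    WHEN CURRENT_ROLE() IN ('PII_ANALYST') THEN val
    ELSE '***MASKED***'
  END
"""

APPLY_POLICY = """
ALTER TABLE MARTS.FINANCE.DIM_CUSTOMER
  MODIFY COLUMN EMAIL SET MASKING POLICY MARTS.GOVERNANCE.MASK_EMAIL
"""


def apply_pii_masking(cursor) -> None:
    """Create the masking policy and bind it to the customer email column,
    so BI roles see masked values while the PII_ANALYST role sees plaintext."""
    cursor.execute(CREATE_POLICY)
    cursor.execute(APPLY_POLICY)
```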