Context
FinPulse, a personal finance platform, ingests transaction, account balance, and user profile data from partner banks and internal product databases. Its current pipeline uses nightly batch jobs to load CSV and JSON extracts into Snowflake, but analysts frequently find duplicate records, schema drift, and inconsistent aggregates between source systems and warehouse tables.
You are asked to design a reliable batch-first data pipeline that ensures data accuracy and integrity from ingestion through serving. The company wants stronger validation, reproducibility, and recovery without rebuilding its entire AWS-based stack.
Scale Requirements
- Sources: 25 partner bank SFTP feeds, 8 internal PostgreSQL databases, 3 SaaS APIs
- Volume: 120M transaction records/day, ~1.5 TB raw data/day
- Batch cadence: Hourly for internal sources, daily for partner feeds
- Latency target: Internal data queryable within 30 minutes; partner feeds within 2 hours of arrival
- Retention: Raw data 1 year, curated warehouse tables 5 years
- Data quality SLA: 99.9% of records loaded with validated schema and referential integrity checks
Requirements
- Design ingestion for mixed file-based and database/API sources into a centralized raw zone.
- Implement schema validation, null checks, uniqueness checks, and reconciliation against source row counts and control totals.
- Prevent duplicate loads during retries and support idempotent backfills for historical dates.
- Build transformation layers for standardized transaction, account, and customer tables in Snowflake.
- Orchestrate dependencies across ingestion, validation, transformation, and publishing steps.
- Define monitoring, alerting, and failure recovery for late files, bad schemas, partial loads, and downstream model failures.
- Preserve auditability: every published table must be traceable to source file, extraction timestamp, and pipeline run ID.
Constraints
- Existing platform is AWS + Snowflake + Airflow; avoid introducing more than one major new platform.
- Team size is 3 data engineers and 1 analytics engineer.
- Budget increase is capped at $15K/month.
- Must satisfy SOX-style auditability and support PII masking for non-production environments.