Ensure Data Accuracy in ETL

Context

FinPulse, a personal finance platform, ingests transaction, account balance, and user profile data from partner banks and internal product databases. Its current pipeline uses nightly batch jobs to load CSV and JSON extracts into Snowflake, but analysts frequently find duplicate records, schema drift, and inconsistent aggregates between source systems and warehouse tables.

You are asked to design a reliable batch-first data pipeline that ensures data accuracy and integrity from ingestion through serving. The company wants stronger validation, reproducibility, and recovery without rebuilding its entire AWS-based stack.

Scale Requirements

Sources: 25 partner bank SFTP feeds, 8 internal PostgreSQL databases, 3 SaaS APIs
Volume: 120M transaction records/day, ~1.5 TB raw data/day
Batch cadence: Hourly for internal sources, daily for partner feeds
Latency target: Internal data queryable within 30 minutes; partner feeds within 2 hours of arrival
Retention: Raw data 1 year, curated warehouse tables 5 years
Data quality SLA: 99.9% of records loaded with validated schema and referential integrity checks

Requirements

Design ingestion for mixed file-based and database/API sources into a centralized raw zone.
Implement schema validation, null checks, uniqueness checks, and reconciliation against source row counts and control totals.
Prevent duplicate loads during retries and support idempotent backfills for historical dates.
Build transformation layers for standardized transaction, account, and customer tables in Snowflake.
Orchestrate dependencies across ingestion, validation, transformation, and publishing steps.
Define monitoring, alerting, and failure recovery for late files, bad schemas, partial loads, and downstream model failures.
Preserve auditability: every published table must be traceable to source file, extraction timestamp, and pipeline run ID.

Constraints

Existing platform is AWS + Snowflake + Airflow; avoid introducing more than one major new platform.
Team size is 3 data engineers and 1 analytics engineer.
Budget increase is capped at $15K/month.
Must satisfy SOX-style auditability and support PII masking for non-production environments.

Context

Scale Requirements

Sources: 25 partner bank SFTP feeds, 8 internal PostgreSQL databases, 3 SaaS APIs

Volume: 120M transaction records/day, ~1.5 TB raw data/day

Batch cadence: Hourly for internal sources, daily for partner feeds

Latency target: Internal data queryable within 30 minutes; partner feeds within 2 hours of arrival

Retention: Raw data 1 year, curated warehouse tables 5 years

Data quality SLA: 99.9% of records loaded with validated schema and referential integrity checks

Requirements

Design ingestion for mixed file-based and database/API sources into a centralized raw zone.

Implement schema validation, null checks, uniqueness checks, and reconciliation against source row counts and control totals.

Prevent duplicate loads during retries and support idempotent backfills for historical dates.

Build transformation layers for standardized transaction, account, and customer tables in Snowflake.

Orchestrate dependencies across ingestion, validation, transformation, and publishing steps.

Define monitoring, alerting, and failure recovery for late files, bad schemas, partial loads, and downstream model failures.

Preserve auditability: every published table must be traceable to source file, extraction timestamp, and pipeline run ID.

Problem

Context

Scale Requirements

Requirements

Constraints

Ensure Data Accuracy in ETL

Problem

Context

Scale Requirements

Requirements

Constraints