Context
FinLedger, a B2B payments platform, ingests transaction, refund, and settlement data from PostgreSQL OLTP databases, third-party payment processor APIs, and S3-delivered CSV files from banking partners. The current pipeline uses nightly Python cron jobs to load data into Snowflake, but analysts frequently find duplicate records, missing settlements, and inconsistent balances across downstream tables.
You need to redesign the pipeline so that data integrity is enforced end-to-end, while continuing to support daily financial reporting and near-real-time operational reconciliation.
Scale Requirements
- Sources: 3 PostgreSQL databases, 2 external REST APIs, 1 S3 partner file feed
- Volume: 120M transaction-related records/day, ~1.5 TB raw/day
- Peak ingestion: 8K records/sec from CDC streams, 200 partner files/day
- Latency: operational reconciliation tables must reflect source events within 10 minutes; finance reporting tables within 2 hours
- Retention: raw immutable data for 7 years; curated warehouse tables indefinitely
Requirements
- Design an ingestion and transformation pipeline that guarantees no silent data loss and minimizes duplicates.
- Support idempotent reprocessing for failed jobs, backfills, and late-arriving partner files (a keyed-merge sketch follows this list).
- Implement data integrity controls, including schema validation, record-level deduplication, referential integrity checks, and reconciliation between source totals and warehouse totals (see the reconciliation sketch after this list).
- Track lineage from source extract to curated Snowflake tables for auditability (a minimal lineage stamp is sketched below).
- Orchestrate batch and micro-batch jobs with clear dependency management and retry behavior (the Airflow sketch after this list shows one shape).
- Expose data quality status to finance and operations teams before downstream tables are marked ready.
- Define monitoring, alerting, and failure recovery for ingestion, transformation, and load stages.
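One way to satisfy the idempotency and deduplication requirements together is to derive a deterministic key for every record and load through a keyed merge instead of an append, so replayed batches converge on the same rows. This is a minimal illustration, not a prescribed design; the column names (`source_system`, `source_id`, `event_type`) and the `staging`/`curated` table names are assumptions.

```python
import hashlib

def record_key(source_system: str, source_id: str, event_type: str) -> str:
    """Deterministic key: the same source record always hashes to the
    same value, so reruns and backfills overwrite rather than duplicate."""
    raw = f"{source_system}|{source_id}|{event_type}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()

# Hypothetical Snowflake MERGE: rerunning a failed or late batch updates
# matched rows and inserts only genuinely new ones.
MERGE_SQL = """
MERGE INTO curated.transactions t
USING staging.transactions_batch s
    ON t.record_key = s.record_key
WHEN MATCHED THEN UPDATE SET
    amount = s.amount,
    status = s.status,
    updated_at = s.extracted_at
WHEN NOT MATCHED THEN INSERT
    (record_key, source_system, source_id, event_type,
     amount, status, updated_at)
    VALUES (s.record_key, s.source_system, s.source_id, s.event_type,
            s.amount, s.status, s.extracted_at);
"""
```

Because the key is a pure function of source identity, the same MERGE can serve normal loads, retries, and backfills without special-case code paths.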
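For the source-to-warehouse reconciliation control, one minimal approach is to compare row counts and summed amounts per (source, business date) and refuse to publish on any discrepancy. The tuple shapes and zero tolerance below are illustrative assumptions.

```python
from decimal import Decimal

def reconcile(source_totals: dict, warehouse_totals: dict,
              amount_tolerance: Decimal = Decimal("0.00")) -> list[str]:
    """Compare per-(source, business_date) counts and amount sums.
    Returns human-readable discrepancies; an empty list means clean."""
    problems = []
    for key, (src_count, src_amount) in source_totals.items():
        wh_count, wh_amount = warehouse_totals.get(key, (0, Decimal("0")))
        if src_count != wh_count:
            problems.append(f"{key}: count {src_count} != {wh_count}")
        if abs(src_amount - wh_amount) > amount_tolerance:
            problems.append(f"{key}: amount {src_amount} != {wh_amount}")
    # Keys present only in the warehouse are also a red flag.
    for key in warehouse_totals.keys() - source_totals.keys():
        problems.append(f"{key}: in warehouse but not in source")
    return problems
```

Run as its own task that alerts on any non-empty result, this turns silent loss into a visible failure: a dropped partner file surfaces as a count mismatch rather than a quietly smaller balance.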
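Lineage can start simply: stamp every raw record with where it came from and which run produced it, and carry those columns through every transformation into the curated tables. The field names below are one plausible shape, not a mandated schema.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class LineageStamp:
    """Columns carried on every row from raw landing to curated tables."""
    batch_id: str        # orchestrator run that extracted the record
    source_system: str   # e.g. "pg_payments", "processor_api", "s3_partner"
    source_object: str   # table name, API endpoint, or S3 object key
    extracted_at: datetime

def new_stamp(batch_id: str, source_system: str,
              source_object: str) -> LineageStamp:
    return LineageStamp(batch_id, source_system, source_object,
                        datetime.now(timezone.utc))
```

With these columns in place, an auditor can trace any curated row back to the exact extract that produced it.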
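Given the team's Airflow strength, the dependency, retry, and publish-gating requirements map naturally onto a DAG with a short-circuit quality gate: downstream tables are only marked ready if the gate passes. A schematic sketch, assuming Airflow 2.4+; task bodies and names are placeholders.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator, ShortCircuitOperator

def _load_batch(**_):      # placeholder extract/load step
    ...

def _run_transforms(**_):  # placeholder transformation step
    ...

def _quality_gate(**_):
    """Return True only when schema, dedup, and reconciliation checks pass;
    ShortCircuitOperator skips publish when this returns False."""
    return True  # placeholder: wire to real check results

def _publish(**_):         # marks curated tables ready for consumers
    ...

with DAG(
    dag_id="finledger_curated_load",
    start_date=datetime(2024, 1, 1),
    schedule="@hourly",  # micro-batch cadence; nightly DAGs would differ
    catchup=False,
    default_args={
        "retries": 3,
        "retry_delay": timedelta(minutes=5),
        "retry_exponential_backoff": True,
    },
):
    load = PythonOperator(task_id="load_batch", python_callable=_load_batch)
    transform = PythonOperator(task_id="run_transforms",
                               python_callable=_run_transforms)
    gate = ShortCircuitOperator(task_id="quality_gate",
                                python_callable=_quality_gate)
    publish = PythonOperator(task_id="publish_tables",
                             python_callable=_publish)

    load >> transform >> gate >> publish
```

The gate doubles as the data quality status signal: a skipped publish task is an unambiguous "not ready" for finance and operations consumers.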
Constraints
- AWS is the required cloud environment.
- Incremental budget is capped at $35K/month.
- PCI-related payment data must be encrypted in transit and at rest.
- Finance requires reproducible month-end reruns using historical raw snapshots (see the sketch after this list).
- The team has strong SQL/Airflow skills, moderate Spark experience, and no appetite for a fully custom platform.
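The reproducibility constraint is typically met by treating the raw S3 layer as immutable, date-partitioned snapshots and keying every rerun off a logical date rather than "latest". A sketch under that assumption; the bucket layout and prefix scheme are hypothetical.

```python
import boto3

def raw_snapshot_keys(bucket: str, source: str, ingest_date: str) -> list[str]:
    """List the immutable raw objects for one source and one logical date,
    e.g. ingest_date='2024-03-31' for a March month-end rerun."""
    prefix = f"raw/source={source}/ingest_date={ingest_date}/"
    paginator = boto3.client("s3").get_paginator("list_objects_v2")
    keys: list[str] = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return keys
```

A month-end rerun then re-executes the same transformations against exactly those keys, so results do not drift as new data lands.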