Context
LedgerBox, a B2B payments platform, ingests transaction, refund, and settlement data from PostgreSQL, Stripe APIs, and internal event logs into Snowflake for finance and operations reporting. The current pipeline relies on nightly Airflow jobs and ad hoc SQL checks, which have led to duplicate loads, schema drift, and reconciliation gaps between source systems and warehouse tables.
You need to redesign the pipeline so data integrity is enforced end-to-end across ingestion, transformation, and serving layers, while supporting both daily batch loads and targeted backfills.
Scale Requirements
- Sources: 3 PostgreSQL OLTP databases, Stripe REST API, S3 event exports
- Volume: 120M transaction-related records/day, ~1.5 TB raw/day
- Peak batch window: 8M rows/hour during settlement close
- Latency target: finance tables available within 15 minutes of source extract completion
- Retention: 7 years for curated finance data, 180 days for raw landing data
- Data quality SLA: critical tables must maintain 99.99% row-level accuracy
Requirements
- Design an ETL/ELT pipeline that guarantees no missing, duplicated, or partially applied records in curated tables.
- Support incremental loads using CDC or high-watermark extraction, with deterministic reprocessing for backfills (see the extraction sketch after this list).
- Implement integrity controls including schema validation, null/uniqueness checks, referential integrity, and source-to-target reconciliation (see the validation sketch below).
- Ensure transformations are idempotent so retries do not create duplicate facts or inconsistent dimensions (see the MERGE sketch below).
- Orchestrate dependencies across raw ingestion, staging validation, dbt transformations, and publish steps (see the orchestration sketch below).
- Provide a strategy for handling late-arriving records, source corrections, and schema evolution.
- Define monitoring, alerting, and recovery procedures for failed loads and data quality breaches.
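The sketches that follow are illustrative only; schema, table, and column names (payments.transactions, staging.*, curated.fct_transactions) are assumptions, not part of the requirements. The first shows one way to satisfy the incremental-load requirement: a high-watermark extract over a closed interval, stamped with a batch_id and source extract timestamp so reruns and backfills are deterministic and every load stays traceable.

```python
# Deterministic high-watermark extraction (sketch; schema and column names are
# hypothetical). Extracting a closed interval (low, high] means a rerun or
# backfill with the same bounds always yields the same rows.
# `conn` is an open DB-API connection to the source Postgres (e.g. psycopg2).
import uuid
from datetime import datetime, timezone

EXTRACT_SQL = """
    SELECT txn_id, amount_cents, currency, status, updated_at
    FROM payments.transactions
    WHERE updated_at >  %(low_watermark)s
      AND updated_at <= %(high_watermark)s
    ORDER BY updated_at, txn_id
"""

def extract_batch(conn, low_watermark, high_watermark):
    """Pull one bounded slice and tag it with audit metadata (batch_id, extract ts)."""
    batch_id = str(uuid.uuid4())
    extracted_at = datetime.now(timezone.utc)
    with conn.cursor() as cur:
        cur.execute(EXTRACT_SQL, {"low_watermark": low_watermark,
                                  "high_watermark": high_watermark})
        rows = cur.fetchall()
    manifest = {
        "batch_id": batch_id,
        "source": "payments.transactions",
        "low_watermark": low_watermark.isoformat(),
        "high_watermark": high_watermark.isoformat(),
        "row_count": len(rows),
        "source_extract_ts": extracted_at.isoformat(),
    }
    return rows, manifest  # rows land in S3; the manifest drives reconciliation later
```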
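A possible shape for the integrity controls, assuming the staged rows carry the batch_id from the extract manifest above and `conn` is a Snowflake Python connector session:

```python
# Staging validation gate (sketch; Snowflake table names are hypothetical).
# Each check returns a violation count; the batch is blocked if any check
# fails or if the staged row count disagrees with the extract manifest.
CHECKS = {
    "null_keys": (
        "SELECT COUNT(*) FROM staging.transactions "
        "WHERE batch_id = %(batch_id)s AND txn_id IS NULL"
    ),
    "duplicate_keys": (
        "SELECT COUNT(*) FROM (SELECT txn_id FROM staging.transactions "
        "WHERE batch_id = %(batch_id)s GROUP BY txn_id HAVING COUNT(*) > 1) AS dup"
    ),
    "orphan_refunds": (
        # Simplified referential check: a refund must point at a transaction that
        # is either already curated or arriving in the same batch.
        "SELECT COUNT(*) FROM staging.refunds r "
        "LEFT JOIN (SELECT txn_id FROM staging.transactions "
        "           UNION SELECT txn_id FROM curated.fct_transactions) t "
        "ON r.txn_id = t.txn_id "
        "WHERE r.batch_id = %(batch_id)s AND t.txn_id IS NULL"
    ),
}

def validate_batch(conn, batch_id, manifest_row_count):
    """Raise if the batch fails reconciliation or any integrity check."""
    failures = {}
    with conn.cursor() as cur:
        # Source-to-target reconciliation: staged rows must match the manifest.
        cur.execute(
            "SELECT COUNT(*) FROM staging.transactions WHERE batch_id = %(batch_id)s",
            {"batch_id": batch_id},
        )
        staged = cur.fetchone()[0]
        if staged != manifest_row_count:
            failures["row_count"] = f"staged {staged}, manifest {manifest_row_count}"
        # Null, uniqueness, and referential-integrity checks.
        for name, sql in CHECKS.items():
            cur.execute(sql, {"batch_id": batch_id})
            violations = cur.fetchone()[0]
            if violations:
                failures[name] = violations
    if failures:
        raise ValueError(f"batch {batch_id} failed validation: {failures}")
```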
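One way to make the load into curated tables idempotent and absorb late-arriving corrections: a Snowflake MERGE keyed on the natural key that only updates when the source row is newer, so retrying a batch cannot duplicate facts.

```python
# Idempotent publish into the curated fact table (sketch; names hypothetical).
# MERGE is keyed on the natural key and only updates when the source row is
# newer, so a retried batch or a late-arriving correction never duplicates facts.
MERGE_SQL = """
MERGE INTO curated.fct_transactions AS tgt
USING (
    SELECT txn_id, amount_cents, currency, status, updated_at, batch_id
    FROM staging.transactions
    WHERE batch_id = %(batch_id)s
) AS src
ON tgt.txn_id = src.txn_id
WHEN MATCHED AND src.updated_at > tgt.updated_at THEN UPDATE SET
    amount_cents = src.amount_cents,
    currency     = src.currency,
    status       = src.status,
    updated_at   = src.updated_at,
    batch_id     = src.batch_id
WHEN NOT MATCHED THEN INSERT
    (txn_id, amount_cents, currency, status, updated_at, batch_id)
VALUES
    (src.txn_id, src.amount_cents, src.currency, src.status,
     src.updated_at, src.batch_id)
"""

def merge_batch(conn, batch_id):
    """Apply one validated batch; safe to rerun with the same batch_id."""
    with conn.cursor() as cur:
        cur.execute(MERGE_SQL, {"batch_id": batch_id})
```

Because the MATCHED branch compares updated_at, replaying an old batch is effectively a no-op, which is what makes backfills safe to rerun.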
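And a sketch of the orchestration layer, assuming Airflow 2.4+; the DAG id and dbt selector are hypothetical, the task callables are stubs standing in for the steps above, and the failure callback is where alerting and recovery runbooks would hook in.

```python
# Orchestration and alerting (sketch; assumes Airflow 2.4+). Publish only runs
# if ingestion, validation, and dbt all succeed, so a failed or partial load
# never reaches the curated layer.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

def extract_to_landing():
    """Stub: run the watermark extract and land files plus manifest in S3."""

def validate_staging():
    """Stub: run reconciliation and integrity checks against staging."""

def merge_curated():
    """Stub: run the idempotent MERGE into curated tables."""

def alert_on_call(context):
    """Stub alert hook: replace with a Slack/PagerDuty notification."""
    print(f"Task {context['task_instance'].task_id} failed for {context['ds']}")

default_args = {
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
    "on_failure_callback": alert_on_call,
}

with DAG(
    dag_id="ledgerbox_finance_daily",   # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
    default_args=default_args,
) as dag:
    extract = PythonOperator(task_id="extract_to_landing", python_callable=extract_to_landing)
    validate = PythonOperator(task_id="validate_staging", python_callable=validate_staging)
    transform = BashOperator(task_id="dbt_run", bash_command="dbt run --select finance --fail-fast")
    publish = PythonOperator(task_id="publish_curated", python_callable=merge_curated)

    # Linear dependency chain: each step gates the next.
    extract >> validate >> transform >> publish
```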
Constraints
- Stack must remain AWS- and Snowflake-centric
- Team size is 5 data engineers; operational complexity should be moderate
- Finance data is SOX-auditable; every load must be traceable by batch_id and source extract timestamp
- Budget allows managed services, but not a large always-on Spark cluster
- Downstream BI dashboards must never read partially validated tables (see the publish-gate sketch below)
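On the last constraint, one common pattern is to have transformations build into a shadow table and swap it into the BI-facing name only after validation passes; a minimal sketch, again with hypothetical table names and a Snowflake connector session as `conn`:

```python
# Publish gate (sketch; table names hypothetical). dbt writes into a shadow
# "build" table; the BI-facing table is swapped in atomically only after the
# gate check passes, so dashboards never see a partially validated state.
def publish_if_valid(conn, batch_id, expected_rows):
    """Swap the build table into place, or raise and leave the old table live."""
    with conn.cursor() as cur:
        cur.execute(
            "SELECT COUNT(*) FROM curated.fct_transactions_build "
            "WHERE batch_id = %(batch_id)s",
            {"batch_id": batch_id},
        )
        if cur.fetchone()[0] != expected_rows:
            raise ValueError(f"batch {batch_id}: build table failed the publish gate")
        # Snowflake swaps the two tables in a single atomic statement; BI queries
        # see either the old table or the fully validated new one, never a mix.
        cur.execute(
            "ALTER TABLE curated.fct_transactions "
            "SWAP WITH curated.fct_transactions_build"
        )
```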