Context
Northstar Retail generates daily finance and operations reports from transactional data in PostgreSQL, Stripe, and Salesforce. The current Airflow batch pipeline loads data into Snowflake and runs dbt models, but report consumers regularly find missing rows, duplicate transactions, and metric drift after source schema changes.
You need to redesign the reporting pipeline so recurring reports are published only after data integrity checks pass, while keeping the system simple enough for a small data team to operate.
Scale Requirements
- Sources: PostgreSQL OLTP (~120M orders total), Stripe payouts/refunds (~8M records), Salesforce account updates (~15M records)
- Daily ingest volume: ~25M changed rows/day, ~180 GB raw/day
- Reporting cadence: 3 reports per day, at 06:00, 12:00, and 18:00 UTC
- Latency target: Source extract to validated report tables in < 45 minutes
- Historical retention: 3 years in Snowflake, 90 days of raw staged files in S3
- Data quality SLA: < 0.1% unexplained row-count variance vs source systems
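The row-count SLA above can be expressed as a small reconciliation helper. This is a minimal sketch; the function name and the default threshold parameter are illustrative assumptions, not part of the existing stack:

```python
def within_row_count_sla(source_rows: int, target_rows: int,
                         max_variance: float = 0.001) -> bool:
    """Return True when the unexplained row-count variance between a
    source system and its Snowflake target is within the < 0.1% SLA.

    max_variance defaults to 0.001 (0.1%); tune per source if needed.
    """
    if source_rows == 0:
        # An empty source reconciles only against an empty target.
        return target_rows == 0
    variance = abs(source_rows - target_rows) / source_rows
    return variance <= max_variance
```

In practice the two counts would come from a source-side query (PostgreSQL, Stripe, Salesforce export metadata) and a matching Snowflake query over the same extract window.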
Requirements
- Design a batch ETL/ELT pipeline that incrementally ingests source changes into Snowflake.
- Implement automated integrity validation for every run, including row-count reconciliation, primary-key uniqueness, null checks on critical fields, freshness checks, and aggregate balancing for revenue metrics.
- Prevent downstream report publication when validation fails, and provide a quarantine/error workflow for bad batches.
- Support backfills for a date range without creating duplicates or corrupting published reports.
- Define orchestration, retry behavior, and idempotent loading semantics.
- Include monitoring, alerting, and auditability so analysts can trace which source snapshot produced each report.
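A minimal sketch of the validate-then-gate requirement, assuming a batch represented as a list of dicts. In the real pipeline these checks would run as dbt tests or SQL against Snowflake, with Airflow short-circuiting the publish task on failure; all names below are hypothetical:

```python
def check_pk_unique(rows, key):
    """Primary-key uniqueness: no duplicate key values in the batch."""
    keys = [r[key] for r in rows]
    return len(keys) == len(set(keys))

def check_not_null(rows, fields):
    """Null check: every critical field is populated on every row."""
    return all(r.get(f) is not None for r in rows for f in fields)

def gate_batch(rows, key="order_id", critical=("order_id", "amount")):
    """Run integrity checks and return ('publish', []) on success, or
    ('quarantine', [failed check names]) so downstream report tasks can
    short-circuit instead of publishing a bad batch."""
    failures = []
    if not check_pk_unique(rows, key):
        failures.append("pk_uniqueness")
    if not check_not_null(rows, critical):
        failures.append("null_check")
    return ("quarantine", failures) if failures else ("publish", [])
```

The same pattern extends to row-count reconciliation, freshness, and aggregate balancing: each check contributes a named pass/fail result, and a single gate decides publish versus quarantine.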
Constraints
- Existing stack is AWS + Airflow 2.x + Snowflake + dbt; avoid introducing more than one major new platform.
- Team size: 3 data engineers, 2 analytics engineers.
- Finance reports are SOX-sensitive; every published dataset must be versioned and reproducible.
- Budget allows moderate warehouse scaling, but no always-on large compute clusters.
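For the SOX reproducibility constraint, each published dataset can carry a run manifest tying it to the exact source snapshots and model versions that produced it. The manifest schema and field names below are an illustrative assumption, not an existing artifact:

```python
import hashlib
import json
from datetime import datetime, timezone

def build_run_manifest(run_id, source_snapshots, model_versions):
    """Assemble an audit manifest for one pipeline run.

    source_snapshots: e.g. staged S3 paths plus extract watermarks
    model_versions:   e.g. the dbt project git SHA used for the run
    The checksum covers only the reproducibility-relevant fields, so
    two runs over identical inputs yield the same checksum.
    """
    manifest = {
        "run_id": run_id,
        "generated_at": datetime.now(timezone.utc).isoformat(),
        "source_snapshots": source_snapshots,
        "model_versions": model_versions,
    }
    payload = json.dumps(
        {k: manifest[k] for k in ("run_id", "source_snapshots", "model_versions")},
        sort_keys=True,
    )
    manifest["checksum"] = hashlib.sha256(payload.encode()).hexdigest()
    return manifest
```

Storing this manifest alongside each published report table gives analysts the trace from report back to source snapshot that the requirements call for.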