Context
FinSight, a B2B payments analytics company, currently runs ad hoc Python ETL jobs on EC2 instances managed separately by the data engineering and DevOps teams. Pipelines frequently fail during deployments, infrastructure changes, and schema updates because ownership boundaries, observability, and recovery procedures are unclear.
You need to design a robust batch-first pipeline platform that data engineers and DevOps engineers can jointly operate. The goal is to standardize ingestion, orchestration, deployment, monitoring, and incident response for finance reporting data flowing from operational PostgreSQL databases and third-party payment APIs into Snowflake.
Scale Requirements
- Sources: 12 PostgreSQL databases, 4 external REST APIs
- Volume: 1.2 TB/day raw data, ~8 billion rows/day
- Batch frequency: Hourly ingestion for operational tables; daily backfills covering up to 2 years of history
- Latency target: Source to analytics-ready tables within 30 minutes for hourly loads
- Reliability target: 99.9% successful DAG runs per month
- Retention: Raw data for 180 days, curated warehouse tables for 7 years
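A quick sanity check on the volume figures above helps size the hourly windows (decimal units assumed; figures are derived from the stated 1.2 TB/day and ~8 billion rows/day):

```python
# Back-of-envelope throughput implied by the scale requirements.
TB = 1_000_000_000_000  # decimal terabyte, in bytes

bytes_per_day = 1.2 * TB
rows_per_day = 8_000_000_000

bytes_per_hour = bytes_per_day / 24           # ~50 GB per hourly window
rows_per_hour = rows_per_day / 24             # ~333 M rows per hourly window
avg_row_bytes = bytes_per_day / rows_per_day  # ~150 bytes per raw row

print(f"{bytes_per_hour / 1e9:.0f} GB/hour, "
      f"{rows_per_hour / 1e6:.0f}M rows/hour, "
      f"{avg_row_bytes:.0f} bytes/row")
```

So each hourly load must land and transform roughly 50 GB (~333M rows) within the 30-minute latency target, which informs warehouse sizing and parallelism choices.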
Requirements
- Design a pipeline architecture that clearly separates responsibilities between data engineering and DevOps while preserving shared operational ownership.
- Ingest data incrementally from PostgreSQL and APIs, with support for schema evolution and replayable backfills.
- Orchestrate dependencies across extract, load, transform, and validation stages using a centralized scheduler.
- Ensure idempotent loads so reruns do not create duplicates or corrupt downstream tables.
- Implement automated data quality checks for freshness, row-count anomalies, null spikes, and referential integrity.
- Define CI/CD, infrastructure-as-code, secret management, and environment promotion across dev, staging, and prod.
- Provide monitoring, alerting, and failure recovery procedures that both teams can use during incidents.
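The idempotency requirement above can be sketched as a merge (upsert) keyed on the source primary key, so replaying an hourly batch cannot duplicate rows. The dict below simulates the target table purely for illustration; table and column names are hypothetical:

```python
# Minimal sketch of an idempotent load: upsert by primary key, so a
# rerun of the same batch is a no-op rather than an append.
def merge_batch(target: dict, batch: list[dict], key: str = "id") -> dict:
    """Upsert each record by primary key; last write for a key wins."""
    for record in batch:
        target[record[key]] = record  # insert or overwrite, never append
    return target

table: dict = {}
batch = [
    {"id": 1, "amount": 100, "loaded_at": "2024-06-01T10:00Z"},
    {"id": 2, "amount": 250, "loaded_at": "2024-06-01T10:00Z"},
]
merge_batch(table, batch)
merge_batch(table, batch)  # replay of the same batch
assert len(table) == 2     # no duplicates after the rerun

# Snowflake equivalent (illustrative names):
#   MERGE INTO curated.payments t
#   USING staging.payments s ON t.id = s.id
#   WHEN MATCHED THEN UPDATE SET t.amount = s.amount, ...
#   WHEN NOT MATCHED THEN INSERT (id, amount, ...) VALUES (s.id, s.amount, ...);
```

The same key-based merge also makes backfills replayable: reprocessing a historical window converges to the same final table state regardless of how many times it runs.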
Constraints
- AWS is the required cloud platform
- Incremental platform budget is capped at $18K/month
- PCI-related payment data must be encrypted in transit and at rest
- Team size: 3 data engineers, 2 DevOps engineers
- Minimize custom infrastructure; prefer managed services where possible