Context
FinSure, an insurance technology company, ingests customer policy, claims, and payment data from a PostgreSQL OLTP database, partner SFTP drops, and a Kafka topic carrying application events. The current pipeline is a mix of ad hoc Python scripts and nightly SQL jobs, which has led to duplicate loads, inconsistent record counts, and weak controls around PII handling.
You need to redesign the pipeline so analytics and compliance teams can trust the data while meeting security requirements for regulated customer information.
Scale Requirements
- Sources: PostgreSQL (~150 tables), partner SFTP files (~20K files/day), Kafka events (~15K events/sec peak)
- Daily volume: ~1.2 TB raw data/day
- Latency: batch data available within 30 minutes; streaming events available within 2 minutes
- Retention: 7 years for curated warehouse data, 90 days for raw landing zone (a rough storage estimate follows this list)
- Availability target: 99.9% for scheduled loads
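
A quick back-of-envelope storage estimate from the figures above, useful when weighing the budget constraint later in this brief. The 90-day landing-zone math follows directly from the stated numbers; the curated-layer compression ratio is an assumption, not part of the requirements:

```python
# Rough storage footprint implied by the scale requirements above.
# Only the 5x raw-to-curated compression ratio is an assumption.
RAW_TB_PER_DAY = 1.2
RAW_RETENTION_DAYS = 90
CURATED_RETENTION_YEARS = 7
ASSUMED_COMPRESSION = 5  # assumed ratio of raw size to curated columnar size

raw_landing_tb = RAW_TB_PER_DAY * RAW_RETENTION_DAYS                      # ~108 TB rolling
curated_tb = RAW_TB_PER_DAY * 365 * CURATED_RETENTION_YEARS / ASSUMED_COMPRESSION  # ~600+ TB

print(f"Raw landing zone (90-day rolling): {raw_landing_tb:.0f} TB")
print(f"Curated warehouse after 7 years (assumed {ASSUMED_COMPRESSION}x compression): {curated_tb:.0f} TB")
```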
Requirements
- Design an end-to-end pipeline that ingests batch and streaming data into a centralized platform.
- Ensure data integrity through schema validation, deduplication, idempotent loads, reconciliation checks, and audit logging (see the dedup/upsert sketch after this list).
- Ensure data security through encryption in transit and at rest, RBAC, secrets management, PII masking/tokenization, and access auditing (see the tokenization sketch after this list).
- Support CDC or incremental extraction from PostgreSQL and reliable ingestion of partner files.
- Build curated warehouse tables for policy, claims, and payments analytics.
- Define orchestration, monitoring, alerting, and failure recovery.
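
A minimal sketch of how the integrity and incremental-extraction bullets could fit together: a watermark-based incremental query against PostgreSQL, batch-level deduplication that keeps only the newest version of each record, and an upsert so re-running the same load is idempotent. The table and column names (`claims`, `claim_id`, `updated_at`) are illustrative assumptions, and the reconciliation and audit pieces are only noted in comments:

```python
# Sketch, assuming a claims table keyed by claim_id with an updated_at column.
from datetime import datetime, timezone

def dedupe_latest(rows: list[dict], key: str = "claim_id", version_col: str = "updated_at") -> list[dict]:
    """Keep only the newest version of each record in a batch (handles re-delivered files/events)."""
    latest: dict = {}
    for row in rows:
        existing = latest.get(row[key])
        if existing is None or row[version_col] > existing[version_col]:
            latest[row[key]] = row
    return list(latest.values())

def build_incremental_query(table: str, watermark_col: str = "updated_at") -> str:
    """Incremental extract: only rows changed since the last recorded high-watermark."""
    return f"SELECT * FROM {table} WHERE {watermark_col} > %(last_watermark)s ORDER BY {watermark_col}"

def build_upsert(table: str, columns: list[str], key: str = "claim_id") -> str:
    """Idempotent load: re-running the same batch overwrites rather than duplicates rows."""
    col_list = ", ".join(columns)
    placeholders = ", ".join(f"%({c})s" for c in columns)
    updates = ", ".join(f"{c} = EXCLUDED.{c}" for c in columns if c != key)
    return (
        f"INSERT INTO {table} ({col_list}) VALUES ({placeholders}) "
        f"ON CONFLICT ({key}) DO UPDATE SET {updates}"
    )

if __name__ == "__main__":
    batch = [
        {"claim_id": 1, "status": "open",   "updated_at": datetime(2024, 1, 1, tzinfo=timezone.utc)},
        {"claim_id": 1, "status": "closed", "updated_at": datetime(2024, 1, 2, tzinfo=timezone.utc)},
    ]
    deduped = dedupe_latest(batch)
    print(build_incremental_query("claims"))
    print(build_upsert("claims", ["claim_id", "status", "updated_at"]))
    # Reconciliation/audit hook: persist the high-watermark plus input/output row counts per run.
    print(f"{len(batch)} rows in, {len(deduped)} after dedup")
```

Recording the extracted high-watermark and the input/output row counts for every run gives the reconciliation checks and audit logging something concrete to compare against source-side counts.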
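
For the PII bullet, one common approach is deterministic tokenization: the same input always yields the same token, so analysts can still join policy, claims, and payment records on tokenized columns without ever seeing raw values. The sketch below uses HMAC-SHA256; in practice the key would come from a secrets manager rather than being hard-coded, and the field list shown is an assumption:

```python
# Sketch of deterministic PII tokenization; the key name and PII field list are assumptions.
import hmac
import hashlib

def tokenize(value: str, key: bytes) -> str:
    """HMAC-SHA256 of the value under a secret key; not reversible without the key."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()

def mask_record(record: dict, pii_fields: set[str], key: bytes) -> dict:
    """Return a copy of the record with PII fields replaced by tokens."""
    return {
        k: tokenize(str(v), key) if k in pii_fields else v
        for k, v in record.items()
    }

if __name__ == "__main__":
    secret = b"fetch-me-from-a-secrets-manager"  # placeholder only; never hard-code in production
    policy = {"policy_id": "P-1001", "ssn": "123-45-6789", "email": "a@example.com", "premium": 120.0}
    print(mask_record(policy, {"ssn", "email"}, secret))
```

A per-customer key variant of the same idea can also help with the 72-hour deletion requirement below: destroying a customer's key renders their tokens unrecoverable (crypto-shredding). That is a design option to weigh, not a given in this brief.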
Constraints
- AWS is the required cloud platform.
- Incremental budget is capped at $18K/month.
- Compliance requirements: SOC 2, GDPR, and support for customer deletion requests within 72 hours.
- Team size: 3 data engineers, so the solution should favor managed services over heavy platform operations.
- Historical backfill of 2 years must be supported without corrupting current production tables (see the staging-and-swap sketch after this list).
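
For the backfill constraint, a hedged sketch of one safe pattern: load the 2-year history into a separate staging table, reconcile it against the source, and only then swap it in atomically so the production table is never left partially written. The DDL is PostgreSQL-flavored for concreteness; the table names and reconciliation step are assumptions:

```python
# Sketch of a staging-and-swap backfill; table names are illustrative.
def backfill_statements(table: str) -> list[str]:
    staging = f"{table}_backfill"
    retired = f"{table}_retired"
    return [
        f"CREATE TABLE {staging} (LIKE {table});",
        f"-- bulk-load {staging} here, then validate row counts/checksums against the source",
        "BEGIN;",
        f"ALTER TABLE {table} RENAME TO {retired};",
        f"ALTER TABLE {staging} RENAME TO {table};",
        "COMMIT;",
        f"-- keep {retired} until reconciliation passes, then drop it",
    ]

if __name__ == "__main__":
    for stmt in backfill_statements("claims_curated"):
        print(stmt)
```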