Context
You’re interviewing for a Senior Data Engineer role on the Risk & Compliance Data Platform team at StripeFlow, a fintech payment processor that handles $40B/year in card-not-present transactions for mid-market e-commerce merchants across the US and EU. The company is expanding into card issuing and fraud analytics, and scrutiny of how payment and identity data flow through internal analytics systems is increasing under PCI DSS, SOC 2, and GDPR.
Today, StripeFlow’s analytics stack is split across clouds: transaction processing runs in AWS, customer identity verification (KYC) runs in GCP, and the central warehouse is Snowflake on AWS. The current pipeline is a mix of ad-hoc Airflow DAGs and manual backfills. Security reviews found multiple issues: long-lived credentials embedded in Airflow connections, overly permissive S3 bucket policies, inconsistent encryption settings, and incomplete audit trails for who accessed sensitive columns. A recent incident involved a contractor accidentally querying raw PAN-like fields in a staging table (masked in prod, unmasked in staging).
Your task is to design a secure, production-grade ETL/ELT pipeline that ingests payment events and KYC updates, transforms them into analytics-ready models, and serves them to internal fraud analysts and finance reporting teams, all while meeting strict security and compliance requirements.
Scale Requirements
- Ingestion volume:
- Payment events: 120K events/sec peak, 25K events/sec average
- KYC updates: 2K events/sec peak
- Data size: ~12 TB/day raw (JSON + Avro), 2 PB/year retained in lake/warehouse
- Freshness SLA:
- Fraud features: < 3 minutes end-to-end (event → queryable)
- Finance reporting: < 60 minutes
- Availability: 99.9% for fraud pipeline; RPO 15 minutes, RTO 30 minutes
- Retention:
- Raw immutable logs: 400 days (regulatory + investigations)
- Curated analytics tables: indefinite (with GDPR deletion support)
Data Characteristics
Key datasets
| Dataset | Example fields | Sensitivity |
|---|---|---|
| payment_event | event_id, merchant_id, amount, currency, card_fingerprint, ip_address, billing_postal, created_at | High (PII + payment-related) |
| chargeback | charge_id, reason_code, received_at, evidence_status | Medium |
| kyc_profile | user_id, legal_name, dob, address, document_type, verification_status | Very high (PII) |
Quality and operational issues
- Late-arriving data: up to 24 hours late for chargebacks; KYC updates can arrive out of order.
- Duplicates: retries from upstream services can produce duplicate event_id values (see the dedup sketch after this list).
- Schema evolution: new fields added weekly; occasional type changes (string → struct).
- Multi-tenant: strict isolation by merchant_id for some consumers.
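To make the dedup and late-arrival handling concrete, here is a minimal PySpark Structured Streaming sketch, assuming a Kafka topic named payments.events, an illustrative MSK endpoint, and a 24-hour lateness bound; the schema, bucket names, and trigger interval are assumptions, not the prescribed design.

```python
# Sketch: dedup of retried payment events at the raw -> curated hop, assuming
# Kafka (MSK) as the source. Topic, endpoint, schema, and S3 paths are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("payments-dedup").getOrCreate()

schema = (StructType()
          .add("event_id", StringType())
          .add("merchant_id", StringType())
          .add("amount", DoubleType())
          .add("currency", StringType())
          .add("created_at", TimestampType()))

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "b-1.msk.example:9094")  # assumed MSK bootstrap
       .option("subscribe", "payments.events")                     # assumed topic name
       .option("kafka.security.protocol", "SASL_SSL")              # encrypted in transit
       .load())

events = (raw
          .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

# The watermark bounds how long dedup state is retained; retried events carry the
# same event_id and created_at, so both columns participate in the dedup key.
deduped = (events
           .withWatermark("created_at", "24 hours")
           .dropDuplicates(["event_id", "created_at"]))

query = (deduped.writeStream
         .format("parquet")                                         # Iceberg/Delta in practice
         .option("path", "s3://stripeflow-silver/payments/")        # illustrative bucket
         .option("checkpointLocation", "s3://stripeflow-checkpoints/payments/")
         .trigger(processingTime="1 minute")                        # micro-batch within the 3-min SLA
         .start())
```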
Requirements
Functional requirements
- Ingest payment and KYC streams into a raw landing zone with immutability and replay capability.
- Transform into curated models for:
- fraud feature tables (near real-time)
- finance aggregates (hourly)
- compliance/audit extracts (on demand)
- Support backfills and reprocessing without double-counting (idempotent loads); a minimal sketch follows this list.
- Implement data classification (PII, SPI, confidential) and enforce handling rules.
- Provide row/column-level access controls for analysts, data scientists, and finance.
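One common way to satisfy the idempotent-load requirement (offered as an illustration, not the prescribed implementation) is to key the warehouse load on event_id with a MERGE so that replays and backfills upsert instead of appending. The sketch below uses the Snowflake Python connector; the database, schema, stage, and partition names are assumptions.

```python
# Sketch: idempotent reload of one partition into a curated Snowflake table,
# keyed on event_id so reprocessing never double-counts. Object names are illustrative.
import snowflake.connector

MERGE_SQL = """
MERGE INTO analytics.curated.payment_event AS tgt
USING (
    SELECT $1:event_id::string       AS event_id,
           $1:merchant_id::string    AS merchant_id,
           $1:amount::number(18, 2)  AS amount,
           $1:currency::string       AS currency,
           $1:created_at::timestamp  AS created_at
    FROM @analytics.raw.payments_stage/dt=2024-06-01/   -- partition being reprocessed
    QUALIFY ROW_NUMBER() OVER (PARTITION BY $1:event_id ORDER BY $1:created_at DESC) = 1
) AS src
ON tgt.event_id = src.event_id
WHEN MATCHED THEN UPDATE SET
    tgt.amount = src.amount,
    tgt.currency = src.currency,
    tgt.created_at = src.created_at
WHEN NOT MATCHED THEN INSERT (event_id, merchant_id, amount, currency, created_at)
    VALUES (src.event_id, src.merchant_id, src.amount, src.currency, src.created_at);
"""

def backfill_partition(conn_params: dict) -> None:
    # conn_params should come from a short-lived credential lookup, never from code.
    conn = snowflake.connector.connect(**conn_params)
    try:
        conn.cursor().execute(MERGE_SQL)
    finally:
        conn.close()
```

The same guarantee can be expressed declaratively in dbt as an incremental model with a unique_key of event_id, which compiles to a merge on Snowflake.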
Security & compliance requirements (core of the question)
- Encryption in transit and at rest across Kafka, object storage, and Snowflake.
- Secrets management: no long-lived keys in code or Airflow connections; use short-lived credentials (see the credential sketch after this list).
- Least privilege IAM across AWS/GCP/Snowflake; separate roles for ingest, transform, and read.
- Auditability: end-to-end lineage and access logs (who queried what, when), retained for 1 year.
- Data masking/tokenization for sensitive fields (e.g., names, DOB, address); prevent raw exposure in non-prod environments (a tokenization sketch also follows this list).
- GDPR/DSAR deletion: delete a user’s personal data within 72 hours across lake + warehouse + derived tables.
- Network controls: private connectivity (no public internet paths) where feasible.
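To make the secrets requirement concrete: rather than storing static keys in Airflow connections, a task can obtain short-lived credentials at runtime by assuming a narrowly scoped IAM role via STS. Below is a minimal boto3 sketch, assuming the Airflow workers on EKS already authenticate through IRSA; the role ARN and bucket name are illustrative.

```python
# Sketch: an Airflow task fetches 15-minute AWS credentials via STS instead of
# using a long-lived key stored in a connection. Role ARN and bucket are illustrative.
import boto3

INGEST_ROLE_ARN = "arn:aws:iam::123456789012:role/etl-ingest-read"  # assumed role

def list_raw_objects(prefix: str) -> list:
    sts = boto3.client("sts")  # authenticated via the pod's IAM role (IRSA), no stored keys
    creds = sts.assume_role(
        RoleArn=INGEST_ROLE_ARN,
        RoleSessionName="airflow-ingest",
        DurationSeconds=900,                       # credentials expire after 15 minutes
    )["Credentials"]

    s3 = boto3.client(
        "s3",
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"],
    )
    resp = s3.list_objects_v2(Bucket="stripeflow-raw", Prefix=prefix)  # illustrative bucket
    return [obj["Key"] for obj in resp.get("Contents", [])]
```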
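For the masking/tokenization requirement, one option is deterministic tokenization (a keyed HMAC) applied at the ingest boundary, so staging and other non-prod environments only ever see tokens that remain joinable across tables. A minimal sketch follows, assuming the key is fetched from KMS or Secrets Manager at runtime; the field names are illustrative.

```python
# Sketch: deterministic tokenization of sensitive fields with a keyed HMAC, so
# joins still work on the token but the raw value never reaches non-prod.
# Key handling (KMS/Secrets Manager) and field names are illustrative.
import hmac
import hashlib

SENSITIVE_FIELDS = {"legal_name", "dob", "address", "card_fingerprint"}

def tokenize(value: str, key: bytes) -> str:
    """Stable, irreversible token for a sensitive value (same input -> same token)."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()

def tokenize_record(record: dict, key: bytes) -> dict:
    """Return a copy of the record with sensitive fields replaced by tokens."""
    return {
        k: tokenize(v, key) if k in SENSITIVE_FIELDS and v is not None else v
        for k, v in record.items()
    }
```

Keyed hashing is irreversible; if originals must ever be recovered (e.g., for dispute handling), a vault-style tokenization service or Snowflake dynamic masking policies are the usual alternatives, which is one of the trade-offs the prompts below ask you to weigh.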
Non-functional requirements
- Cost target: incremental infra spend < $80K/month.
- Operable by a team of 5 data engineers with on-call rotation.
Constraints
- Existing orchestration is Apache Airflow 2.x (self-managed on EKS).
- Streaming platform is Kafka (Amazon MSK); batch compute is Spark on EMR.
- Warehouse is Snowflake; transformations are expected to use dbt where possible.
- Security team mandates: centralized KMS, mandatory key and credential rotation, and separation of duties between platform engineers and analysts.
Interview Prompts
- Propose an end-to-end architecture (stream + batch) that meets the SLAs and security requirements.
- Specify how you will implement:
- IAM role design (AWS, GCP, Snowflake)
- encryption and key management
- secrets management and credential rotation
- masking/tokenization and environment controls
- auditing, lineage, and incident response hooks
- Describe how you handle late-arriving data, deduplication, and schema evolution without breaking security guarantees.
- Provide a monitoring and alerting plan focused on both pipeline health and security posture (a minimal freshness-check sketch follows this list).
- Explain trade-offs (e.g., tokenization vs masking, streaming vs micro-batch, Snowflake external stages vs direct ingest).
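For the monitoring prompt, a minimal sketch of one pipeline-health probe: a freshness check on the fraud feature table against the 3-minute SLA. The table name and alerting hook are assumptions; an equivalent probe would cover security posture (e.g., detecting unmasked columns or newly public bucket policies).

```python
# Sketch: freshness probe for the fraud feature table, run every minute by the
# orchestrator. Table name, SLA threshold, and alert hook are illustrative.
import datetime as dt
import snowflake.connector

FRESHNESS_SLA = dt.timedelta(minutes=3)   # fraud features: event -> queryable in < 3 minutes

def check_fraud_freshness(conn_params: dict) -> None:
    conn = snowflake.connector.connect(**conn_params)   # short-lived creds, per the secrets design
    try:
        cur = conn.cursor()
        cur.execute("SELECT MAX(created_at) FROM analytics.features.fraud_events")
        latest = cur.fetchone()[0]
        if latest.tzinfo is None:                        # assume TIMESTAMP_NTZ stored as UTC
            latest = latest.replace(tzinfo=dt.timezone.utc)
        lag = dt.datetime.now(dt.timezone.utc) - latest
        if lag > FRESHNESS_SLA:
            # In practice: emit a metric and page on-call (CloudWatch alarm, PagerDuty, etc.).
            raise RuntimeError(f"Fraud feature freshness SLA breached: lag={lag}")
    finally:
        conn.close()
```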