Context
You’re interviewing with the Payments Data Platform team at a large fintech that processes card and ACH transactions for ~120K merchants. The company supports 18M consumer wallets and peaks at 35K payment events/sec during regional shopping spikes. The platform's data powers (1) real-time fraud features, (2) merchant settlement and reconciliation, and (3) regulatory reporting (SOX + PCI). A prolonged outage or incorrect reprocessing can lead to wrong settlement amounts, double-charging, or regulatory audit findings.
Today, the platform runs a hybrid pipeline:
- Streaming: Payment authorization/clearing events flow through Kafka into Spark Structured Streaming for enrichment and are landed to S3 and Snowflake for near-real-time analytics.
- Batch: Nightly Airflow + Spark jobs compute merchant-level settlement aggregates and reconciliation tables in Snowflake using dbt.
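For concreteness, the streaming leg described above might look roughly like the sketch below; the topic name, S3 paths, schema fields, and trigger interval are illustrative assumptions, not the team's actual configuration:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("payments-enrichment").getOrCreate()

# Hypothetical subset of the event schema (the real schema evolves monthly).
event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("payment_id", StringType()),
    StructField("merchant_id", StringType()),
    StructField("event_type", StringType()),
    StructField("event_ts", TimestampType()),
])

# Read authorization/clearing events from Kafka (broker and topic names assumed).
raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "kafka-primary:9092")
    .option("subscribe", "payments.events")
    .load()
)

events = raw.select(from_json(col("value").cast("string"), event_schema).alias("e")).select("e.*")

# Land enriched events to S3; note that the checkpoint currently lives in the
# same region as compute, which is part of the stated DR weakness.
query = (
    events.writeStream.format("parquet")
    .option("path", "s3://payments-raw/landing/")
    .option("checkpointLocation", "s3://payments-checkpoints/enrichment/")
    .trigger(processingTime="30 seconds")
    .start()
)
```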
The current DR posture is weak: Kafka is single-region, Airflow metadata is single-AZ, and Spark checkpoints live in the same region as compute. A recent cloud incident caused a 2-hour regional impairment; the team recovered manually but had to re-run jobs and later discovered duplicate rows in Snowflake and mismatched settlement totals.
Your task is to propose best-practice disaster recovery for this data processing system—covering architecture, operations, and correctness—while meeting strict business SLAs.
Scale Requirements
- Ingest throughput: avg 8K events/sec, peak 35K events/sec
- Event size: ~1.5 KB JSON (auth/clearing/chargeback/refund)
- Daily volume: ~1.2B events/day (~1.8 TB/day raw); see the sizing sketch after this list
- Latency SLOs:
  - Fraud feature tables: P95 < 60 seconds from event time
  - Analytics tables in Snowflake: P95 < 5 minutes
  - Settlement aggregates: available by 06:00 UTC daily
- Retention:
  - Kafka: 7 days
  - S3 raw: 2 years (immutable)
  - Snowflake curated: 7 years (audit)
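These figures translate directly into the volumes any DR design has to replicate and hold in a second region. A back-of-envelope sizing sketch, using only the numbers stated above plus an assumed Kafka replication factor:

```python
# Back-of-envelope sizing from the figures above (assumptions labeled inline).
EVENT_SIZE_KB = 1.5
PEAK_EVENTS_PER_SEC = 35_000
DAILY_EVENTS = 1.2e9
KAFKA_RETENTION_DAYS = 7
KAFKA_REPLICATION_FACTOR = 3          # assumption; not stated in the brief

daily_raw_tb = DAILY_EVENTS * EVENT_SIZE_KB / 1024**3          # ~1.7 TiB/day (~1.8 TB decimal, as stated)
peak_ingest_mb_s = PEAK_EVENTS_PER_SEC * EVENT_SIZE_KB / 1024  # ~51 MiB/s at peak

# On-disk Kafka footprint that would need to exist in (or be replicated to)
# a second region under the current 7-day retention.
kafka_footprint_tb = daily_raw_tb * KAFKA_RETENTION_DAYS * KAFKA_REPLICATION_FACTOR  # ~35 TiB

print(f"raw landing volume ~= {daily_raw_tb:.1f} TiB/day")
print(f"peak ingest        ~= {peak_ingest_mb_s:.0f} MiB/s")
print(f"Kafka footprint    ~= {kafka_footprint_tb:.0f} TiB per cluster")
```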
Data Characteristics
Key event types
| Event Type | Example Keys | Notes |
|---|---|---|
| authorization | event_id, payment_id, merchant_id, amount, currency, event_ts | high volume; may be retried by upstream |
| clearing | event_id, payment_id, network_ref, event_ts | can arrive hours late |
| refund | event_id, payment_id, refund_id, event_ts | may arrive days later |
| chargeback | event_id, payment_id, case_id, event_ts | low volume, very late |
Quality and correctness issues
- At-least-once producers: duplicates are possible (same event_id); see the dedup sketch after this list
- Out-of-order: clearing/refund/chargeback can arrive after authorization by hours/days
- Schema evolution: new fields added monthly; occasional breaking changes from upstream
- Backfills: compliance requires periodic reprocessing for corrected reference data (FX rates, merchant mapping)
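These characteristics are what make naive replay after a failover risky. As a hedged illustration only (not the team's current code, and the window size is an assumption), a Structured Streaming stage could drop replayed duplicates within a bounded watermark window; events later than the watermark still have to be reconciled downstream:

```python
# Hypothetical dedup stage; `events` is the parsed stream from the earlier sketch.
# The watermark bounds Spark's dedup state. Retries are assumed to carry the same
# event_ts, so deduping on (event_id, event_ts) lets old state be evicted.
deduped = (
    events
    .withWatermark("event_ts", "2 hours")          # window size is an assumption
    .dropDuplicates(["event_id", "event_ts"])
)
# Clearing/refund/chargeback events arriving hours or days late fall outside any
# practical watermark and must instead be merged idempotently in the curated layer.
```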
Requirements
Functional requirements
- Propose a DR design that supports regional failover for streaming ingestion and processing.
- Ensure idempotent processing and reprocessing so that failover + retries do not create duplicates in Snowflake (see the sketch after this list).
- Support late-arriving events with deterministic updates to curated tables (e.g., settlement and reconciliation).
- Provide a recovery runbook: how to fail over, how to fail back, and how to validate correctness.
- Include a strategy for backfills after a multi-hour outage without violating downstream SLAs.
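The idempotency requirement is commonly met by staging each micro-batch and merging on the natural key, so replays after a failover update rows rather than append them. The following is a sketch under assumptions (table names, columns, connection options, and the helper are hypothetical), not a prescribed answer:

```python
def upsert_to_snowflake(batch_df, batch_id):
    """foreachBatch sink: stage the micro-batch, then MERGE on event_id.

    The staging-then-MERGE shape is the point; table names, columns, and the
    run_snowflake_query helper are hypothetical.
    """
    (batch_df.write
        .format("snowflake")
        .options(**sf_options)                 # Snowflake connection options, assumed defined elsewhere
        .option("dbtable", "STG_PAYMENT_EVENTS")
        .mode("overwrite")
        .save())

    run_snowflake_query("""
        MERGE INTO CURATED.PAYMENT_EVENTS t
        USING STG_PAYMENT_EVENTS s
          ON t.event_id = s.event_id
        WHEN MATCHED THEN UPDATE SET t.event_ts = s.event_ts, t.amount = s.amount
        WHEN NOT MATCHED THEN INSERT (event_id, payment_id, merchant_id, amount, event_ts)
             VALUES (s.event_id, s.payment_id, s.merchant_id, s.amount, s.event_ts)
    """)

(deduped.writeStream
    .foreachBatch(upsert_to_snowflake)
    .option("checkpointLocation", "s3://payments-checkpoints/curated/")   # path is illustrative
    .start())
```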
Non-functional requirements
- Define explicit RPO/RTO targets for each layer (Kafka, raw landing, curated tables, orchestration metadata).
- Meet PCI/SOX expectations: auditability, access controls, immutable raw logs, and provable completeness.
- Minimize incremental cost: target < $60K/month additional spend.
Constraints
- Cloud: AWS primary today; Snowflake is already in use.
- Team: 6 data engineers, 1 SRE. Strong Spark/Airflow/dbt skills; moderate Kafka skills.
- You cannot change the upstream payment event producers quickly (assume at-least-once semantics remain).
- Some consumers require exactly-once effects in curated tables (settlement/recon), even if ingestion is at-least-once.
What we want from you (interview deliverables)
Explain and justify:
- A target DR architecture (active-active vs active-passive) and why.
- How you would replicate Kafka topics (or replace with an alternative) and handle consumer offsets/checkpoints.
- How raw data landing in S3 is made durable across regions.
- How Snowflake loads and dbt models remain consistent during failover.
- How you detect data loss/duplication during and after recovery.
- A testing plan: game days, chaos drills, and validation queries.
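As a concrete example of the kind of validation query the last two bullets refer to, a post-recovery completeness check often compares per-hour distinct event counts between the immutable S3 raw zone and the curated Snowflake table. A sketch, with paths, table names, and the recovery window all assumed:

```python
from pyspark.sql import functions as F

# Per-hour counts from the immutable S3 raw zone (source of truth for completeness).
raw_counts = (
    spark.read.parquet("s3://payments-raw/landing/")
    .where(F.col("event_ts").between("2024-05-01", "2024-05-02"))   # recovery window is an example
    .groupBy(F.date_trunc("hour", "event_ts").alias("hour"))
    .agg(F.countDistinct("event_id").alias("raw_events"))
)

# Matching counts from the curated Snowflake table, read back via the connector.
curated_counts = (
    spark.read.format("snowflake").options(**sf_options)
    .option("query", """
        SELECT DATE_TRUNC('hour', event_ts) AS hour,
               COUNT(DISTINCT event_id)     AS curated_events
        FROM CURATED.PAYMENT_EVENTS
        WHERE event_ts BETWEEN '2024-05-01' AND '2024-05-02'
        GROUP BY 1
    """)
    .load()
)

# Any nonzero delta indicates loss (raw > curated) or duplication slipping past dedup.
drift = (
    raw_counts.join(curated_counts, "hour", "full_outer")
    .withColumn("delta", F.coalesce("raw_events", F.lit(0)) - F.coalesce("curated_events", F.lit(0)))
    .where(F.col("delta") != 0)
)
drift.show(truncate=False)
```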