Context
You’re interviewing with the Payments Data Platform team at a large fintech that processes card and ACH transactions for ~120K merchants. The company supports 18M consumer wallets and peaks at 35K payment events/sec during regional shopping spikes. The platform's data powers (1) real-time fraud features, (2) merchant settlement and reconciliation, and (3) regulatory reporting (SOX + PCI). A prolonged outage or incorrect reprocessing can lead to wrong settlement amounts, double-charging, or regulatory audit findings.
Today, the platform runs a hybrid pipeline:
- Streaming: Payment authorization/clearing events flow through Kafka into Spark Structured Streaming for enrichment and are landed to S3 and Snowflake for near-real-time analytics.
- Batch: Nightly Airflow + Spark jobs compute merchant-level settlement aggregates and reconciliation tables in Snowflake using dbt.
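For concreteness, the streaming leg described above might look roughly like the sketch below; the topic name, S3 paths, schema fields, and trigger interval are illustrative assumptions, not the team's actual configuration:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("payments-enrichment").getOrCreate()

# Hypothetical subset of the event schema (the real schema evolves monthly).
event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("payment_id", StringType()),
    StructField("merchant_id", StringType()),
    StructField("event_type", StringType()),
    StructField("event_ts", TimestampType()),
])

# Read authorization/clearing events from Kafka (broker and topic names assumed).
raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "kafka-primary:9092")
    .option("subscribe", "payments.events")
    .load()
)

events = raw.select(from_json(col("value").cast("string"), event_schema).alias("e")).select("e.*")

# Land enriched events to S3; note that the checkpoint currently lives in the
# same region as compute, which is part of the stated DR weakness.
query = (
    events.writeStream.format("parquet")
    .option("path", "s3://payments-raw/landing/")
    .option("checkpointLocation", "s3://payments-checkpoints/enrichment/")
    .trigger(processingTime="30 seconds")
    .start()
)
```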
The current DR posture is weak: Kafka is single-region, Airflow metadata is single-AZ, and Spark checkpoints live in the same region as compute. A recent cloud incident caused a 2-hour regional impairment; the team recovered manually but had to re-run jobs and later discovered duplicate rows in Snowflake and mismatched settlement totals.
Your task is to propose best-practice disaster recovery for this data processing system—covering architecture, operations, and correctness—while meeting strict business SLAs.
Scale Requirements
- Ingest throughput: avg 8K events/sec, peak 35K events/sec
- Event size: ~1.5 KB JSON (auth/clearing/chargeback/refund)
- Daily volume: ~1.2B events/day (~1.8 TB/day raw); see the sizing sketch after this list
- Latency SLOs:
  - Fraud feature tables: P95 < 60 seconds from event time
  - Analytics tables in Snowflake: P95 < 5 minutes
  - Settlement aggregates: available by 06:00 UTC daily
- Retention:
  - Kafka: 7 days
  - S3 raw: 2 years (immutable)
  - Snowflake curated: 7 years (audit)
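These figures translate directly into the volumes any DR design has to replicate and hold in a second region. A back-of-envelope sizing sketch, using only the numbers stated above plus an assumed Kafka replication factor:

```python
# Back-of-envelope sizing from the figures above (assumptions labeled inline).
EVENT_SIZE_KB = 1.5
PEAK_EVENTS_PER_SEC = 35_000
DAILY_EVENTS = 1.2e9
KAFKA_RETENTION_DAYS = 7
KAFKA_REPLICATION_FACTOR = 3          # assumption; not stated in the brief

daily_raw_tb = DAILY_EVENTS * EVENT_SIZE_KB / 1024**3          # ~1.7 TiB/day (~1.8 TB decimal, as stated)
peak_ingest_mb_s = PEAK_EVENTS_PER_SEC * EVENT_SIZE_KB / 1024  # ~51 MiB/s at peak

# On-disk Kafka footprint that would need to exist in (or be replicated to)
# a second region under the current 7-day retention.
kafka_footprint_tb = daily_raw_tb * KAFKA_RETENTION_DAYS * KAFKA_REPLICATION_FACTOR  # ~35 TiB

print(f"raw landing volume ~= {daily_raw_tb:.1f} TiB/day")
print(f"peak ingest        ~= {peak_ingest_mb_s:.0f} MiB/s")
print(f"Kafka footprint    ~= {kafka_footprint_tb:.0f} TiB per cluster")
```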
Data Characteristics
Key event types
| Event Type | Example Keys | Notes |
|---|---|---|
| authorization | event_id, payment_id, merchant_id, amount, currency, event_ts | high volume; may be retried by upstream |
| clearing | event_id, payment_id, network_ref, event_ts | can arrive hours late |
| refund | event_id, payment_id, refund_id, event_ts | may arrive days later |
| chargeback | event_id, payment_id, case_id, event_ts | low volume, very late |
Quality and correctness issues
- At-least-once producers: duplicates are possible (same event_id); see the dedup sketch after this list
- Out-of-order: clearing/refund/chargeback can arrive after authorization by hours/days
- Schema evolution: new fields added monthly; occasional breaking changes from upstream
- Backfills: compliance requires periodic reprocessing for corrected reference data (FX rates, merchant mapping)
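These characteristics are what make naive replay after a failover risky. As a hedged illustration only (not the team's current code, and the window size is an assumption), a Structured Streaming stage could drop replayed duplicates within a bounded watermark window; events later than the watermark still have to be reconciled downstream:

```python
# Hypothetical dedup stage; `events` is the parsed stream from the earlier sketch.
# The watermark bounds Spark's dedup state. Retries are assumed to carry the same
# event_ts, so deduping on (event_id, event_ts) lets old state be evicted.
deduped = (
    events
    .withWatermark("event_ts", "2 hours")          # window size is an assumption
    .dropDuplicates(["event_id", "event_ts"])
)
# Clearing/refund/chargeback events arriving hours or days late fall outside any
# practical watermark and must instead be merged idempotently in the curated layer.
```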
Requirements
Functional requirements
- Propose a DR design that supports regional failover for streaming ingestion and processing.
- Ensure idempotent processing and reprocessing so that failover + retries do not create duplicates in Snowflake (see the sketch after this list).
- Support late-arriving events with deterministic updates to curated tables (e.g., settlement and reconciliation).
- Provide a recovery runbook: how to fail over, how to fail back, and how to validate correctness.
- Include a strategy for backfills after a multi-hour outage without violating downstream SLAs.
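The idempotency requirement is commonly met by staging each micro-batch and merging on the natural key, so replays after a failover update rows rather than append them. The following is a sketch under assumptions (table names, columns, connection options, and the helper are hypothetical), not a prescribed answer:

```python
def upsert_to_snowflake(batch_df, batch_id):
    """foreachBatch sink: stage the micro-batch, then MERGE on event_id.

    The staging-then-MERGE shape is the point; table names, columns, and the
    run_snowflake_query helper are hypothetical.
    """
    (batch_df.write
        .format("snowflake")
        .options(**sf_options)                 # Snowflake connection options, assumed defined elsewhere
        .option("dbtable", "STG_PAYMENT_EVENTS")
        .mode("overwrite")
        .save())

    run_snowflake_query("""
        MERGE INTO CURATED.PAYMENT_EVENTS t
        USING STG_PAYMENT_EVENTS s
          ON t.event_id = s.event_id
        WHEN MATCHED THEN UPDATE SET t.event_ts = s.event_ts, t.amount = s.amount
        WHEN NOT MATCHED THEN INSERT (event_id, payment_id, merchant_id, amount, event_ts)
             VALUES (s.event_id, s.payment_id, s.merchant_id, s.amount, s.event_ts)
    """)

(deduped.writeStream
    .foreachBatch(upsert_to_snowflake)
    .option("checkpointLocation", "s3://payments-checkpoints/curated/")   # path is illustrative
    .start())
```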
Non-functional requirements
- Define explicit RPO/RTO targets for each layer (Kafka, raw landing, curated tables, orchestration metadata).
- Meet PCI/SOX expectations: auditability, access controls, immutable raw logs, and provable completeness.
- Minimize incremental cost: target < $60K/month additional spend.
Constraints
- Cloud: AWS primary today; Snowflake is already in use.
- Team: 6 data engineers, 1 SRE. Strong Spark/Airflow/dbt skills; moderate Kafka skills.
- You cannot change the upstream payment event producers quickly (assume at-least-once semantics remain).
- Some consumers require exactly-once effects in curated tables (settlement/recon), even if ingestion is at-least-once.
What we want from you (interview deliverables)
Explain and justify:
- A target DR architecture (active-active vs active-passive) and why.
- How you would replicate Kafka topics (or replace with an alternative) and handle consumer offsets/checkpoints.
- How raw data landing in S3 is made durable across regions.
- How Snowflake loads and dbt models remain consistent during failover.
- How you detect data loss/duplication during and after recovery.
- A testing plan: game days, chaos drills, and validation queries.
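As a concrete example of the kind of validation query the last two bullets refer to, a post-recovery completeness check often compares per-hour distinct event counts between the immutable S3 raw zone and the curated Snowflake table. A sketch, with paths, table names, and the recovery window all assumed:

```python
from pyspark.sql import functions as F

# Per-hour counts from the immutable S3 raw zone (source of truth for completeness).
raw_counts = (
    spark.read.parquet("s3://payments-raw/landing/")
    .where(F.col("event_ts").between("2024-05-01", "2024-05-02"))   # recovery window is an example
    .groupBy(F.date_trunc("hour", "event_ts").alias("hour"))
    .agg(F.countDistinct("event_id").alias("raw_events"))
)

# Matching counts from the curated Snowflake table, read back via the connector.
curated_counts = (
    spark.read.format("snowflake").options(**sf_options)
    .option("query", """
        SELECT DATE_TRUNC('hour', event_ts) AS hour,
               COUNT(DISTINCT event_id)     AS curated_events
        FROM CURATED.PAYMENT_EVENTS
        WHERE event_ts BETWEEN '2024-05-01' AND '2024-05-02'
        GROUP BY 1
    """)
    .load()
)

# Any nonzero delta indicates loss (raw > curated) or duplication slipping past dedup.
drift = (
    raw_counts.join(curated_counts, "hour", "full_outer")
    .withColumn("delta", F.coalesce("raw_events", F.lit(0)) - F.coalesce("curated_events", F.lit(0)))
    .where(F.col("delta") != 0)
)
drift.show(truncate=False)
```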