Context
You’re interviewing for a Senior Data Engineer role at PayWave, a fintech that provides card issuing and real-time fraud detection for mid-market banks. PayWave processes authorization, capture, refund, and chargeback events from card networks and bank partners. These events power: (1) real-time fraud models, (2) customer-facing transaction timelines in mobile apps, and (3) regulatory reporting (SOX + PCI-related audit trails).
Today, the data platform uses a mix of partner SFTP drops (hourly CSVs) and a “best-effort” event bus. The analytics warehouse (Snowflake) is updated every 2–4 hours via batch Spark jobs. This has caused multiple incidents: delayed fraud features (spiking false positives), missing transactions in customer apps (driving support tickets), and incomplete audit extracts during a regulator inquiry. Leadership has mandated a redesign focused on high availability and operational resilience without sacrificing correctness.
Current Architecture (Problems)
| Layer | Current Tech | Problem |
|---|---|---|
| Ingestion | Partner SFTP + ad-hoc HTTP collectors | No consistent retry semantics; partial files; weak observability |
| Streaming | Single Kafka cluster in one region | Region/AZ outages cause multi-hour gaps |
| Processing | Spark batch on EMR every 2 hours | Stale data; reprocessing is manual and error-prone |
| Storage | Snowflake raw + curated | Loads fail silently; duplicates appear after retries |
| Orchestration | Airflow DAGs with cron schedules | No event-driven triggers; backfills overload clusters |
You are asked to propose a complete pipeline design that optimizes for high availability across ingestion, processing, storage, orchestration, and data quality.
Scale Requirements
- Throughput: average 120K events/sec, peak 450K events/sec during shopping holidays
- Event size: ~1.2 KB JSON (after normalization)
- Daily volume: ~8–35 TB/day raw depending on peak season
- Freshness SLO: P95 < 2 minutes from event creation to queryable in Snowflake curated tables
- Availability SLO: 99.95% for ingestion and processing (planned maintenance excluded)
- RPO/RTO targets:
  - RPO (data loss): 0 for committed events
  - RTO (resume processing after AZ loss): < 10 minutes
- Retention: 30 days hot (Kafka), 2 years in object storage (immutable), 7 years for audit extracts
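Before committing to a topology, it helps to sanity-check these numbers. A back-of-envelope sizing sketch in Python (the per-partition throughput target and headroom factor are assumptions, not requirements):

```python
# Back-of-envelope sizing against the stated scale numbers.
# Assumptions (not from the requirements): ~10 MB/s sustained per Kafka
# partition as a conservative planning target, and 2x headroom over peak.

AVG_EPS = 120_000          # average events/sec
PEAK_EPS = 450_000         # peak events/sec
EVENT_BYTES = 1.2 * 1024   # ~1.2 KB normalized JSON

avg_mb_s = AVG_EPS * EVENT_BYTES / 1024**2           # ~140 MB/s average
peak_mb_s = PEAK_EPS * EVENT_BYTES / 1024**2         # ~527 MB/s peak
daily_tb = AVG_EPS * EVENT_BYTES * 86_400 / 1024**4  # ~11.6 TB/day average

PER_PARTITION_MB_S = 10    # assumed conservative per-partition target
HEADROOM = 2.0             # assumed safety factor over peak

partitions = int(peak_mb_s * HEADROOM / PER_PARTITION_MB_S) + 1

print(f"avg {avg_mb_s:.0f} MB/s, peak {peak_mb_s:.0f} MB/s, ~{daily_tb:.1f} TB/day")
print(f"suggested partition count (order of magnitude): {partitions}")
```

At roughly 140 MB/s average and ~530 MB/s peak, a partition count in the low hundreds leaves room to scale consumers without repartitioning under load.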
Data Characteristics
Core event schema (simplified)
- event_id (UUID, globally unique)
- event_type (AUTH|CAPTURE|REFUND|CHARGEBACK)
- event_ts (event time from issuer/network)
- ingest_ts (platform ingest time)
- account_id, card_id, merchant_id
- amount, currency
- network (VISA|MC|AMEX)
- status (APPROVED|DECLINED|REVERSED)
- attributes (map/JSON)
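One way to make this schema concrete for the processing layer is a Spark StructType; a sketch, with types assumed where the prompt is silent (amount precision, attributes as a string-to-string map):

```python
from pyspark.sql.types import (
    StructType, StructField, StringType, TimestampType, DecimalType, MapType
)

# Sketch of the core event schema above; types are assumptions where the
# prompt does not specify them.
event_schema = StructType([
    StructField("event_id", StringType(), nullable=False),      # UUID, globally unique
    StructField("event_type", StringType(), nullable=False),    # AUTH|CAPTURE|REFUND|CHARGEBACK
    StructField("event_ts", TimestampType(), nullable=False),   # event time from issuer/network
    StructField("ingest_ts", TimestampType(), nullable=False),  # platform ingest time
    StructField("account_id", StringType(), nullable=False),
    StructField("card_id", StringType(), nullable=False),
    StructField("merchant_id", StringType(), nullable=False),
    StructField("amount", DecimalType(18, 2), nullable=False),
    StructField("currency", StringType(), nullable=False),
    StructField("network", StringType(), nullable=False),       # VISA|MC|AMEX
    StructField("status", StringType(), nullable=False),        # APPROVED|DECLINED|REVERSED
    StructField("attributes", MapType(StringType(), StringType()), nullable=True),
])
```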
Known quality issues
- Late-arriving events: up to 24 hours late for chargebacks; 5–30 minutes late for captures
- Duplicates: partners retry with the same event_id; sometimes they resend with a different event_id but the same business keys
- Out-of-order: a capture can arrive before its corresponding auth in rare cases
- Schema evolution: partners add fields without notice
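The “different event_id, same business keys” duplicates mean event_id alone cannot be the dedupe key. A sketch of a derived business key plus a batch reconciliation pass (the specific key columns are an assumption to confirm with partners):

```python
from pyspark.sql import DataFrame
from pyspark.sql.window import Window
import pyspark.sql.functions as F

# Derived business key for "same transaction, different event_id" duplicates.
# The column choice is an assumption, not part of the prompt: event_id handles
# exact partner retries, the business key handles re-sends under a new ID.
BUSINESS_KEY_COLS = ["account_id", "card_id", "merchant_id",
                     "event_type", "amount", "currency", "event_ts"]

def with_business_key(events: DataFrame) -> DataFrame:
    return events.withColumn(
        "business_key",
        F.sha2(F.concat_ws("||", *[F.col(c).cast("string") for c in BUSINESS_KEY_COLS]), 256),
    )

def reconcile_duplicates(events: DataFrame) -> DataFrame:
    # Batch reconciliation (e.g. daily): keep the earliest ingest per business
    # key; the streaming path can dedupe on event_id alone with a watermark.
    w = Window.partitionBy("business_key").orderBy(F.col("ingest_ts").asc())
    return (with_business_key(events)
            .withColumn("rn", F.row_number().over(w))
            .filter(F.col("rn") == 1)
            .drop("rn"))
```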
Requirements
Functional requirements
- Ingest events from partner APIs and internal services into a durable log with at-least-once delivery.
- Provide exactly-once results (or effectively-once) in curated Snowflake tables (no double-counted transactions).
- Support late data and retractions/corrections (e.g., reversals) with deterministic outcomes.
- Enable replay/backfill for any time range in the last 2 years without corrupting downstream tables.
- Maintain an immutable raw/audit trail suitable for investigations.
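For the no-double-counting and safe-replay requirements, one common pattern is to land micro-batches in a staging table and MERGE into the curated table keyed on event_id, so retries and backfills are idempotent. A sketch (table and column names are assumptions):

```python
# Idempotent load sketch: micro-batches land in a staging table, then a MERGE
# keyed on event_id upserts into the curated table, so retries, replays, and
# backfills cannot double-count. Table and column names are assumptions.
MERGE_SQL = """
MERGE INTO curated.transactions AS t
USING staging.transactions_batch AS s
  ON t.event_id = s.event_id
WHEN MATCHED AND s.ingest_ts > t.ingest_ts THEN UPDATE SET
  status = s.status, amount = s.amount, ingest_ts = s.ingest_ts
WHEN NOT MATCHED THEN INSERT
  (event_id, event_type, event_ts, ingest_ts, account_id, card_id,
   merchant_id, amount, currency, network, status)
VALUES
  (s.event_id, s.event_type, s.event_ts, s.ingest_ts, s.account_id, s.card_id,
   s.merchant_id, s.amount, s.currency, s.network, s.status)
"""

def merge_staged_batch(conn) -> None:
    """Run the MERGE on a Snowflake connection; safe to re-run on retries,
    since matched rows only move forward in ingest_ts."""
    cur = conn.cursor()
    try:
        cur.execute(MERGE_SQL)
    finally:
        cur.close()
```

The WHEN MATCHED guard lets later corrections (e.g. reversals) win deterministically while older replayed rows are ignored.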
Non-functional requirements (high availability focus)
- Survive single AZ failure without manual intervention.
- Minimize blast radius: isolate ingestion, stream processing, and warehouse loading failures.
- Provide strong observability: end-to-end latency, consumer lag, data quality, and cost.
- Secure data: PCI-adjacent controls (encryption, least privilege, audit logs).
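For the single-AZ-failure requirement specifically, the usual Kafka levers are a rack-aware (three-AZ) cluster, replication factor 3, min.insync.replicas=2, and acks=all producers. A topic-creation sketch using confluent-kafka's admin client (topic name, broker addresses, and partition count are illustrative assumptions):

```python
from confluent_kafka.admin import AdminClient, NewTopic

# Topic settings aimed at surviving a single-AZ loss: brokers spread across
# three AZs via broker.rack, replication factor 3, and min.insync.replicas=2
# so one AZ can disappear while acks=all producers keep committing.
admin = AdminClient({"bootstrap.servers": "kafka-a:9092,kafka-b:9092,kafka-c:9092"})

auth_events = NewTopic(
    "payments.auth_events.v1",      # assumed topic name
    num_partitions=120,             # assumed, see the sizing sketch above
    replication_factor=3,
    config={
        "min.insync.replicas": "2",                   # tolerate one replica (AZ) down
        "retention.ms": str(30 * 24 * 3600 * 1000),   # 30 days hot, per the retention target
        "cleanup.policy": "delete",
    },
)

futures = admin.create_topics([auth_events])
for topic, fut in futures.items():
    fut.result()  # raises if topic creation failed
```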
Constraints
- Cloud: AWS primary; Snowflake is already the enterprise warehouse.
- Team: 6 data engineers, 1 SRE; strong Spark/Airflow skills, moderate Kafka experience.
- Budget: +$60K/month incremental spend is acceptable if justified.
- Compliance: must keep immutable raw logs; must be able to produce an audit extract within 24 hours.
Interview Task
Design the end-to-end pipeline and explain specific strategies to optimize for high availability. Your answer should cover:
- Kafka topology and replication strategy (AZ/region), partitioning, retention, and schema governance
- Stream processing design (Spark Structured Streaming): checkpoints, watermarking, dedupe, and idempotent sinks
- Snowflake loading pattern (Snowpipe, Streams + Tasks, or an alternative) and how you avoid duplicates
- Orchestration strategy (Airflow): event-driven triggers, backfills, and safe retries
- Data quality gates and how they interact with availability (fail-open vs fail-closed)
- Monitoring/alerting and incident response runbooks
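To anchor the discussion, here is one shape the Structured Streaming piece of an answer could take: checkpointed Kafka reads, watermarked dedupe, and a foreachBatch sink that stages files for an idempotent MERGE. It is a sketch under assumed names and intervals, not a prescribed solution:

```python
from pyspark.sql import SparkSession, DataFrame
import pyspark.sql.functions as F

# DDL-style schema for from_json; mirrors the core event schema above.
EVENT_DDL = (
    "event_id STRING, event_type STRING, event_ts TIMESTAMP, ingest_ts TIMESTAMP, "
    "account_id STRING, card_id STRING, merchant_id STRING, amount DECIMAL(18,2), "
    "currency STRING, network STRING, status STRING, attributes MAP<STRING,STRING>"
)

spark = SparkSession.builder.appName("paywave-curated-loader").getOrCreate()

# Kafka source; bootstrap servers and topic are placeholders.
raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "kafka-a:9092,kafka-b:9092,kafka-c:9092")
    .option("subscribe", "payments.auth_events.v1")
    .option("startingOffsets", "latest")
    .load()
)

events = (
    raw.select(F.from_json(F.col("value").cast("string"), EVENT_DDL).alias("e"))
    .select("e.*")
    .withWatermark("event_ts", "30 minutes")    # assumed lateness bound for the streaming path
    .dropDuplicates(["event_id", "event_ts"])   # include the watermark column so dedupe state is evicted
)

def stage_batch(batch_df: DataFrame, batch_id: int) -> None:
    # foreachBatch plus the checkpoint gives effectively-once staging writes:
    # a retried batch overwrites its own path, and the downstream MERGE on
    # event_id keeps the curated table idempotent. The S3 paths are assumptions.
    batch_df.write.mode("overwrite").parquet(
        f"s3://paywave-staging/curated/batch_id={batch_id}/"
    )

query = (
    events.writeStream
    .foreachBatch(stage_batch)
    .option("checkpointLocation", "s3://paywave-checkpoints/curated-loader/")  # assumed bucket
    .trigger(processingTime="30 seconds")
    .start()
)
query.awaitTermination()
```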
Be explicit about trade-offs (cost vs availability, consistency vs latency) and how you’d test failure scenarios (AZ loss, broker loss, schema breaking change, warehouse outage).