Context
You’re interviewing for a Senior Data Engineer role at PayWave, a fintech that provides card issuing and real-time fraud detection for mid-market banks. PayWave processes authorization, capture, refund, and chargeback events from card networks and bank partners. These events power: (1) real-time fraud models, (2) customer-facing transaction timelines in mobile apps, and (3) regulatory reporting (SOX + PCI-related audit trails).
Today, the data platform uses a mix of partner SFTP drops (hourly CSVs) and a “best-effort” event bus. The analytics warehouse (Snowflake) is updated every 2–4 hours via batch Spark jobs. This has caused multiple incidents: delayed fraud features (spiking false positives), missing transactions in customer apps (driving support tickets), and incomplete audit extracts during a regulator inquiry. Leadership has mandated a redesign focused on high availability and operational resilience without sacrificing correctness.
Current Architecture (Problems)
| Layer | Current Tech | Problem |
|---|---|---|
| Ingestion | Partner SFTP + ad-hoc HTTP collectors | No consistent retry semantics; partial files; weak observability |
| Streaming | Single Kafka cluster in one region | Region/AZ outages cause multi-hour gaps |
| Processing | Spark batch on EMR every 2 hours | Stale data; reprocessing is manual and error-prone |
| Storage | Snowflake raw + curated | Loads fail silently; duplicates appear after retries |
| Orchestration | Airflow DAGs with cron schedules | No event-driven triggers; backfills overload clusters |
You are asked to propose a complete pipeline design that optimizes for high availability across ingestion, processing, storage, orchestration, and data quality.
Scale Requirements
- Throughput: average 120K events/sec, peak 450K events/sec during shopping holidays
- Event size: ~1.2 KB JSON (after normalization)
- Daily volume: ~8–35 TB/day raw depending on peak season
- Freshness SLO: P95 < 2 minutes from event creation to queryable in Snowflake curated tables
- Availability SLO: 99.95% for ingestion and processing (planned maintenance excluded)
- RPO/RTO targets:
  - RPO (data loss): 0 for committed events
  - RTO (resume processing after AZ loss): < 10 minutes
- Retention: 30 days hot (Kafka), 2 years in object storage (immutable), 7 years for audit extracts
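Before committing to a topology, it helps to sanity-check these numbers. A back-of-envelope sizing sketch in Python (the per-partition throughput target and headroom factor are assumptions, not requirements):

```python
# Back-of-envelope sizing against the stated scale numbers.
# Assumptions (not from the requirements): ~10 MB/s sustained per Kafka
# partition as a conservative planning target, and 2x headroom over peak.

AVG_EPS = 120_000          # average events/sec
PEAK_EPS = 450_000         # peak events/sec
EVENT_BYTES = 1.2 * 1024   # ~1.2 KB normalized JSON

avg_mb_s = AVG_EPS * EVENT_BYTES / 1024**2           # ~140 MB/s average
peak_mb_s = PEAK_EPS * EVENT_BYTES / 1024**2         # ~527 MB/s peak
daily_tb = AVG_EPS * EVENT_BYTES * 86_400 / 1024**4  # ~11.6 TB/day average

PER_PARTITION_MB_S = 10    # assumed conservative per-partition target
HEADROOM = 2.0             # assumed safety factor over peak

partitions = int(peak_mb_s * HEADROOM / PER_PARTITION_MB_S) + 1

print(f"avg {avg_mb_s:.0f} MB/s, peak {peak_mb_s:.0f} MB/s, ~{daily_tb:.1f} TB/day")
print(f"suggested partition count (order of magnitude): {partitions}")
```

At roughly 140 MB/s average and ~530 MB/s peak, a partition count in the low hundreds leaves room to scale consumers without repartitioning under load.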
Data Characteristics
Core event schema (simplified)
- event_id (UUID, globally unique)
- event_type (AUTH|CAPTURE|REFUND|CHARGEBACK)
- event_ts (event time from issuer/network)
- ingest_ts (platform ingest time)
- account_id, card_id, merchant_id
- amount, currency
- network (VISA|MC|AMEX)
- status (APPROVED|DECLINED|REVERSED)
- attributes (map/JSON)
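One way to make this schema concrete for the processing layer is a Spark StructType; a sketch, with types assumed where the prompt is silent (amount precision, attributes as a string-to-string map):

```python
from pyspark.sql.types import (
    StructType, StructField, StringType, TimestampType, DecimalType, MapType
)

# Sketch of the core event schema above; types are assumptions where the
# prompt does not specify them.
event_schema = StructType([
    StructField("event_id", StringType(), nullable=False),      # UUID, globally unique
    StructField("event_type", StringType(), nullable=False),    # AUTH|CAPTURE|REFUND|CHARGEBACK
    StructField("event_ts", TimestampType(), nullable=False),   # event time from issuer/network
    StructField("ingest_ts", TimestampType(), nullable=False),  # platform ingest time
    StructField("account_id", StringType(), nullable=False),
    StructField("card_id", StringType(), nullable=False),
    StructField("merchant_id", StringType(), nullable=False),
    StructField("amount", DecimalType(18, 2), nullable=False),
    StructField("currency", StringType(), nullable=False),
    StructField("network", StringType(), nullable=False),       # VISA|MC|AMEX
    StructField("status", StringType(), nullable=False),        # APPROVED|DECLINED|REVERSED
    StructField("attributes", MapType(StringType(), StringType()), nullable=True),
])
```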
Known quality issues
- Late-arriving events: up to 24 hours late for chargebacks; 5–30 minutes late for captures
- Duplicates: partners retry with the same event_id; sometimes they resend with a different event_id but the same business keys
- Out-of-order: a capture can arrive before its corresponding auth in rare cases
- Schema evolution: partners add fields without notice
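The “different event_id, same business keys” duplicates mean event_id alone cannot be the dedupe key. A sketch of a derived business key plus a batch reconciliation pass (the specific key columns are an assumption to confirm with partners):

```python
from pyspark.sql import DataFrame
from pyspark.sql.window import Window
import pyspark.sql.functions as F

# Derived business key for "same transaction, different event_id" duplicates.
# The column choice is an assumption, not part of the prompt: event_id handles
# exact partner retries, the business key handles re-sends under a new ID.
BUSINESS_KEY_COLS = ["account_id", "card_id", "merchant_id",
                     "event_type", "amount", "currency", "event_ts"]

def with_business_key(events: DataFrame) -> DataFrame:
    return events.withColumn(
        "business_key",
        F.sha2(F.concat_ws("||", *[F.col(c).cast("string") for c in BUSINESS_KEY_COLS]), 256),
    )

def reconcile_duplicates(events: DataFrame) -> DataFrame:
    # Batch reconciliation (e.g. daily): keep the earliest ingest per business
    # key; the streaming path can dedupe on event_id alone with a watermark.
    w = Window.partitionBy("business_key").orderBy(F.col("ingest_ts").asc())
    return (with_business_key(events)
            .withColumn("rn", F.row_number().over(w))
            .filter(F.col("rn") == 1)
            .drop("rn"))
```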
Requirements
Functional requirements
- Ingest events from partner APIs and internal services into a durable log with at-least-once delivery.
- Provide exactly-once results (or effectively-once) in curated Snowflake tables (no double-counted transactions).
- Support late data and retractions/corrections (e.g., reversals) with deterministic outcomes.
- Enable replay/backfill for any time range in the last 2 years without corrupting downstream tables.
- Maintain an immutable raw/audit trail suitable for investigations.
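For the no-double-counting and safe-replay requirements, one common pattern is to land micro-batches in a staging table and MERGE into the curated table keyed on event_id, so retries and backfills are idempotent. A sketch (table and column names are assumptions):

```python
# Idempotent load sketch: micro-batches land in a staging table, then a MERGE
# keyed on event_id upserts into the curated table, so retries, replays, and
# backfills cannot double-count. Table and column names are assumptions.
MERGE_SQL = """
MERGE INTO curated.transactions AS t
USING staging.transactions_batch AS s
  ON t.event_id = s.event_id
WHEN MATCHED AND s.ingest_ts > t.ingest_ts THEN UPDATE SET
  status = s.status, amount = s.amount, ingest_ts = s.ingest_ts
WHEN NOT MATCHED THEN INSERT
  (event_id, event_type, event_ts, ingest_ts, account_id, card_id,
   merchant_id, amount, currency, network, status)
VALUES
  (s.event_id, s.event_type, s.event_ts, s.ingest_ts, s.account_id, s.card_id,
   s.merchant_id, s.amount, s.currency, s.network, s.status)
"""

def merge_staged_batch(conn) -> None:
    """Run the MERGE on a Snowflake connection; safe to re-run on retries,
    since matched rows only move forward in ingest_ts."""
    cur = conn.cursor()
    try:
        cur.execute(MERGE_SQL)
    finally:
        cur.close()
```

The WHEN MATCHED guard lets later corrections (e.g. reversals) win deterministically while older replayed rows are ignored.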
Non-functional requirements (high availability focus)
- Survive single AZ failure without manual intervention.
- Minimize blast radius: isolate ingestion, stream processing, and warehouse loading failures.
- Provide strong observability: end-to-end latency, consumer lag, data quality, and cost.
- Secure data: PCI-adjacent controls (encryption, least privilege, audit logs).
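For the single-AZ-failure requirement specifically, the usual Kafka levers are a rack-aware (three-AZ) cluster, replication factor 3, min.insync.replicas=2, and acks=all producers. A topic-creation sketch using confluent-kafka's admin client (topic name, broker addresses, and partition count are illustrative assumptions):

```python
from confluent_kafka.admin import AdminClient, NewTopic

# Topic settings aimed at surviving a single-AZ loss: brokers spread across
# three AZs via broker.rack, replication factor 3, and min.insync.replicas=2
# so one AZ can disappear while acks=all producers keep committing.
admin = AdminClient({"bootstrap.servers": "kafka-a:9092,kafka-b:9092,kafka-c:9092"})

auth_events = NewTopic(
    "payments.auth_events.v1",      # assumed topic name
    num_partitions=120,             # assumed, see the sizing sketch above
    replication_factor=3,
    config={
        "min.insync.replicas": "2",                   # tolerate one replica (AZ) down
        "retention.ms": str(30 * 24 * 3600 * 1000),   # 30 days hot, per the retention target
        "cleanup.policy": "delete",
    },
)

futures = admin.create_topics([auth_events])
for topic, fut in futures.items():
    fut.result()  # raises if topic creation failed
```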
Constraints
- Cloud: AWS primary; Snowflake is already the enterprise warehouse.
- Team: 6 data engineers, 1 SRE; strong Spark/Airflow skills, moderate Kafka experience.
- Budget: +$60K/month incremental spend is acceptable if justified.
- Compliance: must keep immutable raw logs; must be able to produce an audit extract within 24 hours.
Interview Task
Design the end-to-end pipeline and explain specific strategies to optimize for high availability. Your answer should cover:
- Kafka topology and replication strategy (AZ/region), partitioning, retention, and schema governance
- Stream processing design (Spark Structured Streaming): checkpoints, watermarking, dedupe, and idempotent sinks
- Snowflake loading pattern (Snowpipe, Streams + Tasks, or an alternative) and how you avoid duplicates
- Orchestration strategy (Airflow): event-driven triggers, backfills, and safe retries
- Data quality gates and how they interact with availability (fail-open vs fail-closed)
- Monitoring/alerting and incident response runbooks
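To anchor the discussion, here is one shape the Structured Streaming piece of an answer could take: checkpointed Kafka reads, watermarked dedupe, and a foreachBatch sink that stages files for an idempotent MERGE. It is a sketch under assumed names and intervals, not a prescribed solution:

```python
from pyspark.sql import SparkSession, DataFrame
import pyspark.sql.functions as F

# DDL-style schema for from_json; mirrors the core event schema above.
EVENT_DDL = (
    "event_id STRING, event_type STRING, event_ts TIMESTAMP, ingest_ts TIMESTAMP, "
    "account_id STRING, card_id STRING, merchant_id STRING, amount DECIMAL(18,2), "
    "currency STRING, network STRING, status STRING, attributes MAP<STRING,STRING>"
)

spark = SparkSession.builder.appName("paywave-curated-loader").getOrCreate()

# Kafka source; bootstrap servers and topic are placeholders.
raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "kafka-a:9092,kafka-b:9092,kafka-c:9092")
    .option("subscribe", "payments.auth_events.v1")
    .option("startingOffsets", "latest")
    .load()
)

events = (
    raw.select(F.from_json(F.col("value").cast("string"), EVENT_DDL).alias("e"))
    .select("e.*")
    .withWatermark("event_ts", "30 minutes")    # assumed lateness bound for the streaming path
    .dropDuplicates(["event_id", "event_ts"])   # include the watermark column so dedupe state is evicted
)

def stage_batch(batch_df: DataFrame, batch_id: int) -> None:
    # foreachBatch plus the checkpoint gives effectively-once staging writes:
    # a retried batch overwrites its own path, and the downstream MERGE on
    # event_id keeps the curated table idempotent. The S3 paths are assumptions.
    batch_df.write.mode("overwrite").parquet(
        f"s3://paywave-staging/curated/batch_id={batch_id}/"
    )

query = (
    events.writeStream
    .foreachBatch(stage_batch)
    .option("checkpointLocation", "s3://paywave-checkpoints/curated-loader/")  # assumed bucket
    .trigger(processingTime="30 seconds")
    .start()
)
query.awaitTermination()
```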
Be explicit about trade-offs (cost vs availability, consistency vs latency) and how you’d test failure scenarios (AZ loss, broker loss, schema breaking change, warehouse outage).