Context
You’re joining the Data Platform team at MercuryPay, a fintech payment processor that supports ~8M consumer wallets and ~220K merchants across North America. MercuryPay’s Risk & Reconciliation org uses analytics in Snowflake to reconcile payouts, detect double-charges, and satisfy SOX audit requirements. A critical upstream dependency is a third-party Payments Events API (owned by a banking partner) that exposes transaction lifecycle events (authorized, captured, refunded, chargeback, dispute updates).
Today, MercuryPay runs an hourly Airflow DAG that calls the partner API and lands JSON into S3, then a Spark job normalizes it into Snowflake. The API is notoriously unreliable: it frequently times out, sometimes returns HTTP 504 after having processed the request, and occasionally returns partial pages. This has caused duplicate events and missing events in downstream tables, leading to reconciliation drift and manual finance escalations.
Your task is to design a complete idempotency strategy for this ingestion pipeline so that retries, backfills, and partial failures do not create duplicates or gaps, while keeping the pipeline operationally manageable.
Scale Requirements
- Event volume: ~120M events/day average, 300M/day peak (promotions + seasonal spikes)
- API limits: 1,000 requests/min per API key; max page size 5,000 records (a back-of-the-envelope check follows this list)
- Latency SLO: 99% of events queryable in Snowflake within 15 minutes of partner availability
- Backfill: Must support reprocessing up to 30 days of history without corrupting downstream state
- Storage: Raw retention 90 days in S3; curated retention 7 years in Snowflake (audit)
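
As a quick back-of-the-envelope check (a sketch using only the figures above; the variable names are illustrative), the rate limit is not the bottleneck even at peak:

```python
# Back-of-the-envelope capacity check using the peak figures above
# (illustrative only; not a sizing guarantee).
PEAK_EVENTS_PER_DAY = 300_000_000
PAGE_SIZE = 5_000                 # max records per page
RATE_LIMIT_PER_MIN = 1_000        # requests/min per API key

pages_per_day = PEAK_EVENTS_PER_DAY / PAGE_SIZE                      # 60,000 pages/day
minutes_of_quota_per_day = pages_per_day / RATE_LIMIT_PER_MIN        # ~60 min of quota/day
pages_per_hourly_run = pages_per_day / 24                            # ~2,500 pages/run
minutes_per_hourly_run = pages_per_hourly_run / RATE_LIMIT_PER_MIN   # ~2.5 min/run

print(f"{pages_per_day:,.0f} pages/day -> {minutes_of_quota_per_day:.0f} min of daily quota; "
      f"{pages_per_hourly_run:,.0f} pages per hourly run (~{minutes_per_hourly_run:.1f} min)")
```

Even at peak, an hourly run needs only a few minutes of rate-limit budget, which leaves headroom for bounded retries and 30-day backfills.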
Data Characteristics
The partner API supports two access patterns:
- Cursor pagination (a retry-aware fetch sketch follows this list):
  `GET /events?updated_since=<ts>&page_size=5000&cursor=<token>`
- Event lookup:
  `GET /events/<event_id>`
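
As a sketch of how a retry-aware fetch against the cursor endpoint might look (the host, retry budget, and response shape below are assumptions, not partner documentation):

```python
import time
from typing import Optional

import requests

BASE_URL = "https://partner-api.example.com"   # placeholder host
MAX_ATTEMPTS = 5                               # bounded retry budget per page
BACKOFF_SECONDS = [1, 2, 4, 8, 16]             # simple exponential backoff

def fetch_page(session: requests.Session, updated_since: str,
               cursor: Optional[str]) -> dict:
    """Fetch one page, retrying with the SAME cursor on ambiguous failures.

    A 504 that the partner actually processed means the retry can return a
    duplicate page, so deduplication has to happen downstream.
    """
    params = {"updated_since": updated_since, "page_size": 5000}
    if cursor:
        params["cursor"] = cursor
    for attempt in range(MAX_ATTEMPTS):
        try:
            resp = session.get(f"{BASE_URL}/events", params=params, timeout=30)
            if resp.status_code in (429, 500, 502, 503, 504):
                raise requests.HTTPError(f"retryable status {resp.status_code}")
            resp.raise_for_status()
            return resp.json()  # assumed shape: {"events": [...], "next_cursor": "..."}
        except (requests.Timeout, requests.ConnectionError, requests.HTTPError):
            if attempt == MAX_ATTEMPTS - 1:
                raise           # hand the failure back to Airflow once the budget is spent
            time.sleep(BACKOFF_SECONDS[attempt])
```

Because the partner provides no reliable snapshot version, re-requesting the same cursor is the only safe retry; the duplicate pages it can produce are handled in later layers.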
Each event payload (JSON) includes the following fields; an illustrative example follows the table:
| Field | Type | Notes |
|---|---|---|
| event_id | string | Unique within an environment, but values can collide across environments (sandbox vs. prod) |
| transaction_id | string | Stable identifier for a payment transaction |
| event_type | string | e.g., AUTHORIZED, CAPTURED, REFUNDED, CHARGEBACK_OPENED |
| event_version | int | Monotonic per (transaction_id, event_type) but can skip numbers |
| updated_at | timestamp | Partner’s last update time; can arrive out of order by up to 2 hours |
| occurred_at | timestamp | Business time of the event |
| payload | object | Nested fields; schema evolves monthly |
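
For illustration only, a payload matching these fields might look like the dict below (every value is fabricated; note that no environment field is documented, which is why event_id collisions across sandbox and prod must be handled by the pipeline):

```python
# Illustrative payload shape inferred from the table above; all values are fabricated.
example_event = {
    "event_id": "evt_01HF3Z8K2Q",
    "transaction_id": "txn_9f2c77",
    "event_type": "CAPTURED",
    "event_version": 3,                      # monotonic per (transaction_id, event_type)
    "updated_at": "2025-03-14T17:22:05Z",    # partner's last update time
    "occurred_at": "2025-03-14T17:21:58Z",   # business time of the event
    "payload": {                             # nested; schema evolves monthly
        "amount": {"value": "42.10", "currency": "USD"},
        "card": {"network": "VISA", "last4": "4242"},
    },
}
```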
Known quality issues:
- Timeouts and ambiguous failures (did the partner process the request?)
- Duplicate pages when retrying with the same cursor (a fingerprint sketch follows this list)
- Out-of-order delivery and late-arriving updates (e.g., dispute status changes)
- Schema evolution: new nested fields appear without notice
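
One way (a sketch, not the only answer) to make retried and duplicated pages harmless is to stamp each raw record with a deterministic fingerprint at landing time; the environment value below is assumed to come from pipeline configuration, since the payload carries no such field:

```python
import hashlib
import json

def event_fingerprint(event: dict, environment: str) -> str:
    """Deterministic idempotency key for a raw event (sketch).

    `environment` (e.g. "prod") is assumed to come from pipeline config, which
    sidesteps sandbox/prod event_id collisions. Including event_version and
    updated_at lets a genuinely newer copy of the same event land as a new raw
    row, while byte-identical retries hash to the same key and can be dropped.
    """
    key_material = json.dumps(
        {
            "environment": environment,
            "event_id": event["event_id"],
            "event_version": event.get("event_version"),
            "updated_at": event.get("updated_at"),
        },
        sort_keys=True,
    )
    return hashlib.sha256(key_material.encode("utf-8")).hexdigest()
```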
Requirements
Functional
- Idempotent ingestion: Multiple retries of the same logical fetch must not create duplicate records in raw or curated tables.
- Exactly-once effect in Snowflake (in practice): downstream tables must converge to the correct state even under at-least-once ingestion (a MERGE sketch follows this list).
- Support late-arriving data: updates up to 48 hours late must be incorporated.
- Backfill safety: re-running any historical window (1 hour to 30 days) must be safe and must not require manual cleanup.
- Auditability: keep immutable raw history and a clear lineage from raw → curated; be able to answer “why is this row here?”
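
To make the exactly-once effect concrete: one possible curated grain is (environment, transaction_id, event_type), upserted with a MERGE that only overwrites when the incoming row is newer, so replaying any window converges to the same state. All table and column names below are placeholders, shown via the Snowflake Python connector:

```python
import snowflake.connector  # snowflake-connector-python

# Placeholder table/column names; a sketch of an idempotent incremental upsert.
MERGE_SQL = """
MERGE INTO curated.payment_event_facts AS tgt
USING staging.payment_events_batch AS src
  ON  tgt.environment    = src.environment
  AND tgt.transaction_id = src.transaction_id
  AND tgt.event_type     = src.event_type
WHEN MATCHED AND (src.event_version > tgt.event_version
                  OR (src.event_version = tgt.event_version
                      AND src.updated_at > tgt.updated_at))
  THEN UPDATE SET
    event_version = src.event_version,
    updated_at    = src.updated_at,
    occurred_at   = src.occurred_at,
    payload       = src.payload
WHEN NOT MATCHED THEN
  INSERT (environment, transaction_id, event_type, event_version,
          updated_at, occurred_at, payload)
  VALUES (src.environment, src.transaction_id, src.event_type, src.event_version,
          src.updated_at, src.occurred_at, src.payload)
"""

def merge_batch(conn: "snowflake.connector.SnowflakeConnection") -> None:
    # Replaying the same staged batch is a no-op: matched rows are updated only
    # when the source is strictly newer, so the target converges no matter how
    # many times a window is re-run.
    cur = conn.cursor()
    try:
        cur.execute(MERGE_SQL)
    finally:
        cur.close()
```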
Non-functional
- Operational resilience: handle partner brownouts without paging the on-call every hour; retries must be bounded and observable (an Airflow retry-config sketch follows this list).
- Cost control: avoid full-table rewrites in Snowflake; prefer incremental MERGE patterns.
- Security & compliance: PII in payload must be encrypted at rest; access controlled; changes logged (SOX).
Constraints
- Existing stack: Airflow 2.x, AWS (S3, KMS, CloudWatch), Spark 3.x on EMR, Snowflake, dbt Core.
- Team: 5 data engineers, on-call rotation; minimal appetite for running new stateful services.
- Partner API does not guarantee exactly-once delivery and does not provide a reliable “snapshot version.”
What You Need to Produce (Interview Deliverables)
In your design, explicitly address:
- Idempotency keys: What is the unique key for raw events vs curated facts? How do you handle environment collisions and schema changes?
- State management: How do you track progress (watermarks/cursors) safely across retries and concurrent runs?
- Retry semantics: How do you retry timeouts without duplicating pages? How do you detect partial page ingestion?
- Storage layers: What do you store in S3 (raw) vs Snowflake (staging vs curated)? What is immutable?
- Deduplication strategy: Where do you dedupe (Spark, Snowflake MERGE, both)? What are the trade-offs? (A Spark-side sketch follows this list.)
- Late data and backfills: How do you re-run a window and ensure convergence?
- Data quality: What checks prevent silent data loss (missing pages, gaps in updated_at coverage)?
- Observability: What metrics and alerts prove the pipeline is correct and progressing?
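
On the deduplication question, one defensible split is to collapse exact retry/page duplicates and rank versions in Spark before loading staging, then let a Snowflake MERGE (as sketched earlier) handle cross-batch convergence. A PySpark sketch of the Spark-side half, assuming an environment column and ingest_fingerprint were added at landing:

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

# Sketch: within one batch, keep a single row per logical event, preferring the
# newest version; cross-batch idempotency is still handled by the MERGE.
spark = SparkSession.builder.appName("dedupe_partner_events").getOrCreate()

raw = spark.read.json("s3://mercurypay-raw/partner_events/run_id=<run>/")  # placeholder path

version_rank = Window.partitionBy("environment", "transaction_id", "event_type").orderBy(
    F.col("event_version").desc(), F.col("updated_at").desc()
)

deduped = (
    raw.dropDuplicates(["ingest_fingerprint"])       # drop byte-identical retry duplicates
       .withColumn("rn", F.row_number().over(version_rank))
       .filter(F.col("rn") == 1)                     # keep the newest version in the batch
       .drop("rn")
)
```

Deduping in Spark keeps the staged batch and the MERGE small; deduping only in Snowflake is operationally simpler but scans more staged rows. Either answer is acceptable if you argue the trade-off.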
You may assume you can add small metadata tables in Snowflake (or DynamoDB if justified), but you must justify the added operational complexity.
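
As one example of such a metadata table, a small run ledger in Snowflake can carry watermarks and cursor progress without introducing a new stateful service; the DDL below is illustrative:

```python
# Illustrative run-ledger DDL (executed via the Snowflake connector like any
# other statement); table and column names are placeholders.
CREATE_INGEST_RUNS_SQL = """
CREATE TABLE IF NOT EXISTS ops.ingest_runs (
    run_id         STRING,         -- Airflow run_id: one row per logical window
    environment    STRING,
    window_start   TIMESTAMP_NTZ,  -- updated_since lower bound for the run
    window_end     TIMESTAMP_NTZ,
    last_cursor    STRING,         -- last cursor whose page landed in S3
    pages_landed   NUMBER,
    events_landed  NUMBER,
    status         STRING,         -- RUNNING / LANDED / MERGED / FAILED
    updated_at     TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP()
)
"""

# A retried Airflow task re-reads its own run_id row: if the window is already
# LANDED it skips the fetch and proceeds to the MERGE, so an ambiguous failure
# after landing does not re-ingest the window.
```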