Context
You’re joining the Data Platform team at MercuryPay, a fintech payment processor that supports ~8M consumer wallets and ~220K merchants across North America. MercuryPay’s Risk & Reconciliation org uses analytics in Snowflake to reconcile payouts, detect double-charges, and satisfy SOX audit requirements. A critical upstream dependency is a third-party Payments Events API (owned by a banking partner) that exposes transaction lifecycle events (authorized, captured, refunded, chargeback, dispute updates).
Today, MercuryPay runs an hourly Airflow DAG that calls the partner API and lands JSON into S3, then a Spark job normalizes it into Snowflake. The API is notoriously unreliable: it frequently times out, sometimes returns HTTP 504 after having processed the request, and occasionally returns partial pages. This has caused duplicate events and missing events in downstream tables, leading to reconciliation drift and manual finance escalations.
Your task is to design a complete idempotency strategy for this ingestion pipeline so that retries, backfills, and partial failures do not create duplicates or gaps, while keeping the pipeline operationally manageable.
Scale Requirements
- Event volume: ~120M events/day average, 300M/day peak (promotions + seasonal spikes)
- API limits: 1,000 requests/min per API key; max page size 5,000 records (a back-of-the-envelope check follows this list)
- Latency SLO: 99% of events queryable in Snowflake within 15 minutes of partner availability
- Backfill: Must support reprocessing up to 30 days of history without corrupting downstream state
- Storage: Raw retention 90 days in S3; curated retention 7 years in Snowflake (audit)
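
As a quick back-of-the-envelope check (a sketch using only the figures above; the variable names are illustrative), the rate limit is not the bottleneck even at peak:

```python
# Back-of-the-envelope capacity check using the peak figures above
# (illustrative only; not a sizing guarantee).
PEAK_EVENTS_PER_DAY = 300_000_000
PAGE_SIZE = 5_000                 # max records per page
RATE_LIMIT_PER_MIN = 1_000        # requests/min per API key

pages_per_day = PEAK_EVENTS_PER_DAY / PAGE_SIZE                      # 60,000 pages/day
minutes_of_quota_per_day = pages_per_day / RATE_LIMIT_PER_MIN        # ~60 min of quota/day
pages_per_hourly_run = pages_per_day / 24                            # ~2,500 pages/run
minutes_per_hourly_run = pages_per_hourly_run / RATE_LIMIT_PER_MIN   # ~2.5 min/run

print(f"{pages_per_day:,.0f} pages/day -> {minutes_of_quota_per_day:.0f} min of daily quota; "
      f"{pages_per_hourly_run:,.0f} pages per hourly run (~{minutes_per_hourly_run:.1f} min)")
```

Even at peak, an hourly run needs only a few minutes of rate-limit budget, which leaves headroom for bounded retries and 30-day backfills.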
Data Characteristics
The partner API supports two access patterns:
- Cursor pagination (a retry-aware fetch sketch follows this list):
  `GET /events?updated_since=<ts>&page_size=5000&cursor=<token>`
- Event lookup:
  `GET /events/<event_id>`
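
As a sketch of how a retry-aware fetch against the cursor endpoint might look (the host, retry budget, and response shape below are assumptions, not partner documentation):

```python
import time
from typing import Optional

import requests

BASE_URL = "https://partner-api.example.com"   # placeholder host
MAX_ATTEMPTS = 5                               # bounded retry budget per page
BACKOFF_SECONDS = [1, 2, 4, 8, 16]             # simple exponential backoff

def fetch_page(session: requests.Session, updated_since: str,
               cursor: Optional[str]) -> dict:
    """Fetch one page, retrying with the SAME cursor on ambiguous failures.

    A 504 that the partner actually processed means the retry can return a
    duplicate page, so deduplication has to happen downstream.
    """
    params = {"updated_since": updated_since, "page_size": 5000}
    if cursor:
        params["cursor"] = cursor
    for attempt in range(MAX_ATTEMPTS):
        try:
            resp = session.get(f"{BASE_URL}/events", params=params, timeout=30)
            if resp.status_code in (429, 500, 502, 503, 504):
                raise requests.HTTPError(f"retryable status {resp.status_code}")
            resp.raise_for_status()
            return resp.json()  # assumed shape: {"events": [...], "next_cursor": "..."}
        except (requests.Timeout, requests.ConnectionError, requests.HTTPError):
            if attempt == MAX_ATTEMPTS - 1:
                raise           # hand the failure back to Airflow once the budget is spent
            time.sleep(BACKOFF_SECONDS[attempt])
```

Because the partner provides no reliable snapshot version, re-requesting the same cursor is the only safe retry; the duplicate pages it can produce are handled in later layers.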
Each event payload (JSON) includes the following fields; an illustrative example follows the table:
| Field | Type | Notes |
|---|---|---|
| event_id | string | Unique within an environment, but values can collide across environments (sandbox vs. prod) |
| transaction_id | string | Stable identifier for a payment transaction |
| event_type | string | e.g., AUTHORIZED, CAPTURED, REFUNDED, CHARGEBACK_OPENED |
| event_version | int | Monotonic per (transaction_id, event_type) but can skip numbers |
| updated_at | timestamp | Partner’s last update time; can arrive out of order by up to 2 hours |
| occurred_at | timestamp | Business time of the event |
| payload | object | Nested fields; schema evolves monthly |
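
For illustration only, a payload matching these fields might look like the dict below (every value is fabricated; note that no environment field is documented, which is why event_id collisions across sandbox and prod must be handled by the pipeline):

```python
# Illustrative payload shape inferred from the table above; all values are fabricated.
example_event = {
    "event_id": "evt_01HF3Z8K2Q",
    "transaction_id": "txn_9f2c77",
    "event_type": "CAPTURED",
    "event_version": 3,                      # monotonic per (transaction_id, event_type)
    "updated_at": "2025-03-14T17:22:05Z",    # partner's last update time
    "occurred_at": "2025-03-14T17:21:58Z",   # business time of the event
    "payload": {                             # nested; schema evolves monthly
        "amount": {"value": "42.10", "currency": "USD"},
        "card": {"network": "VISA", "last4": "4242"},
    },
}
```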
Known quality issues:
- Timeouts and ambiguous failures (did the partner process the request?)
- Duplicate pages when retrying with the same cursor (a fingerprint sketch follows this list)
- Out-of-order delivery and late-arriving updates (e.g., dispute status changes)
- Schema evolution: new nested fields appear without notice
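
One way (a sketch, not the only answer) to make retried and duplicated pages harmless is to stamp each raw record with a deterministic fingerprint at landing time; the environment value below is assumed to come from pipeline configuration, since the payload carries no such field:

```python
import hashlib
import json

def event_fingerprint(event: dict, environment: str) -> str:
    """Deterministic idempotency key for a raw event (sketch).

    `environment` (e.g. "prod") is assumed to come from pipeline config, which
    sidesteps sandbox/prod event_id collisions. Including event_version and
    updated_at lets a genuinely newer copy of the same event land as a new raw
    row, while byte-identical retries hash to the same key and can be dropped.
    """
    key_material = json.dumps(
        {
            "environment": environment,
            "event_id": event["event_id"],
            "event_version": event.get("event_version"),
            "updated_at": event.get("updated_at"),
        },
        sort_keys=True,
    )
    return hashlib.sha256(key_material.encode("utf-8")).hexdigest()
```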
Requirements
Functional
- Idempotent ingestion: Multiple retries of the same logical fetch must not create duplicate records in raw or curated tables.
- Exactly-once effect in Snowflake (in practice): downstream tables must converge to the correct state even under at-least-once ingestion (a MERGE sketch follows this list).
- Support late-arriving data: updates up to 48 hours late must be incorporated.
- Backfill safety: re-running any historical window (1 hour to 30 days) must be safe and must not require manual cleanup.
- Auditability: keep immutable raw history and a clear lineage from raw → curated; be able to answer “why is this row here?”
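
To make the exactly-once effect concrete: one possible curated grain is (environment, transaction_id, event_type), upserted with a MERGE that only overwrites when the incoming row is newer, so replaying any window converges to the same state. All table and column names below are placeholders, shown via the Snowflake Python connector:

```python
import snowflake.connector  # snowflake-connector-python

# Placeholder table/column names; a sketch of an idempotent incremental upsert.
MERGE_SQL = """
MERGE INTO curated.payment_event_facts AS tgt
USING staging.payment_events_batch AS src
  ON  tgt.environment    = src.environment
  AND tgt.transaction_id = src.transaction_id
  AND tgt.event_type     = src.event_type
WHEN MATCHED AND (src.event_version > tgt.event_version
                  OR (src.event_version = tgt.event_version
                      AND src.updated_at > tgt.updated_at))
  THEN UPDATE SET
    event_version = src.event_version,
    updated_at    = src.updated_at,
    occurred_at   = src.occurred_at,
    payload       = src.payload
WHEN NOT MATCHED THEN
  INSERT (environment, transaction_id, event_type, event_version,
          updated_at, occurred_at, payload)
  VALUES (src.environment, src.transaction_id, src.event_type, src.event_version,
          src.updated_at, src.occurred_at, src.payload)
"""

def merge_batch(conn: "snowflake.connector.SnowflakeConnection") -> None:
    # Replaying the same staged batch is a no-op: matched rows are updated only
    # when the source is strictly newer, so the target converges no matter how
    # many times a window is re-run.
    cur = conn.cursor()
    try:
        cur.execute(MERGE_SQL)
    finally:
        cur.close()
```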
Non-functional
- Operational resilience: handle partner brownouts without paging the on-call every hour; retries must be bounded and observable (an Airflow retry-config sketch follows this list).
- Cost control: avoid full-table rewrites in Snowflake; prefer incremental MERGE patterns.
- Security & compliance: PII in payload must be encrypted at rest; access controlled; changes logged (SOX).
Constraints
- Existing stack: Airflow 2.x, AWS (S3, KMS, CloudWatch), Spark 3.x on EMR, Snowflake, dbt Core.
- Team: 5 data engineers, on-call rotation; minimal appetite for running new stateful services.
- Partner API does not guarantee exactly-once delivery and does not provide a reliable “snapshot version.”
What You Need to Produce (Interview Deliverables)
In your design, explicitly address:
- Idempotency keys: What is the unique key for raw events vs curated facts? How do you handle environment collisions and schema changes?
- State management: How do you track progress (watermarks/cursors) safely across retries and concurrent runs?
- Retry semantics: How do you retry timeouts without duplicating pages? How do you detect partial page ingestion?
- Storage layers: What do you store in S3 (raw) vs Snowflake (staging vs curated)? What is immutable?
- Deduplication strategy: Where do you dedupe (Spark, Snowflake MERGE, both)? What are the trade-offs? (A Spark-side sketch follows this list.)
- Late data and backfills: How do you re-run a window and ensure convergence?
- Data quality: What checks prevent silent data loss (missing pages, gaps in updated_at coverage)?
- Observability: What metrics and alerts prove the pipeline is correct and progressing?
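
On the deduplication question, one defensible split is to collapse exact retry/page duplicates and rank versions in Spark before loading staging, then let a Snowflake MERGE (as sketched earlier) handle cross-batch convergence. A PySpark sketch of the Spark-side half, assuming an environment column and ingest_fingerprint were added at landing:

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

# Sketch: within one batch, keep a single row per logical event, preferring the
# newest version; cross-batch idempotency is still handled by the MERGE.
spark = SparkSession.builder.appName("dedupe_partner_events").getOrCreate()

raw = spark.read.json("s3://mercurypay-raw/partner_events/run_id=<run>/")  # placeholder path

version_rank = Window.partitionBy("environment", "transaction_id", "event_type").orderBy(
    F.col("event_version").desc(), F.col("updated_at").desc()
)

deduped = (
    raw.dropDuplicates(["ingest_fingerprint"])       # drop byte-identical retry duplicates
       .withColumn("rn", F.row_number().over(version_rank))
       .filter(F.col("rn") == 1)                     # keep the newest version in the batch
       .drop("rn")
)
```

Deduping in Spark keeps the staged batch and the MERGE small; deduping only in Snowflake is operationally simpler but scans more staged rows. Either answer is acceptable if you argue the trade-off.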
You may assume you can add small metadata tables in Snowflake (or DynamoDB if justified), but you must justify the added operational complexity.
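
As one example of such a metadata table, a small run ledger in Snowflake can carry watermarks and cursor progress without introducing a new stateful service; the DDL below is illustrative:

```python
# Illustrative run-ledger DDL (executed via the Snowflake connector like any
# other statement); table and column names are placeholders.
CREATE_INGEST_RUNS_SQL = """
CREATE TABLE IF NOT EXISTS ops.ingest_runs (
    run_id         STRING,         -- Airflow run_id: one row per logical window
    environment    STRING,
    window_start   TIMESTAMP_NTZ,  -- updated_since lower bound for the run
    window_end     TIMESTAMP_NTZ,
    last_cursor    STRING,         -- last cursor whose page landed in S3
    pages_landed   NUMBER,
    events_landed  NUMBER,
    status         STRING,         -- RUNNING / LANDED / MERGED / FAILED
    updated_at     TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP()
)
"""

# A retried Airflow task re-reads its own run_id row: if the window is already
# LANDED it skips the fetch and proceeds to the MERGE, so an ambiguous failure
# after landing does not re-ingest the window.
```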