Context
Airtable is building asynchronous AI-powered content generation for records in Interfaces and Automations, such as generating summaries, descriptions, or follow-up drafts from record fields. Today, requests are triggered from multiple surfaces and handled by stateless workers, but retries after worker crashes, upstream timeouts, or webhook replays can create duplicate generations, inconsistent record state, and wasted LLM spend.
Design a pipeline that safely processes content-generation jobs exactly once from the product perspective, while using at-least-once infrastructure underneath. Assume Airtable runs on AWS and wants a durable, observable pipeline that supports both user-triggered generation and bulk backfills.
Scale Requirements
- Peak ingest: 25K generation requests/minute during business hours
- Average payload: 8 KB request metadata + up to 64 KB prompt context
- LLM latency: P50 4s, P95 20s, occasional 60s timeouts
- Freshness target: 99% of successful jobs reflected in Airtable record state within 2 minutes
- Storage: 180 days of job history, 30 days of raw request/response payloads
- Retry behavior: tolerate duplicate delivery from queue, webhook replay, and worker restarts
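The throughput and latency figures above imply a rough concurrency budget for the worker fleet. A minimal sizing sketch using Little's law (L = λ · W), taking the peak ingest and P95 latency from the list above; the 30% headroom factor is an illustrative assumption, not part of the requirements:

```python
# Back-of-envelope worker sizing via Little's law (L = lambda * W).
# Peak rate and P95 latency come from the scale requirements;
# the headroom multiplier is an assumed operational margin.

PEAK_REQUESTS_PER_MIN = 25_000
P95_LATENCY_S = 20  # size for P95, not P50, so tail latency doesn't starve the queue

arrival_rate_per_s = PEAK_REQUESTS_PER_MIN / 60          # ~417 req/s
concurrent_jobs = arrival_rate_per_s * P95_LATENCY_S     # ~8,333 jobs in flight
with_headroom = concurrent_jobs * 1.3                    # ~10,800 worker slots

print(f"{arrival_rate_per_s:.0f} req/s -> ~{concurrent_jobs:.0f} concurrent jobs "
      f"(~{with_headroom:.0f} slots with 30% headroom)")
```

The takeaway is that tail latency, not average latency, dominates fleet size: sizing for the 4s P50 instead would understate required concurrency by 5x.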
Requirements
- Ingest requests from Airtable Automations, Airtable Interfaces, and internal APIs into a single durable pipeline.
- Guarantee idempotent job execution using a deterministic idempotency key derived from base_id, table_id, record_id, field_id, prompt_version, and input hash.
- Prevent duplicate work when the same request is submitted concurrently or retried after partial failure.
- Support safe retries for transient failures from the LLM provider, network timeouts, and downstream write conflicts.
- Persist job state transitions (queued, leased, running, succeeded, failed, dead_lettered) with auditability.
- Write results back to Airtable records only if the result is still current for that record version.
- Provide a backfill path for reprocessing millions of historical records after prompt-template changes.
- Define monitoring, data quality checks, and operational recovery procedures.
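The idempotency-key requirement above can be sketched as follows. The six components are the ones named in the requirements; joining them with an unambiguous separator and hashing with SHA-256 is one reasonable scheme, not a prescribed one:

```python
import hashlib

def idempotency_key(base_id: str, table_id: str, record_id: str,
                    field_id: str, prompt_version: str, input_hash: str) -> str:
    """Deterministic key: identical inputs always map to the same job.

    Components are joined with an ASCII unit separator so that no
    concatenation of two fields can collide with a different split
    of the same characters.
    """
    canonical = "\x1f".join(
        (base_id, table_id, record_id, field_id, prompt_version, input_hash)
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Because the key includes prompt_version and the input hash, a retry of the same request dedupes, while a prompt-template change or edited source field naturally produces a new job.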
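The persisted state machine (queued, leased, running, succeeded, failed, dead_lettered) can be enforced with a small transition table so that duplicate deliveries and stale workers produce no-ops rather than corrupt state. The retry edges (leased back to queued on lease expiry, failed back to queued on retry) are assumptions beyond the listed states:

```python
# Allowed job state transitions. Rejecting anything outside this graph
# at write time means an illegal replay (e.g. a stale worker trying to
# move a succeeded job back to running) is simply refused.

ALLOWED_TRANSITIONS = {
    "queued":        {"leased"},
    "leased":        {"running", "queued"},        # lease expiry re-queues
    "running":       {"succeeded", "failed"},
    "failed":        {"queued", "dead_lettered"},  # retry or give up
    "succeeded":     set(),                        # terminal
    "dead_lettered": set(),                        # terminal
}

def try_transition(current: str, target: str) -> bool:
    """Return True only if current -> target is a legal transition."""
    return target in ALLOWED_TRANSITIONS.get(current, set())
```

In a durable store this check would run inside a conditional write, so the transition and its guard commit atomically; each accepted transition can also be appended to an audit log to satisfy the auditability requirement.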
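The "write results back only if still current" requirement is a compare-and-set on the record version. A minimal in-memory sketch of the guard; in production this would be a conditional update against the record store (for example a DynamoDB ConditionExpression, or a version check before the Airtable write), and the dict here is just a stand-in:

```python
# Version-guarded write-back: the generated result is applied only if the
# record version the job was computed against is still the live version.
# Hypothetical in-memory record store for illustration.

records: dict[str, dict] = {}  # record_id -> {"version": int, "content": str}

def write_back(record_id: str, result: str, expected_version: int) -> bool:
    """Apply result iff the record version is unchanged; else drop it."""
    rec = records.get(record_id)
    if rec is None or rec["version"] != expected_version:
        return False  # record changed (or vanished) since enqueue: stale result
    rec["content"] = result
    rec["version"] += 1
    return True
```

Dropping the stale result (rather than retrying the write) is deliberate: the user has since edited the record, so regenerating against the new version is a new job with a new idempotency key.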
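For the safe-retry requirement, transient LLM and network failures are usually retried with capped exponential backoff plus jitter so that a provider brownout does not produce synchronized retry storms. The specific policy below (full jitter, 1s base, 60s cap matching the observed timeout ceiling) is an assumed example, not mandated by the requirements:

```python
import random

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Full-jitter exponential backoff: uniform in [0, min(cap, base * 2^attempt)].

    attempt is 0-indexed; the cap bounds the worst-case wait regardless
    of how many retries have occurred.
    """
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))
```

Because execution is idempotent (keyed as described above), a retry that races a slow-but-successful earlier attempt dedupes instead of double-spending an LLM call.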
Constraints
- Use AWS-managed services where possible; team size is 5 engineers.
- Minimize duplicate LLM calls because each call has material cost.
- PII may appear in prompts and outputs; encryption and access logging are required.
- The design must tolerate regional worker restarts and at-least-once delivery semantics.