Context
Airtable is building asynchronous AI-powered content generation for records in Interfaces and Automations, such as generating summaries, descriptions, or follow-up drafts from record fields. Today, requests are triggered from multiple surfaces and handled by stateless workers, but retries after worker crashes, upstream timeouts, or webhook replays can create duplicate generations, inconsistent record state, and wasted LLM spend.
Design a pipeline that safely processes content-generation jobs exactly once from the product perspective, while using at-least-once infrastructure underneath. Assume Airtable runs on AWS and wants a durable, observable pipeline that supports both user-triggered generation and bulk backfills.
Scale Requirements
- Peak ingest: 25K generation requests/minute during business hours
- Average payload: 8 KB request metadata + up to 64 KB prompt context
- LLM latency: P50 4s, P95 20s, occasional 60s timeouts
- Freshness target: 99% of successful jobs reflected in Airtable record state within 2 minutes
- Storage: 180 days of job history, 30 days of raw request/response payloads
- Retry behavior: tolerate duplicate delivery from queue, webhook replay, and worker restarts
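The throughput and latency figures above imply a rough concurrency budget for the worker fleet. A minimal sizing sketch using Little's law (L = λ · W), taking the peak ingest and P95 latency from the list above; the 30% headroom factor is an illustrative assumption, not part of the requirements:

```python
# Back-of-envelope worker sizing via Little's law (L = lambda * W).
# Peak rate and P95 latency come from the scale requirements;
# the headroom multiplier is an assumed operational margin.

PEAK_REQUESTS_PER_MIN = 25_000
P95_LATENCY_S = 20  # size for P95, not P50, so tail latency doesn't starve the queue

arrival_rate_per_s = PEAK_REQUESTS_PER_MIN / 60          # ~417 req/s
concurrent_jobs = arrival_rate_per_s * P95_LATENCY_S     # ~8,333 jobs in flight
with_headroom = concurrent_jobs * 1.3                    # ~10,800 worker slots

print(f"{arrival_rate_per_s:.0f} req/s -> ~{concurrent_jobs:.0f} concurrent jobs "
      f"(~{with_headroom:.0f} slots with 30% headroom)")
```

The takeaway is that tail latency, not average latency, dominates fleet size: sizing for the 4s P50 instead would understate required concurrency by 5x.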
Requirements
- Ingest requests from Airtable Automations, Airtable Interfaces, and internal APIs into a single durable pipeline.
- Guarantee idempotent job execution using a deterministic idempotency key derived from base_id, table_id, record_id, field_id, prompt_version, and input hash.
- Prevent duplicate work when the same request is submitted concurrently or retried after partial failure.
- Support safe retries for transient failures from the LLM provider, network timeouts, and downstream write conflicts.
- Persist job state transitions (queued, leased, running, succeeded, failed, dead_lettered) with auditability.
- Write results back to Airtable records only if the result is still current for that record version.
- Provide a backfill path for reprocessing millions of historical records after prompt-template changes.
- Define monitoring, data quality checks, and operational recovery procedures.
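The idempotency-key requirement above can be sketched as follows. The six components are the ones named in the requirements; joining them with an unambiguous separator and hashing with SHA-256 is one reasonable scheme, not a prescribed one:

```python
import hashlib

def idempotency_key(base_id: str, table_id: str, record_id: str,
                    field_id: str, prompt_version: str, input_hash: str) -> str:
    """Deterministic key: identical inputs always map to the same job.

    Components are joined with an ASCII unit separator so that no
    concatenation of two fields can collide with a different split
    of the same characters.
    """
    canonical = "\x1f".join(
        (base_id, table_id, record_id, field_id, prompt_version, input_hash)
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Because the key includes prompt_version and the input hash, a retry of the same request dedupes, while a prompt-template change or edited source field naturally produces a new job.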
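The persisted state machine (queued, leased, running, succeeded, failed, dead_lettered) can be enforced with a small transition table so that duplicate deliveries and stale workers produce no-ops rather than corrupt state. The retry edges (leased back to queued on lease expiry, failed back to queued on retry) are assumptions beyond the listed states:

```python
# Allowed job state transitions. Rejecting anything outside this graph
# at write time means an illegal replay (e.g. a stale worker trying to
# move a succeeded job back to running) is simply refused.

ALLOWED_TRANSITIONS = {
    "queued":        {"leased"},
    "leased":        {"running", "queued"},        # lease expiry re-queues
    "running":       {"succeeded", "failed"},
    "failed":        {"queued", "dead_lettered"},  # retry or give up
    "succeeded":     set(),                        # terminal
    "dead_lettered": set(),                        # terminal
}

def try_transition(current: str, target: str) -> bool:
    """Return True only if current -> target is a legal transition."""
    return target in ALLOWED_TRANSITIONS.get(current, set())
```

In a durable store this check would run inside a conditional write, so the transition and its guard commit atomically; each accepted transition can also be appended to an audit log to satisfy the auditability requirement.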
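The "write results back only if still current" requirement is a compare-and-set on the record version. A minimal in-memory sketch of the guard; in production this would be a conditional update against the record store (for example a DynamoDB ConditionExpression, or a version check before the Airtable write), and the dict here is just a stand-in:

```python
# Version-guarded write-back: the generated result is applied only if the
# record version the job was computed against is still the live version.
# Hypothetical in-memory record store for illustration.

records: dict[str, dict] = {}  # record_id -> {"version": int, "content": str}

def write_back(record_id: str, result: str, expected_version: int) -> bool:
    """Apply result iff the record version is unchanged; else drop it."""
    rec = records.get(record_id)
    if rec is None or rec["version"] != expected_version:
        return False  # record changed (or vanished) since enqueue: stale result
    rec["content"] = result
    rec["version"] += 1
    return True
```

Dropping the stale result (rather than retrying the write) is deliberate: the user has since edited the record, so regenerating against the new version is a new job with a new idempotency key.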
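For the safe-retry requirement, transient LLM and network failures are usually retried with capped exponential backoff plus jitter so that a provider brownout does not produce synchronized retry storms. The specific policy below (full jitter, 1s base, 60s cap matching the observed timeout ceiling) is an assumed example, not mandated by the requirements:

```python
import random

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Full-jitter exponential backoff: uniform in [0, min(cap, base * 2^attempt)].

    attempt is 0-indexed; the cap bounds the worst-case wait regardless
    of how many retries have occurred.
    """
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))
```

Because execution is idempotent (keyed as described above), a retry that races a slow-but-successful earlier attempt dedupes instead of double-spending an LLM call.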
Constraints
- Use AWS-managed services where possible; team size is 5 engineers.
- Minimize duplicate LLM calls because each call has material cost.
- PII may appear in prompts and outputs; encryption and access logging are required.
- The design must tolerate regional worker restarts and at-least-once delivery semantics.