Context
Airtable wants to expand Airtable AI so users can generate and update record fields from natural-language prompts inside a Base. Example use cases include summarizing sales call notes into structured CRM fields, drafting campaign briefs from linked records, and answering questions over a table plus attached docs.
Today, one team proposes calling a generic model API directly from an Airtable automation step. Another proposes a custom workflow with prompt routing, retrieval over Base records and attachments, structured-output validation, and safety checks. You need to recommend when Airtable should build the custom path and when it should use the generic API directly.
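The custom path can be framed as a thin router in front of the model calls. A minimal sketch of that routing decision, assuming illustrative heuristics (function name and labels are hypothetical, not Airtable's actual logic):

```python
# Hypothetical routing sketch: classify a request, then pick an execution path.
# The heuristics here are illustrative placeholders.

def route(task_type: str, needs_base_data: bool) -> str:
    """Pick an execution path for an Airtable AI request.

    task_type: "draft" (freeform text) or "field_update" (structured output).
    needs_base_data: whether the answer must be grounded in records/attachments.
    """
    if needs_base_data:
        # Grounded tasks go through retrieval + validation to control hallucination.
        return "grounded_pipeline"
    if task_type == "draft":
        # Freeform drafting tolerates variance; the cheaper small model suffices.
        return "small_model"
    # Structured but ungrounded tasks fall back to a direct frontier-model call.
    return "frontier_model"
```

A router like this is also where the cost and latency constraints bite: the more traffic the small model can absorb, the more headroom remains for the expensive grounded path.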
Constraints
- p95 latency: < 2.5s for synchronous Airtable AI actions
- Cost ceiling: < $0.03 per successful request at 8M requests/month
- Hallucination ceiling: < 2% on a labeled eval set for factual field generation
- Prompt injection success rate from attachments or long-text fields: < 0.5%
- Structured output validity for target schema: > 99%
- Must respect Airtable permissioning: retrieved records/attachments must be user-authorized
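The cost constraint implies a concrete monthly budget, and any routing split can be checked against it with simple arithmetic. A back-of-envelope sketch using the numbers above (the per-model prices and 80/20 split are assumptions for illustration):

```python
# Back-of-envelope budget check from the stated constraints.
requests_per_month = 8_000_000
cost_ceiling_per_request = 0.03  # USD, per successful request

monthly_budget = requests_per_month * cost_ceiling_per_request
print(monthly_budget)  # 240000.0 USD/month at the ceiling

# Hypothetical blended cost: 80% of traffic on a small model at $0.005/request,
# 20% on a frontier model at $0.06/request.
blended = 0.8 * 0.005 + 0.2 * 0.06
print(round(blended, 4))  # 0.016 -> under the $0.03 ceiling
```

Note the ceiling applies per *successful* request, so retries and fallback calls must be folded into the blended cost, not counted separately.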
Available Resources
- Airtable Bases with tables, linked records, comments, long-text fields, and attachments
- Airtable Automations and Airtable AI surfaces for invocation
- A generic frontier model API plus a cheaper small-model API
- Internal metadata: field schemas, record history, user permissions, and usage logs
- 2,000 historical prompts with human-edited outputs; 500 candidate eval examples can be labeled this quarter
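With a labeled eval set, the comparison between the two approaches reduces to computing the constrained metrics per approach. A sketch of that harness, assuming each labeled example is annotated for groundedness and schema validity (the record shape and function name are assumptions):

```python
# Sketch of metric computation over a labeled eval set.
# Each result dict records whether the output hallucinated and was schema-valid.

def eval_metrics(results: list[dict]) -> dict:
    """Compute the constraint-facing metrics for one approach."""
    n = len(results)
    hallucinated = sum(1 for r in results if r["hallucinated"])
    valid = sum(1 for r in results if r["schema_valid"])
    return {
        "hallucination_rate": hallucinated / n,  # constraint: < 0.02
        "schema_validity": valid / n,            # constraint: > 0.99
    }
```

Running the same harness over the direct-API outputs and the custom-workflow outputs on the 500 labeled examples gives a like-for-like comparison; the 2,000 human-edited pairs can seed prompt and retrieval tuning without contaminating the eval set.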
Deliverables
- Define an evaluation-first framework to compare a direct generic API call against a custom Airtable AI workflow.
- Propose an architecture for the custom path, including when to use retrieval, structured output, routing, or fallback to a generic model call.
- Specify the decision criteria for build vs buy by use case (e.g., freeform drafting vs grounded field updates).
- Estimate cost and latency for both approaches and explain how you would stay within budget at target volume.
- Identify major failure modes, especially hallucination, prompt injection from attachments, and schema-invalid outputs, and describe mitigations.