You are building a document-ingestion service that converts unstructured business documents into structured records for downstream underwriting and operations workflows. Inputs include emails, invoices, bank statements, contracts, and broker-submitted notes, all with inconsistent formatting and occasional OCR noise. The system must extract fields such as legal entity name, invoice amount, due date, remittance address, and counterparty, and must attach a confidence score or supporting evidence to each extracted field. Volume is about 200K documents per month, and operations teams will review only a small fraction of outputs.
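To make the target output concrete, here is a minimal sketch of what one structured record could look like. Everything in it is an assumption for illustration and not part of the requirements: the names `ExtractedField` and `InvoiceRecord`, the rule that a non-null value must carry a verbatim evidence span, and the 0.8 review threshold are all hypothetical.

```python
from dataclasses import dataclass, fields
from datetime import date
from decimal import Decimal
from typing import Generic, Optional, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class ExtractedField(Generic[T]):
    """One extracted value plus the confidence and evidence that back it."""

    value: Optional[T]              # None means "not found in the document"
    confidence: float               # assumed to be a score in [0.0, 1.0]
    evidence: Optional[str] = None  # verbatim span copied from the source text

    def __post_init__(self) -> None:
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0.0 and 1.0")
        # Assumed rule: refuse values that cannot be tied back to the source.
        if self.value is not None and not self.evidence:
            raise ValueError("a non-null value needs supporting evidence")


@dataclass(frozen=True)
class InvoiceRecord:
    """Hypothetical record covering the fields named in the scenario."""

    legal_entity_name: ExtractedField[str]
    invoice_amount: ExtractedField[Decimal]
    due_date: ExtractedField[date]
    remittance_address: ExtractedField[str]
    counterparty: ExtractedField[str]

    def needs_review(self, threshold: float = 0.8) -> bool:
        """Flag the record if any field is missing or below the threshold."""
        for spec in fields(self):
            item = getattr(self, spec.name)
            if item.value is None or item.confidence < threshold:
                return True
        return False
```

A shape like this lets the small share of human review be directed at records where `needs_review()` returns `True`, instead of being sampled blindly.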
How would you design the extraction system so that it produces reliable structured output from noisy text while keeping latency and cost acceptable at this volume? Explain your approach to prompting, validation, and evaluation, and to handling failure modes such as hallucinated fields, conflicting evidence, and prompt injection embedded in the source text.