FinFlow automates accounts payable for mid-market retailers and receives scanned invoices from thousands of vendors. OCR alone produces unreliable output because key fields such as invoice number, total amount, tax, and line items appear in different positions and table structures across vendor templates.
You are given 180,000 invoice pages collected over 18 months from 12,000 vendors. Documents are primarily English, with about 15% containing mixed English/French headers. Inputs are PDFs or images converted to OCR tokens with bounding boxes, page size metadata, and reading order. Documents range from 1 to 4 pages in length, with 300-2,500 OCR tokens per page. Labels are available for document regions (header, vendor_block, billing_block, line_items_table, totals_block, footer) and for downstream key fields (invoice_id, invoice_date, due_date, subtotal, tax, total). Roughly 20% of pages contain noisy scans, skew, stamps, or handwritten notes.
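To make the input description concrete, here is a minimal sketch of how one page of OCR output might be represented before modeling. The class names (OcrToken, Page), the corner-format pixel boxes, and the 0-1000 coordinate scale are illustrative assumptions and are not specified by the dataset; only the region label names come from the description above.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Region labels as listed in the dataset description.
REGION_LABELS = (
    "header", "vendor_block", "billing_block",
    "line_items_table", "totals_block", "footer",
)

Box = Tuple[float, float, float, float]  # assumed (x0, y0, x1, y1) in page pixels


@dataclass
class OcrToken:  # hypothetical container, not a real library type
    text: str
    box: Box
    order: int  # reading-order index supplied by the OCR step


@dataclass
class Page:  # hypothetical container, not a real library type
    width: float
    height: float
    tokens: List[OcrToken]


def normalize_box(box: Box, width: float, height: float, scale: int = 1000):
    """Map pixel coordinates onto a 0..scale grid so pages of different
    sizes and scan resolutions share one coordinate system."""
    x0, y0, x1, y1 = box

    def to_grid(value: float, extent: float) -> int:
        # Clamp so skewed or slightly out-of-bounds boxes stay in range.
        return max(0, min(scale, round(scale * value / extent)))

    return (to_grid(x0, width), to_grid(y0, height),
            to_grid(x1, width), to_grid(y1, height))


def page_features(page: Page):
    """Return (text, normalized_box) pairs sorted by reading order."""
    ordered = sorted(page.tokens, key=lambda t: t.order)
    return [(t.text, normalize_box(t.box, page.width, page.height))
            for t in ordered]
```

Normalizing by page size is one common way to make coordinates comparable across templates; whether 1000 is the right grid depends on the model chosen.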
A strong solution should improve structured field extraction by first performing document layout analysis. The targets are >=92% macro-F1 on region labeling and >=95% recall on line_items_table and totals_block, since missing either of those regions causes downstream extraction to fail.
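The acceptance criteria can be checked with a short evaluation routine. The sketch below assumes each evaluated unit (token or region, which the brief does not specify) carries exactly one gold label and one predicted label; the function names per_class_scores and meets_targets are hypothetical.

```python
from collections import Counter


def per_class_scores(y_true, y_pred, labels):
    """Compute precision, recall, and F1 for each label from paired lists."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for gold, pred in zip(y_true, y_pred):
        if gold == pred:
            tp[gold] += 1
        else:
            fp[pred] += 1  # predicted label was wrong
            fn[gold] += 1  # gold label was missed
    scores = {}
    for label in labels:
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        pr_sum = precision + recall
        f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
        scores[label] = {"precision": precision, "recall": recall, "f1": f1}
    return scores


def meets_targets(scores, macro_f1_min=0.92, recall_min=0.95,
                  critical=("line_items_table", "totals_block")):
    """Return (passed, macro_f1). Macro-F1 is the unweighted mean of the
    per-class F1 values, so rare regions count as much as common ones."""
    macro_f1 = sum(s["f1"] for s in scores.values()) / len(scores)
    recall_ok = all(scores[c]["recall"] >= recall_min for c in critical)
    return macro_f1 >= macro_f1_min and recall_ok, macro_f1
```

Reporting recall separately for the two critical regions matters because a model can reach a high macro-F1 while still missing too many tables or totals blocks.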