You are building an extraction system for a finance automation platform that ingests millions of unstructured documents each month, including invoices, receipts, purchase orders, and vendor emails. The documents arrive as PDFs, scans, images, and raw text, with inconsistent layouts, OCR noise, tables, handwritten annotations, and multiple languages. Downstream workflows depend on structured fields such as vendor name, invoice number, dates, currency, line items, tax amounts, payment terms, and policy-related signals, but labeled data is incomplete and schema coverage varies by document type. You need an NLP-first approach that can generalize across templates while remaining robust to long documents and low-quality text.
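To make the extraction target concrete, the structured fields named above could be pinned down as a typed schema in which every field is optional, since schema coverage varies by document type. The sketch below is only illustrative: the class names (`LineItem`, `ExtractedDocument`), the field names, the `doc_type` labels, the use of ISO 4217 currency codes, and the `missing_fields` helper are all assumptions and not part of the scenario.

```python
from dataclasses import dataclass, field, fields
from datetime import date
from decimal import Decimal
from typing import List, Optional


@dataclass
class LineItem:
    # Hypothetical line-item shape; only the description is required.
    description: str
    quantity: Optional[Decimal] = None
    unit_price: Optional[Decimal] = None
    amount: Optional[Decimal] = None


@dataclass
class ExtractedDocument:
    # Assumed labels: "invoice", "receipt", "purchase_order", "vendor_email".
    doc_type: str
    vendor_name: Optional[str] = None
    invoice_number: Optional[str] = None
    issue_date: Optional[date] = None
    due_date: Optional[date] = None
    currency: Optional[str] = None  # assumed to hold an ISO 4217 code such as "EUR"
    line_items: List[LineItem] = field(default_factory=list)
    tax_amount: Optional[Decimal] = None
    payment_terms: Optional[str] = None
    policy_flags: List[str] = field(default_factory=list)  # policy-related signals

    def missing_fields(self) -> List[str]:
        """Return the names of fields whose value is None (not extracted)."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]


if __name__ == "__main__":
    doc = ExtractedDocument(
        doc_type="receipt",
        vendor_name="Acme Ltd",
        currency="EUR",
        tax_amount=Decimal("4.20"),
    )
    print(doc.missing_fields())
    # ['invoice_number', 'issue_date', 'due_date', 'payment_terms']
```

Keeping monetary values as `Decimal` rather than `float` avoids rounding drift in tax and line-item totals, and a helper like `missing_fields` gives a simple per-document coverage signal when labeled data is incomplete.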
How would you design and implement an unstructured data extraction system for this setting, including the modeling approach, preprocessing pipeline, and evaluation strategy needed to make it reliable at scale?