Business Context
FinFlow, a document automation platform, processes invoices, receipts, and tax forms submitted by enterprise customers. Standard OCR works on clean scans, but accuracy drops on multi-column layouts, tables, stamps, handwritten notes, and low-quality mobile photos, so the team wants a multimodal model that uses text, layout, and image signals together.
Data
- Volume: 2.4M document pages with OCR tokens and bounding boxes; 180K pages have field-level ground truth
- Document types: invoices (52%), receipts (21%), forms (17%), statements (10%)
- Text length: 20-2,500 OCR tokens per page; median 310
- Language: English (78%), Spanish (14%), French (8%)
- Labels: key-value extraction targets such as invoice_number, vendor_name, invoice_date, total_amount, tax_amount, and line_item_total
- Class skew: some fields appear on >95% of pages, others on <40%
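To make the data description concrete, a single labeled page might look like the record below. This is a hypothetical schema for illustration only; the field names, paths, and BIO tagging convention are assumptions, not the actual FinFlow format.

```python
# Hypothetical example of one labeled page record (illustrative schema).
page = {
    "page_id": "inv-000123-p1",
    "image_path": "pages/inv-000123-p1.png",
    "tokens": ["INVOICE", "#", "4471", "Total", "$1,240.00"],
    # One [x0, y0, x1, y1] box per token, in pixel coordinates.
    "bboxes": [
        [40, 30, 180, 60], [190, 30, 210, 60], [220, 30, 300, 60],
        [40, 700, 110, 730], [500, 700, 640, 730],
    ],
    # BIO tags aligned one-to-one with tokens for token-level extraction.
    "labels": ["O", "O", "B-invoice_number", "O", "B-total_amount"],
}

# Tokens, boxes, and labels must stay aligned throughout preprocessing.
assert len(page["tokens"]) == len(page["bboxes"]) == len(page["labels"])
```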
Success Criteria
A production-ready system should achieve entity-level F1 >= 0.90 on core fields, page inference latency < 400 ms on an A10 GPU, and remain robust to rotated scans, noisy OCR, and variable templates.
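The entity-level F1 target above can be computed as a micro-average over extracted (field, value) pairs, where an extraction counts only if both the field name and value match the ground truth exactly. A minimal sketch, with a deliberately strict exact-match rule as a simplifying assumption:

```python
def entity_f1(gold_pages, pred_pages):
    """Micro-averaged entity-level F1.

    gold_pages / pred_pages: parallel lists of {field_name: value} dicts,
    one dict per page. A prediction is a true positive only if both the
    field name and its value match the gold annotation exactly.
    """
    tp = fp = fn = 0
    for gold, pred in zip(gold_pages, pred_pages):
        gold_set = set(gold.items())
        pred_set = set(pred.items())
        tp += len(gold_set & pred_set)   # exact (field, value) matches
        fp += len(pred_set - gold_set)   # spurious or wrong-value extractions
        fn += len(gold_set - pred_set)   # missed gold fields
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# One correct field, one wrong value -> P = R = F1 = 0.5.
gold = [{"invoice_number": "4471", "total_amount": "1240.00"}]
pred = [{"invoice_number": "4471", "total_amount": "999.00"}]
print(entity_f1(gold, pred))  # 0.5
```

In practice exact match is usually relaxed per field (e.g. normalizing dates and currency amounts before comparison), and scores should be reported per field to surface the class skew noted above.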
Constraints
- PII must remain in a private VPC
- The model must support weekly retraining with newly labeled documents
- The solution should degrade gracefully when OCR quality is poor
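The graceful-degradation constraint implies a routing decision before extraction. One possible rule, sketched below with illustrative thresholds (the function name and cutoffs are assumptions, not a specified design):

```python
def route_page(mean_ocr_confidence, n_tokens,
               conf_threshold=0.60, min_tokens=5):
    """Hypothetical routing rule for degrading gracefully on poor OCR.

    Sends low-quality pages to a fallback path (e.g. re-OCR at higher DPI
    or human review) rather than emitting low-confidence extractions.
    Threshold values are illustrative and would be tuned on validation data.
    """
    if n_tokens < min_tokens:
        # Almost no text recovered: likely a bad scan or blank page.
        return "manual_review"
    if mean_ocr_confidence < conf_threshold:
        # Text present but noisy: retry OCR before model extraction.
        return "fallback_reocr"
    return "model_extraction"
```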
Requirements
- Build a multimodal OCR post-processing pipeline that combines OCR text, token coordinates, and page image features.
- Fine-tune a layout-aware transformer for token-level field extraction.
- Define preprocessing for OCR normalization, bounding-box scaling, and long-page truncation/chunking.
- Show a modern Python implementation using Hugging Face Transformers and a realistic training loop.
- Explain how you would evaluate extraction quality across document types, languages, and OCR noise levels.
- Describe fallback behavior for pages with missing tokens, bad scans, or unseen templates.
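Two of the preprocessing requirements above can be sketched directly: bounding-box scaling to the 0-1000 coordinate grid used by layout-aware models in the LayoutLM family, and sliding-window chunking so long pages fit a transformer's sequence limit without losing fields at chunk boundaries. The window and stride values below are illustrative defaults, not tuned settings:

```python
def normalize_bbox(bbox, page_width, page_height, scale=1000):
    """Scale a pixel-space [x0, y0, x1, y1] box to the 0-1000 grid
    expected by LayoutLM-style models, clamping to the valid range."""
    x0, y0, x1, y1 = bbox
    return [
        min(scale, max(0, round(scale * x0 / page_width))),
        min(scale, max(0, round(scale * y0 / page_height))),
        min(scale, max(0, round(scale * x1 / page_width))),
        min(scale, max(0, round(scale * y1 / page_height))),
    ]


def chunk_page(tokens, bboxes, max_len=512, stride=128):
    """Split a long page into overlapping windows of tokens and boxes.

    Consecutive chunks overlap by `stride` tokens so a field spanning a
    chunk boundary appears whole in at least one chunk; overlapping
    predictions are deduplicated downstream.
    """
    chunks = []
    start = 0
    while start < len(tokens):
        end = min(start + max_len, len(tokens))
        chunks.append((tokens[start:end], bboxes[start:end]))
        if end == len(tokens):
            break
        start += max_len - stride
    return chunks
```

With Hugging Face Transformers, the normalized boxes and chunked token windows would then be fed to a layout-aware processor/model pair (e.g. a LayoutLM-family checkpoint fine-tuned for token classification); the helpers above cover only the pure preprocessing step.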