Business Context
LedgerLens, an expense management platform, ingests employee receipt photos from mobile devices and email uploads. Many receipts are poorly scanned, skewed, faded, or partially blurred, and the finance team needs an NLP pipeline that can still extract key fields reliably for downstream reimbursement.
Data
- Volume: 850,000 historical receipts with OCR text and 120,000 manually verified field annotations
- Text length: 10-250 OCR lines per receipt; 30-1,500 characters total
- Language: Primarily English (78%), plus Spanish and French merchant text
- Field distribution: merchant_name, transaction_date, total_amount, tax_amount, currency, payment_method; ~18% of receipts are missing one or more optional fields
- Noise profile: OCR character substitutions (0/O, 1/I, 8/B), line breaks in the middle of entities, duplicated lines, and low-confidence OCR spans
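The noise profile above lends itself to a deterministic cleanup pass before any model sees the text. A minimal sketch of such a pass (the function name and the join/fix heuristics are illustrative assumptions, not part of the spec):

```python
import re

# Common OCR digit confusions in numeric contexts (0/O, 1/I, 8/B)
OCR_DIGIT_FIXES = str.maketrans({"O": "0", "I": "1", "B": "8"})

def clean_ocr_lines(lines):
    """Deduplicate lines, re-join lines broken mid-entity, fix digit confusions.

    Note: exact-duplicate removal is naive; a real pipeline would only drop
    adjacent duplicates so repeated line items on a receipt survive.
    """
    seen, out = set(), []
    for line in lines:
        line = line.strip()
        if not line or line in seen:  # drop empty and duplicated lines
            continue
        seen.add(line)
        # Re-join a line that ends mid-amount with a following numeric fragment
        if out and re.search(r"[\d$€£]\s*$", out[-1]) and re.match(r"^[\d.,]", line):
            out[-1] = out[-1] + " " + line
        else:
            out.append(line)
    # Apply digit fixes only when the letter is adjacent to a digit,
    # to avoid corrupting ordinary words like "TOTAL"
    return [
        re.sub(
            r"(?<=\d)[OIB]|[OIB](?=\d)",
            lambda m: m.group(0).translate(OCR_DIGIT_FIXES),
            line,
        )
        for line in out
    ]
```

Keeping this stage rule-based makes its behavior auditable, which matters when downstream fields drive reimbursements.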
Success Criteria
A production-ready solution should achieve at least 92% field-level F1 on merchant_name, transaction_date, and total_amount, and at least 95% recall on total_amount. End-to-end inference should stay under 300 ms per receipt on a single T4 GPU.
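Field-level F1 here is naturally defined as exact match on normalized values, scored per field across receipts. A sketch of the metric, assuming predictions and gold labels are dicts mapping field name to a normalized value (None when absent):

```python
def field_metrics(preds, golds, field):
    """Field-level precision/recall/F1 via exact match on normalized values.

    preds, golds: parallel lists of dicts, field name -> value or None.
    A missing gold value means the field is genuinely absent (not an error).
    """
    tp = fp = fn = 0
    for p, g in zip(preds, golds):
        pv, gv = p.get(field), g.get(field)
        if pv is not None and gv is not None and pv == gv:
            tp += 1          # correct extraction
        elif pv is not None and pv != gv:
            fp += 1          # spurious or wrong value
        if gv is not None and pv != gv:
            fn += 1          # missed or wrong value
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1
```

Scoring after normalization (e.g., "$12.50" and "12,50 USD" both become 12.50) keeps the metric from penalizing surface-form differences the finance team does not care about.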
Constraints
- Must run in a secure environment; receipt images and text cannot leave company infrastructure
- The system must degrade gracefully when OCR quality is poor and return confidence scores for human review
- Model size and throughput must support batch inference of 20,000 receipts per hour
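One way to satisfy the graceful-degradation constraint is to combine per-field model confidence with the underlying OCR confidence and route weak receipts to human review. A hypothetical routing rule (the score combination, field names, and threshold are all assumptions to be tuned on validation data):

```python
def route_receipt(field_probs, ocr_confidences, threshold=0.85):
    """Decide whether a receipt can be auto-approved or needs human review.

    field_probs: model probability per extracted field (hypothetical scores).
    ocr_confidences: per-token OCR confidences for the receipt.
    """
    # Conservative: the weakest OCR token caps trust in the whole receipt
    ocr_floor = min(ocr_confidences) if ocr_confidences else 0.0
    # Penalize model confidence when the underlying OCR is weak
    scores = {f: p * ocr_floor for f, p in field_probs.items()}
    needs_review = [f for f, s in scores.items() if s < threshold]
    return {
        "scores": scores,
        "needs_review": needs_review,
        "auto_approve": not needs_review,
    }
```

Returning per-field scores (rather than a single receipt-level flag) lets reviewers correct only the uncertain fields, which keeps the human-review queue small.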
Requirements
- Build an NLP pipeline that converts noisy OCR output into structured receipt fields.
- Describe how you would preprocess OCR text from low-quality scans.
- Implement a token-level extraction model and any post-processing rules needed for normalization.
- Show how you would handle OCR uncertainty, multilingual merchant text, and missing fields.
- Define an evaluation plan, including field-level metrics, validation strategy, and error analysis for poor scans.
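As a starting point for the extraction and normalization requirements, a common design is a multilingual transformer fine-tuned for token classification with BIO tags, followed by rule-based normalization of the decoded spans. The normalization half can be sketched as below; the handled amount formats and date patterns are assumptions chosen for the English/Spanish/French mix, not an exhaustive set:

```python
import re
from datetime import datetime

def normalize_amount(raw):
    """Normalize an amount span like '$1,234.56' or '1.234,56 €' to a float."""
    s = re.sub(r"[^\d.,]", "", raw)
    if "," in s and "." in s:
        # Whichever separator occurs last is the decimal point
        if s.rfind(",") > s.rfind("."):
            s = s.replace(".", "").replace(",", ".")
        else:
            s = s.replace(",", "")
    elif "," in s:
        # A lone comma with exactly two trailing digits is a decimal separator
        s = s.replace(",", ".") if re.search(r",\d{2}$", s) else s.replace(",", "")
    return float(s) if s else None

def normalize_date(raw, formats=("%d/%m/%Y", "%m/%d/%Y", "%Y-%m-%d", "%d %b %Y")):
    """Try a fixed list of assumed date formats; return ISO date or None.

    Ambiguous dates like 03/04/2024 resolve to the first matching format,
    so the format order itself is a policy decision.
    """
    for fmt in formats:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None
```

Returning None instead of raising keeps the pipeline's graceful-degradation contract: an unparseable span becomes a missing field with low confidence rather than a crashed batch.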