Business Context
LedgerLens, an expense management platform, ingests employee receipt photos from mobile devices and email uploads. Many receipts are poorly scanned, skewed, faded, or partially blurred, and the finance team needs an NLP pipeline that can still extract key fields reliably for downstream reimbursement.
Data
- Volume: 850,000 historical receipts with OCR text and 120,000 manually verified field annotations
- Text length: 10-250 OCR lines per receipt; 30-1,500 characters total
- Language: Primarily English (78%), plus Spanish and French merchant text
- Field distribution: merchant_name, transaction_date, total_amount, tax_amount, currency, payment_method; ~18% of receipts are missing one or more optional fields
- Noise profile: OCR character substitutions (0/O, 1/I, 8/B), line breaks in the middle of entities, duplicated lines, and low-confidence OCR spans
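The noise profile above lends itself to a deterministic cleanup pass before any model sees the text. A minimal sketch of such a pass (the function name and the join/fix heuristics are illustrative assumptions, not part of the spec):

```python
import re

# Common OCR digit confusions in numeric contexts (0/O, 1/I, 8/B)
OCR_DIGIT_FIXES = str.maketrans({"O": "0", "I": "1", "B": "8"})

def clean_ocr_lines(lines):
    """Deduplicate lines, re-join lines broken mid-entity, fix digit confusions.

    Note: exact-duplicate removal is naive; a real pipeline would only drop
    adjacent duplicates so repeated line items on a receipt survive.
    """
    seen, out = set(), []
    for line in lines:
        line = line.strip()
        if not line or line in seen:  # drop empty and duplicated lines
            continue
        seen.add(line)
        # Re-join a line that ends mid-amount with a following numeric fragment
        if out and re.search(r"[\d$€£]\s*$", out[-1]) and re.match(r"^[\d.,]", line):
            out[-1] = out[-1] + " " + line
        else:
            out.append(line)
    # Apply digit fixes only when the letter is adjacent to a digit,
    # to avoid corrupting ordinary words like "TOTAL"
    return [
        re.sub(
            r"(?<=\d)[OIB]|[OIB](?=\d)",
            lambda m: m.group(0).translate(OCR_DIGIT_FIXES),
            line,
        )
        for line in out
    ]
```

Keeping this stage rule-based makes its behavior auditable, which matters when downstream fields drive reimbursements.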
Success Criteria
A production-ready solution should achieve at least 92% field-level F1 on merchant_name, transaction_date, and total_amount, and at least 95% recall on total_amount. End-to-end inference should stay under 300 ms per receipt on a single T4 GPU.
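Field-level F1 here is naturally defined as exact match on normalized values, scored per field across receipts. A sketch of the metric, assuming predictions and gold labels are dicts mapping field name to a normalized value (None when absent):

```python
def field_metrics(preds, golds, field):
    """Field-level precision/recall/F1 via exact match on normalized values.

    preds, golds: parallel lists of dicts, field name -> value or None.
    A missing gold value means the field is genuinely absent (not an error).
    """
    tp = fp = fn = 0
    for p, g in zip(preds, golds):
        pv, gv = p.get(field), g.get(field)
        if pv is not None and gv is not None and pv == gv:
            tp += 1          # correct extraction
        elif pv is not None and pv != gv:
            fp += 1          # spurious or wrong value
        if gv is not None and pv != gv:
            fn += 1          # missed or wrong value
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1
```

Scoring after normalization (e.g., "$12.50" and "12,50 USD" both become 12.50) keeps the metric from penalizing surface-form differences the finance team does not care about.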
Constraints
- Must run in a secure environment; receipt images and text cannot leave company infrastructure
- The system must degrade gracefully when OCR quality is poor and return confidence scores for human review
- Model size and throughput must support batch inference of 20,000 receipts per hour
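One way to satisfy the graceful-degradation constraint is to combine per-field model confidence with the underlying OCR confidence and route weak receipts to human review. A hypothetical routing rule (the score combination, field names, and threshold are all assumptions to be tuned on validation data):

```python
def route_receipt(field_probs, ocr_confidences, threshold=0.85):
    """Decide whether a receipt can be auto-approved or needs human review.

    field_probs: model probability per extracted field (hypothetical scores).
    ocr_confidences: per-token OCR confidences for the receipt.
    """
    # Conservative: the weakest OCR token caps trust in the whole receipt
    ocr_floor = min(ocr_confidences) if ocr_confidences else 0.0
    # Penalize model confidence when the underlying OCR is weak
    scores = {f: p * ocr_floor for f, p in field_probs.items()}
    needs_review = [f for f, s in scores.items() if s < threshold]
    return {
        "scores": scores,
        "needs_review": needs_review,
        "auto_approve": not needs_review,
    }
```

Returning per-field scores (rather than a single receipt-level flag) lets reviewers correct only the uncertain fields, which keeps the human-review queue small.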
Requirements
- Build an NLP pipeline that converts noisy OCR output into structured receipt fields.
- Describe how you would preprocess OCR text from low-quality scans.
- Implement a token-level extraction model and any post-processing rules needed for normalization.
- Show how you would handle OCR uncertainty, multilingual merchant text, and missing fields.
- Define an evaluation plan, including field-level metrics, validation strategy, and error analysis for poor scans.
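As a starting point for the extraction and normalization requirements, a common design is a multilingual transformer fine-tuned for token classification with BIO tags, followed by rule-based normalization of the decoded spans. The normalization half can be sketched as below; the handled amount formats and date patterns are assumptions chosen for the English/Spanish/French mix, not an exhaustive set:

```python
import re
from datetime import datetime

def normalize_amount(raw):
    """Normalize an amount span like '$1,234.56' or '1.234,56 €' to a float."""
    s = re.sub(r"[^\d.,]", "", raw)
    if "," in s and "." in s:
        # Whichever separator occurs last is the decimal point
        if s.rfind(",") > s.rfind("."):
            s = s.replace(".", "").replace(",", ".")
        else:
            s = s.replace(",", "")
    elif "," in s:
        # A lone comma with exactly two trailing digits is a decimal separator
        s = s.replace(",", ".") if re.search(r",\d{2}$", s) else s.replace(",", "")
    return float(s) if s else None

def normalize_date(raw, formats=("%d/%m/%Y", "%m/%d/%Y", "%Y-%m-%d", "%d %b %Y")):
    """Try a fixed list of assumed date formats; return ISO date or None.

    Ambiguous dates like 03/04/2024 resolve to the first matching format,
    so the format order itself is a policy decision.
    """
    for fmt in formats:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None
```

Returning None instead of raising keeps the pipeline's graceful-degradation contract: an unparseable span becomes a missing field with low confidence rather than a crashed batch.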