Business Context
AppZen wants to adapt a pre-trained language model to a finance-specific workflow in AppZen Expense Audit: classifying expense-report line items into policy outcomes (Approve, Needs Review, or Reject) based on receipt text, employee memo, and merchant context. The goal is to reduce manual review load while preserving audit accuracy on high-risk spend.
Data
- Volume: 1.8M historical expense line items with reviewer outcomes
- Inputs: OCR receipt text, employee memo, merchant name, MCC code, currency, country, and policy snippets
- Text length: 20-1,200 tokens after OCR cleanup; median 180
- Language: 88% English, 12% multilingual receipts and memos
- Label distribution: Approve 72%, Needs Review 20%, Reject 8%
- Data quality issues: OCR noise, duplicated receipts, redacted fields, inconsistent merchant formatting
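The quality issues above suggest a light normalization pass before tokenization. A minimal sketch, assuming hypothetical field names (`receipt_text`, `memo`, `merchant`); a production pipeline would also handle redaction markers, MCC/currency normalization, and language detection:

```python
import hashlib
import re
from dataclasses import dataclass


# Hypothetical record shape; field names in the real AppZen pipeline may differ.
@dataclass
class LineItem:
    receipt_text: str
    memo: str
    merchant: str


def clean_ocr_text(text: str) -> str:
    """Drop non-printable OCR artifacts and collapse whitespace."""
    text = re.sub(r"[^\x20-\x7E\u00A0-\uFFFF]", " ", text)  # control chars -> space
    return re.sub(r"\s+", " ", text).strip()


def normalize_merchant(name: str) -> str:
    """Uppercase and strip store numbers / punctuation so merchant variants match."""
    name = name.upper()
    name = re.sub(r"[#*]?\s*\d{3,}$", "", name)  # trailing store numbers
    name = re.sub(r"[^A-Z0-9& ]", " ", name)
    return re.sub(r"\s+", " ", name).strip()


def dedupe_key(item: LineItem) -> str:
    """Hash of normalized fields, used to drop duplicated receipts."""
    basis = "|".join(
        [
            normalize_merchant(item.merchant),
            clean_ocr_text(item.receipt_text),
            clean_ocr_text(item.memo),
        ]
    )
    return hashlib.sha256(basis.encode()).hexdigest()
```

Deduplication on the normalized hash (rather than raw text) is what catches the "same receipt, different OCR whitespace" duplicates called out above.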
Success Criteria
A strong solution should achieve macro-F1 ≥ 0.84, reject-class recall ≥ 0.90, and batch-scoring inference latency under 250 ms per item. Performance should also remain stable across major geographies and merchant categories.
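These thresholds can be encoded as an explicit release gate. A pure-Python sketch (label names are illustrative; in practice scikit-learn's `classification_report` would produce the same per-class numbers):

```python
LABELS = ["approve", "needs_review", "reject"]  # illustrative label ids


def per_class_stats(y_true, y_pred, label):
    """Precision, recall, and F1 for one class, computed from scratch."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def release_gate(y_true, y_pred):
    """True only if macro-F1 >= 0.84 AND reject recall >= 0.90."""
    f1s = [per_class_stats(y_true, y_pred, lbl)[2] for lbl in LABELS]
    macro_f1 = sum(f1s) / len(f1s)
    reject_recall = per_class_stats(y_true, y_pred, "reject")[1]
    return macro_f1 >= 0.84 and reject_recall >= 0.90
```

Gating on reject-class recall separately matters here: with only 8% Reject labels, a model can post a respectable macro-F1 while still missing the high-risk spend the audit exists to catch.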
Constraints
- Financial data must remain in AppZen-controlled infrastructure
- The solution must support weekly retraining with new reviewer feedback
- The model should fit on a single A10/T4-class GPU for fine-tuning and production inference
- Outputs must be auditable for compliance and reviewer trust
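Under these constraints, parameter-efficient fine-tuning is a natural fit: it keeps training and inference on a single A10/T4-class GPU inside AppZen-controlled infrastructure, and each week's small adapter can be archived for audit. A hedged sketch using Hugging Face `transformers` + `peft`; the base model and hyperparameters are illustrative assumptions, not a prescription:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Multilingual encoder baseline (illustrative choice): covers the 12%
# non-English receipts and fits comfortably on a T4/A10.
BASE = "xlm-roberta-base"

tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForSequenceClassification.from_pretrained(BASE, num_labels=3)

# LoRA adapters: only a small fraction of weights train, so weekly
# retraining stays cheap and each week's adapter is a small, auditable artifact.
lora = LoraConfig(
    task_type="SEQ_CLS",
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["query", "value"],  # attention projections in XLM-R
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()
```

A frozen base model plus versioned weekly adapters also simplifies the auditability requirement: any historical decision can be replayed against the exact adapter that produced it.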
Requirements
- Design a fine-tuning approach for a finance-specific LLM or encoder model for this classification task.
- Define a realistic preprocessing pipeline for OCR-heavy financial text.
- Explain how you would handle class imbalance, multilingual inputs, and noisy labels.
- Provide Python code for preprocessing, fine-tuning, and evaluation using modern NLP tooling.
- Describe how you would validate the model, analyze errors, and decide whether it is ready for deployment in AppZen Expense Audit.
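For the class-imbalance requirement, a common starting point is inverse-frequency class weights in a weighted cross-entropy loss. A minimal pure-Python sketch; the resulting weights would typically be passed as the `weight` tensor of `torch.nn.CrossEntropyLoss` (or applied in a custom `Trainer.compute_loss`):

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by n / (k * count_c), so rare classes (Reject, 8%)
    get proportionally larger gradients than common ones (Approve, 72%).
    Weights are normalized so a perfectly balanced dataset yields 1.0 each."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * counts[c]) for c in counts}
```

With the label distribution given above (72/20/8), Reject receives roughly a 4.2x weight versus 0.46x for Approve. Oversampling, focal loss, or noisy-label filtering (e.g. confidence-based relabeling of reviewer outcomes) are alternatives the design should compare against this baseline.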