
You are working on document intelligence models that consume text extracted from receipts and contracts. The raw input often comes from OCR, so it may contain noise such as broken line breaks, inconsistent casing, and scanning artifacts, mixed in with content that matters: currency amounts, vendor names, dates, and clause numbering. Some downstream models need clean tokens for classification, while others need important entities and structure preserved for extraction tasks.
How would you preprocess text from receipts and contracts before modeling?
Task-aware tokenization for OCR text
Preserving entities for receipt and contract extraction
Feature engineering from noisy document text
Balancing cleanup with downstream classification needs
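
One way to approach the question is to split preprocessing into a shared OCR repair step plus two task-specific views of the same text. The sketch below is only an illustration under stated assumptions: the regular expressions, the placeholder tokens (`<AMOUNT>`, `<DATE>`, `<CLAUSE>`), and the function names `repair_ocr`, `for_classification`, and `for_extraction` are all hypothetical and not taken from any particular library. Real receipts and contracts would need broader patterns (for example, amounts without a currency symbol or dates written out in words).

```python
import re
import unicodedata

# Hypothetical patterns for illustration; real documents need broader coverage.
AMOUNT = re.compile(r"[$€£]\s?\d+(?:,\d{3})*(?:\.\d{2})?")
DATE = re.compile(r"\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b")
CLAUSE = re.compile(r"\b\d+(?:\.\d+)+\b")
# Priority order: amounts first, so "234.50" inside "$1,234.50" is not read as a clause.
PATTERNS = (("AMOUNT", AMOUNT), ("DATE", DATE), ("CLAUSE", CLAUSE))


def repair_ocr(text: str) -> str:
    """Shared cleanup assumed to be safe for both task types."""
    text = unicodedata.normalize("NFKC", text)      # unify width and compatibility forms
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)    # rejoin words hyphenated across lines
    text = re.sub(r"[ \t]+", " ", text)             # collapse runs of spaces and tabs
    return text.strip()                             # remaining line breaks are kept as layout


def for_classification(text: str) -> list[str]:
    """Lowercased tokens with entities replaced by placeholder tokens."""
    text = repair_ocr(text)
    for label, pattern in PATTERNS:
        text = pattern.sub(f" <{label}> ", text)
    # Keep placeholders as-is; lowercase everything else.
    return [t if t.startswith("<") else t.lower() for t in text.split()]


def for_extraction(text: str) -> tuple[str, list[tuple[str, int, int]]]:
    """Original casing and layout, plus (label, start, end) spans for entities."""
    text = repair_ocr(text)
    spans: list[tuple[str, int, int]] = []
    for label, pattern in PATTERNS:
        for m in pattern.finditer(text):
            # Skip matches that overlap an earlier, higher-priority span.
            if all(m.end() <= s or m.start() >= e for _, s, e in spans):
                spans.append((label, m.start(), m.end()))
    return text, sorted(spans, key=lambda span: span[1])


if __name__ == "__main__":
    sample = "ACME  Hardware\nTotal: $1,234.50 on 03/14/2024\nper clause 4.2 of the agree-\nment"
    print(for_classification(sample))
    cleaned, spans = for_extraction(sample)
    print([(label, cleaned[s:e]) for label, s, e in spans])
```

The design choice here is that destructive steps (lowercasing, replacing values with placeholders) live only in the classification view, while the extraction view keeps casing, line breaks, and exact values and records where the entities sit.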