You are building an information extraction pipeline for an enterprise support platform that ingests roughly 200,000 customer tickets per week. Operations teams want named entities such as product names, customer organizations, locations, dates, and person names extracted from noisy free-text messages so they can route issues, link accounts, and power downstream search. The data includes email-style formatting, signatures, ticket metadata pasted into the body, abbreviations, and occasional OCR artifacts from attachments. You have about 40,000 manually annotated tickets plus a larger pool of unlabeled historical text. The system needs to generalize to new product names and organization aliases that appear over time.
How would you design and implement a named entity recognition system for this workflow, including your preprocessing choices, modeling approach, and how you would evaluate whether the extracted entities are reliable enough for downstream use?