You are building an information extraction pipeline for a construction operations platform that ingests daily site reports, safety observations, requests for information (RFIs), and subcontractor notes. The text is noisy and semi-structured, with abbreviations, unit measurements, drawing references, equipment IDs, crew names, material codes, and location descriptions such as floor, gridline, or zone. You have about 35,000 documents annotated with entities such as equipment, material, location, issue type, subcontractor, and date, plus a much larger pool of unlabeled reports collected over several years. The extracted entities will feed downstream search, analytics, and alerting workflows, so the system needs to handle domain-specific vocabulary and inconsistent formatting.
How would you design a named entity recognition pipeline for this setting, including your approach to preprocessing, model selection, training, and evaluation? And how would you keep it robust to new terminology and annotation noise over time?