Business Context
AMD Construction Group receives thousands of short free-text RFIs, field notes, and coordination comments from project teams using AMD Construction Group SiteFlow. The research team wants an NLP system that uses word embeddings or modern language models to understand intent and route each note to the right workflow.
Data
You are given historical SiteFlow text records labeled into 5 intent classes: design_clarification, material_request, schedule_change, safety_issue, and inspection_followup.
- Volume: 420,000 labeled records from 18 months of projects
- Text length: 8-220 words (median 34)
- Language: English only, but includes abbreviations, trade jargon, drawing references, and inconsistent casing
- Label distribution: design_clarification 31%, material_request 24%, schedule_change 18%, safety_issue 9%, inspection_followup 18%
- Noise: OCR artifacts from uploaded PDFs, duplicated boilerplate, and shorthand such as lvl 03, RCP, MEP, and rev-2
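As an illustration of the kind of cleanup the noise described above calls for, here is a minimal normalization sketch using only the standard library. The abbreviation map, the function names normalize_note and drop_duplicates, and the choice to lowercase everything are assumptions made for illustration; a real mapping would need to be curated with project teams, and a transformer tokenizer may prefer the original casing.

```python
import re

# Hypothetical shorthand map (an assumption); a real one would be curated
# with field teams and is likely far larger.
ABBREVIATIONS = {
    "lvl": "level",
    "rev": "revision",
}


def normalize_note(text: str) -> str:
    """Lowercase, split shorthand like 'rev-2', and expand known abbreviations."""
    text = text.lower()
    # Separate letter-digit shorthand joined by a hyphen: 'rev-2' -> 'rev 2'.
    text = re.sub(r"\b([a-z]+)-(\d+)\b", r"\1 \2", text)
    # Collapse whitespace runs left behind by OCR or PDF extraction.
    text = re.sub(r"\s+", " ", text).strip()
    # Expand whole-token abbreviations; unknown tokens (rcp, mep) pass through.
    tokens = [ABBREVIATIONS.get(tok, tok) for tok in text.split(" ")]
    return " ".join(tokens)


def drop_duplicates(notes):
    """Remove notes that are identical after normalization, keeping the first."""
    seen = set()
    unique = []
    for note in notes:
        key = normalize_note(note)
        if key not in seen:
            seen.add(key)
            unique.append(note)
    return unique
```

For example, normalize_note("Lvl 03  RCP rev-2") yields "level 03 rcp revision 2". Domain acronyms such as RCP and MEP are deliberately left unexpanded here, since they are likely informative tokens in their own right.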
Success Criteria
A production-ready solution should achieve macro-F1 >= 0.84 and recall >= 0.92 for safety_issue, and should sustain batch inference at a rate of at least 50,000 notes per hour. The system should also provide enough interpretability for operations teams to trust routing decisions.
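The two quality thresholds above can be checked with a small acceptance function. The sketch below computes per-class precision, recall, and F1 from parallel lists of true and predicted labels, then averages F1 over classes with equal weight (macro-F1). The function names per_class_scores and meets_criteria are hypothetical, and the code assumes single-label predictions.

```python
from collections import Counter


def per_class_scores(y_true, y_pred, labels):
    """Return {label: (precision, recall, f1)} from parallel label lists."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    scores = {}
    for c in labels:
        precision = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        recall = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[c] = (precision, recall, f1)
    return scores


def meets_criteria(y_true, y_pred, labels,
                   min_macro_f1=0.84, min_safety_recall=0.92):
    """Return (passed, macro_f1, safety_recall) against the stated thresholds."""
    scores = per_class_scores(y_true, y_pred, labels)
    # Macro-F1: unweighted mean of per-class F1, so rare classes count equally.
    macro_f1 = sum(f1 for _, _, f1 in scores.values()) / len(labels)
    safety_recall = scores["safety_issue"][1]
    passed = macro_f1 >= min_macro_f1 and safety_recall >= min_safety_recall
    return passed, macro_f1, safety_recall
```

Because macro-F1 weights every class equally, a model that does well on design_clarification but poorly on safety_issue is penalized more than it would be under accuracy or micro-averaged F1.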
Constraints
- Must run in AMD Construction Group's private cloud
- GPU budget is limited to a single A10 for training; production inference is CPU-first
- Retraining cadence is monthly
- Prediction latency target is under 150 ms per note for interactive use
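To relate the 150 ms latency target to the batch requirement of 50,000 notes per hour, a rough capacity calculation helps. 50,000 notes per hour is about 13.9 notes per second, which leaves a single sequential worker a budget of 72 ms per note. The sketch below assumes each worker handles one note at a time with no batching, which is a simplification; the function names are hypothetical.

```python
import math


def required_rate(notes_per_hour: int) -> float:
    """Notes per second needed to sustain the hourly volume."""
    return notes_per_hour / 3600


def workers_needed(notes_per_hour: int, latency_ms: float) -> int:
    """Workers needed if each processes one note at a time at the given latency.

    Assumes no batching and no queueing overhead (a simplification).
    """
    per_worker_rate = 1000.0 / latency_ms  # notes per second per worker
    return math.ceil(required_rate(notes_per_hour) / per_worker_rate)
```

Under these assumptions, a model that takes the full 150 ms per note would need three parallel CPU workers to reach 50,000 notes per hour, while a model at 50 ms per note would need one. Batched inference would typically improve on this estimate.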
Requirements
- Design an NLP pipeline that applies either static word embeddings or transformer language models to this intent classification task.
- Describe preprocessing for construction-domain text, including abbreviations, references, and noisy OCR fragments.
- Implement a modern Python solution and explain why your architecture fits the task.
- Show how you would train, validate, and evaluate the model under class imbalance.
- Explain trade-offs between embedding-based baselines and fine-tuned language models for AMD Construction Group SiteFlow.
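One common way to address the class-imbalance requirement is inverse-frequency class weighting in the loss, so that errors on safety_issue (9% of records) cost more than errors on design_clarification (31%). The sketch below derives such weights from the label distribution given in the Data section, using the formula w_c = 1 / (K * p_c), which matches the "balanced" heuristic used by scikit-learn. It is one option under stated assumptions, not a prescribed solution; the function name is hypothetical.

```python
# Label shares as stated in the Data section.
LABEL_SHARE = {
    "design_clarification": 0.31,
    "material_request": 0.24,
    "schedule_change": 0.18,
    "safety_issue": 0.09,
    "inspection_followup": 0.18,
}


def balanced_class_weights(share):
    """Inverse-frequency weights: w_c = 1 / (K * p_c) for K classes.

    Equivalent to n_samples / (n_classes * n_c) when p_c = n_c / n_samples.
    """
    k = len(share)
    return {label: 1.0 / (k * p) for label, p in share.items()}


weights = balanced_class_weights(LABEL_SHARE)
for label, w in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{label:22s} {w:.3f}")
```

With these shares, safety_issue receives a weight of roughly 2.22 and design_clarification roughly 0.65. The weights can be passed to a weighted cross-entropy loss for either an embedding-based baseline or a fine-tuned language model; whether weighting alone reaches the safety_issue recall target would need to be confirmed on a held-out validation split.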