Business Context
AMD Construction Group wants to automatically route incoming BuildOps service tickets to the correct work category so dispatchers can prioritize field teams faster. You need to design a practical text classification pipeline using tokenization and TF-IDF features rather than transformer fine-tuning.
Data
The dataset contains approximately 180,000 historical AMD Construction Group BuildOps tickets collected over 24 months. Each record includes a short title and a free-text description written by coordinators, site supervisors, or customers. Text is primarily English, often noisy, and includes abbreviations and trade jargon such as "RTU", "AHU", "RFI", and "punch list", along with other job-site shorthand.
- Classes: HVAC, Electrical, Plumbing, Concrete, Safety, General Inquiry
- Label distribution: moderately imbalanced; General Inquiry and HVAC are the largest classes, Safety is the smallest
- Text length: 5-180 words, median ~32 words
- Data quality issues: typos, duplicated tickets, all-caps text, equipment codes, phone numbers, and inconsistent punctuation
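The data quality issues listed above suggest a light normalization pass before vectorizing. The following is a minimal sketch, not a prescribed implementation: the function names, the `phonenum` placeholder, and the phone-number pattern (which assumes North American formats) are all assumptions made for illustration. Equipment codes and punctuation are deliberately left in place so that tokens such as "rtu-3" remain available as features.

```python
import re

# Assumed pattern for North American phone numbers; adjust for real ticket data.
PHONE_RE = re.compile(r"\(?\b\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\b")
WHITESPACE_RE = re.compile(r"\s+")


def normalize_ticket(title: str, description: str) -> str:
    """Merge title and description, mask phone numbers, lowercase, tidy spaces."""
    text = f"{title} {description}"
    # Replace phone numbers with a single placeholder token so they do not
    # create thousands of one-off features.
    text = PHONE_RE.sub(" phonenum ", text)
    # Lowercasing removes the all-caps vs. mixed-case distinction.
    text = text.lower()
    return WHITESPACE_RE.sub(" ", text).strip()


def drop_exact_duplicates(texts, labels):
    """Keep the first occurrence of each normalized text (exact match only)."""
    seen = set()
    kept_texts, kept_labels = [], []
    for text, label in zip(texts, labels):
        if text in seen:
            continue
        seen.add(text)
        kept_texts.append(text)
        kept_labels.append(label)
    return kept_texts, kept_labels
```

Deduplicating before the train/test split matters here: if copies of the same ticket land on both sides of the split, held-out scores will look better than they are.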
Success Criteria
A strong solution should achieve macro-F1 >= 0.82 and Safety recall >= 0.90 on a held-out test set, while keeping inference lightweight enough for near-real-time routing in AMD Construction Group BuildOps.
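To make the two thresholds concrete, here is a small dependency-free sketch of how macro-F1 and Safety recall could be checked on held-out predictions. The function names and the `critical_class` parameter are illustrative assumptions; in practice scikit-learn's `f1_score(average="macro")` and `recall_score` compute the same quantities.

```python
from collections import Counter


def per_class_scores(y_true, y_pred, classes):
    """Return precision, recall, and F1 for each class."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    scores = {}
    for c in classes:
        p_den, r_den = tp[c] + fp[c], tp[c] + fn[c]
        precision = tp[c] / p_den if p_den else 0.0
        recall = tp[c] / r_den if r_den else 0.0
        pr_sum = precision + recall
        f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
        scores[c] = {"precision": precision, "recall": recall, "f1": f1}
    return scores


def check_success_criteria(y_true, y_pred, classes, critical_class="Safety",
                           min_macro_f1=0.82, min_critical_recall=0.90):
    """Return (passed, macro_f1, critical_recall) for the stated thresholds."""
    scores = per_class_scores(y_true, y_pred, classes)
    # Macro-F1 is the unweighted mean of per-class F1, so small classes
    # such as Safety count as much as large ones.
    macro_f1 = sum(s["f1"] for s in scores.values()) / len(classes)
    critical_recall = scores[critical_class]["recall"]
    passed = macro_f1 >= min_macro_f1 and critical_recall >= min_critical_recall
    return passed, macro_f1, critical_recall
```

Because macro-F1 weights every class equally, a model can fail this criterion even with high overall accuracy if it performs poorly on the smallest class.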
Constraints
- Must run on CPU-only infrastructure
- End-to-end prediction latency should be under 50 ms per ticket when scoring in batches
- Solution should be explainable to operations managers
- Prefer sparse, maintainable models over large neural architectures
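The latency constraint can be verified with a simple timing harness. The sketch below assumes a hypothetical `predict_fn` that accepts a list of raw ticket texts and runs the full path (preprocessing, vectorizing, and classification); the batch size and repeat count are arbitrary defaults, not requirements from the brief.

```python
import time


def mean_latency_ms(predict_fn, texts, batch_size=256, repeats=3):
    """Best-of-N average per-ticket latency in milliseconds for batch scoring."""
    if not texts:
        raise ValueError("texts must not be empty")
    best_elapsed = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for i in range(0, len(texts), batch_size):
            predict_fn(texts[i:i + batch_size])
        elapsed = time.perf_counter() - start
        # Taking the best run reduces noise from unrelated CPU activity.
        best_elapsed = min(best_elapsed, elapsed)
    return best_elapsed * 1000.0 / len(texts)
```

Measurements should be taken on hardware comparable to the CPU-only production hosts, using tickets with a realistic length distribution rather than short synthetic strings.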
Requirements
- Build a multi-class text classifier using tokenization and TF-IDF.
- Define a realistic preprocessing pipeline for AMD Construction Group BuildOps ticket text.
- Train at least one linear baseline model and justify your choice.
- Evaluate class-level performance, especially for Safety tickets.
- Show how you would inspect important TF-IDF features and common failure modes.
- Provide production-ready Python code for preprocessing, training, and evaluation.