Business Context
HelpDeskPro wants to reduce manual labeling time for incoming support tickets before training downstream classifiers for routing and prioritization. Instead of annotating tickets one by one, the team wants to group semantically similar tickets so annotators can label batches with the same issue type.
Data
- Volume: 180,000 historical support tickets and ~8,000 new tickets per day
- Text length: 8-250 words per ticket, median 42 words
- Language: English only
- Label distribution: 12 issue categories, highly imbalanced; the top 3 categories account for ~65% of tickets
- Text quality: noisy user-generated text with typos, URLs, order IDs, device names, and repeated boilerplate signatures
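The noise patterns listed above (URLs, order IDs, boilerplate signatures) suggest a placeholder-based normalization pass before embedding. A minimal sketch follows; the regex patterns and the `normalize_ticket` name are illustrative assumptions, not HelpDeskPro's actual token formats:

```python
import re

# Illustrative patterns for the noise described in the Data section.
URL_RE = re.compile(r"https?://\S+")
ORDER_ID_RE = re.compile(r"(?:\bORD|#)\d{5,}\b", re.IGNORECASE)
SIGNATURE_RE = re.compile(r"(?:sent from my \w+|best regards\b.*)\s*$",
                          re.IGNORECASE | re.DOTALL)

def normalize_ticket(text: str) -> str:
    """Replace volatile tokens with placeholders, strip boilerplate signatures,
    and collapse whitespace."""
    text = URL_RE.sub("<URL>", text)
    text = ORDER_ID_RE.sub("<ORDER_ID>", text)
    text = SIGNATURE_RE.sub("", text)
    return " ".join(text.split())

print(normalize_ticket(
    "My order #123456 failed, see https://x.co/a Sent from my iPhone"
))  # My order <ORDER_ID> failed, see <URL>
```

Replacing IDs and URLs with shared placeholders (rather than deleting them) keeps the sentence structure intact while preventing the embedder from treating each unique ID as a distinct token.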
Success Criteria
A good solution should reduce annotation effort by creating coherent groups that annotators can label in batches, while preserving enough separation between categories to avoid large mixed clusters. Assume the target is a 30% reduction in annotation time, together with cluster purity high enough that annotators accept most suggested batch labels with minimal correction.
Constraints
- Daily clustering pipeline must finish in under 20 minutes
- The approach must run on a single GPU or CPU-only fallback
- Clusters should be explainable enough for annotation tooling
- New tickets should be assignable to existing clusters without full retraining
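The last constraint, assigning new tickets without full retraining, can be satisfied by keeping the centroids from the most recent full clustering run and mapping each new ticket's embedding to its nearest centroid. A sketch under that assumption (function names and the 2-D toy vectors are illustrative; production code would use vectorized operations on real embedding dimensions):

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def assign_to_clusters(embeddings, centroids):
    """Return, for each embedding, the index of its most similar centroid."""
    return [max(range(len(centroids)), key=lambda i: cosine(e, centroids[i]))
            for e in embeddings]

# Toy example: two saved centroids, two incoming ticket embeddings.
centroids = [[1.0, 0.0], [0.0, 1.0]]
new_tickets = [[0.9, 0.1], [0.2, 0.8]]
print(assign_to_clusters(new_tickets, centroids))  # [0, 1]
```

A similarity threshold on the winning centroid can route low-confidence assignments to a holding queue for the next full run instead of forcing them into an existing cluster.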
Requirements
- Build a semantic grouping pipeline using vector embeddings for ticket text.
- Describe preprocessing for noisy support text, including normalization of IDs, URLs, and signatures.
- Generate embeddings, cluster similar tickets, and surface representative examples per cluster.
- Explain how you would choose the number of clusters or use a density-based method.
- Propose how annotators would review, relabel, split, or merge clusters in the workflow.
- Evaluate cluster quality and estimate annotation speedup versus random sampling.
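For the evaluation requirement, one common quality measure that matches the success criteria is cluster purity on a labeled audit sample: the share of tickets whose cluster's majority label matches their own label. A self-contained sketch, assuming cluster assignments and audit labels are available as parallel lists:

```python
from collections import Counter

def cluster_purity(cluster_ids, labels):
    """Purity over a labeled audit sample: fraction of items belonging to
    their cluster's majority label. 1.0 means every cluster is single-label."""
    by_cluster = {}
    for cid, label in zip(cluster_ids, labels):
        by_cluster.setdefault(cid, []).append(label)
    majority_hits = sum(Counter(ys).most_common(1)[0][1]
                        for ys in by_cluster.values())
    return majority_hits / len(labels)

# Toy audit sample: cluster 0 is mostly "billing", cluster 1 is pure "refund".
print(cluster_purity(
    [0, 0, 0, 1, 1],
    ["billing", "billing", "login", "refund", "refund"],
))  # 0.8
```

Purity computed this way also bounds the annotator correction rate: a purity of 0.8 means batch-accepting each cluster's majority label would mislabel about 20% of the audited tickets, which can be compared directly against the per-ticket cost of random-sampling annotation.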