Business Context
ShopFlow, an e-commerce operations platform, wants to improve automatic routing of customer support tickets to the correct team. The NLP team needs a practical comparison of classical word-level tokenization with Word2Vec embeddings versus modern subword tokenization with BERT embeddings for downstream classification.
Data
- Volume: 420,000 historical support tickets collected over 18 months
- Text length: 8-220 words per ticket, median 46 words
- Language: English only
- Labels: 5 routing classes — Billing (22%), Shipping (28%), Returns (18%), Account Access (14%), Product Issue (18%)
- Text characteristics: informal grammar, typos, order IDs, URLs, SKU codes, and repeated template phrases
Success Criteria
A good solution should clearly explain how tokenization affects vocabulary coverage, out-of-vocabulary handling, and downstream embedding quality. The final classifier should achieve macro-F1 >= 0.84, with per-class recall >= 0.80 for Billing and Account Access because misrouting these tickets creates SLA risk.
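A minimal sketch of this acceptance gate, assuming scikit-learn and that `y_true`/`y_pred` are lists of the routing-class names above; any of the candidate classifiers can feed it:

```python
from sklearn.metrics import f1_score, recall_score

LABELS = ["Billing", "Shipping", "Returns", "Account Access", "Product Issue"]
SLA_CLASSES = ["Billing", "Account Access"]  # misrouting these carries SLA risk

def passes_success_criteria(y_true, y_pred):
    # Macro-F1 across all five classes, plus per-class recall for the SLA classes.
    macro_f1 = f1_score(y_true, y_pred, labels=LABELS, average="macro")
    recalls = recall_score(y_true, y_pred, labels=LABELS, average=None)
    per_class = dict(zip(LABELS, recalls))
    ok = macro_f1 >= 0.84 and all(per_class[c] >= 0.80 for c in SLA_CLASSES)
    return ok, macro_f1, per_class
```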
Constraints
- Inference latency must stay below 120 ms per ticket in production (latency probe sketched after this list)
- Training must run on a single GPU with 16 GB VRAM
- The solution must support weekly retraining and easy monitoring of vocabulary drift (drift-rate sketch after this list)
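For the latency constraint, a hypothetical timing probe; `predict_fn` stands in for whichever trained classifier is being profiled, and gating on p95 is an assumption:

```python
import statistics
import time

def latency_profile(predict_fn, tickets, budget_ms=120.0):
    # Wall-clock latency per ticket, reported as p50/p95 against the budget.
    times_ms = []
    for text in tickets:
        start = time.perf_counter()
        predict_fn(text)
        times_ms.append((time.perf_counter() - start) * 1000.0)
    times_ms.sort()
    p95 = times_ms[int(0.95 * (len(times_ms) - 1))]
    print(f"p50={statistics.median(times_ms):.1f} ms  p95={p95:.1f} ms  "
          f"{'OK' if p95 <= budget_ms else 'OVER BUDGET'}")
```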
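For vocabulary-drift monitoring, a sketch that tracks the out-of-vocabulary rate of each weekly batch against the token set from the last retraining; the 5% alert threshold is an assumption to tune against historical weeks:

```python
from collections import Counter

def oov_rate(weekly_tokens, train_vocab):
    # Fraction of this week's token occurrences unseen at the last retraining.
    counts = Counter(weekly_tokens)
    total = sum(counts.values())
    unseen = sum(n for tok, n in counts.items() if tok not in train_vocab)
    return unseen / max(total, 1)

# Hypothetical alert rule: if oov_rate(tokens, vocab) > 0.05, retrain early.
```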
Requirements
- Implement a preprocessing pipeline for noisy support-ticket text (normalization sketch after this list).
- Compare whitespace/word-level tokenization with BERT subword tokenization (side-by-side sketch below).
- Train one baseline model using Word2Vec-based document embeddings (baseline sketch below).
- Fine-tune one transformer classifier using BERT embeddings (fine-tuning sketch below).
- Evaluate both approaches and explain trade-offs in accuracy, robustness to typos, and deployment cost (typo probe below; the metric gate is sketched under Success Criteria).
- Show how tokenization decisions affect unknown terms such as new product names and order codes (unknown-term sketch below).
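For the preprocessing requirement, a minimal normalization sketch. The order-ID and SKU regexes are assumptions about ShopFlow's formats and must be matched to real data, but the idea of mapping high-cardinality strings to placeholder tokens keeps the word-level vocabulary tractable:

```python
import re

URL_RE   = re.compile(r"https?://\S+")
ORDER_RE = re.compile(r"\b(?:ord|order)[-#]?\d{6,}\b", re.IGNORECASE)  # assumed format
SKU_RE   = re.compile(r"\b[A-Z]{2,4}-\d{3,6}\b")                       # assumed format

def normalize_ticket(text: str) -> str:
    # Replace high-cardinality spans with placeholders, then lowercase
    # and collapse whitespace.
    text = URL_RE.sub(" <url> ", text)
    text = ORDER_RE.sub(" <order_id> ", text)
    text = SKU_RE.sub(" <sku> ", text)
    text = text.lower()
    return re.sub(r"\s+", " ", text).strip()

print(normalize_ticket("Order ORD-1234567 for AB-4412 stuck, see https://x.co/t"))
# -> "order <order_id> for <sku> stuck, see <url>"
```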
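For the tokenization comparison, a side-by-side sketch assuming the `transformers` package and the stock `bert-base-uncased` vocabulary (not a ShopFlow-specific checkpoint):

```python
from transformers import AutoTokenizer

bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
ticket = "refund for ord-1234567, tracking says delievered but nothing arrived"

print("word-level:", ticket.split())            # typo and order code stay opaque types
print("subword:   ", bert_tok.tokenize(ticket))  # both break into known pieces
```

Whitespace splitting treats the typo "delievered" and the order code as unseen whole tokens, while WordPiece decomposes them into in-vocabulary pieces; that coverage difference is what the report should quantify.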
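For the Word2Vec baseline, a sketch assuming `gensim` and scikit-learn. The two toy tickets only make it runnable end to end; `min_count` and `vector_size` would need tuning on the real corpus:

```python
import numpy as np
from gensim.models import Word2Vec
from sklearn.linear_model import LogisticRegression

# Stand-ins for the output of the normalization step.
docs = [["refund", "not", "arrived"], ["cannot", "log", "into", "account"]]
labels = ["Billing", "Account Access"]

# min_count=1 only for the toy corpus; raise it on 420k real tickets.
w2v = Word2Vec(sentences=docs, vector_size=100, window=5, min_count=1, epochs=10)

def doc_vector(tokens, model):
    # Mean-pool word vectors into a document embedding; skip OOV tokens.
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

X = np.vstack([doc_vector(d, w2v) for d in docs])
clf = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X, labels)
```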
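For the transformer classifier, a fine-tuning sketch with Hugging Face `transformers`. The toy inputs stand in for the preprocessed corpus, and `fp16=True` assumes the single 16 GB GPU from the constraints:

```python
import torch
from torch.utils.data import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=5)  # five routing classes

# Toy stand-ins; indices follow the label order in the Data section
# (0 = Billing, ..., 3 = Account Access).
texts = ["refund not arrived for ord-1234567", "cannot log into my account"]
label_ids = [0, 3]

class TicketDataset(Dataset):
    def __init__(self, texts, label_ids):
        # Median ticket is 46 words, so 128 subword positions is a safe cap.
        self.enc = tok(texts, truncation=True, max_length=128, padding="max_length")
        self.labels = label_ids
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item

args = TrainingArguments(output_dir="ticket-bert", num_train_epochs=3,
                         per_device_train_batch_size=32,
                         fp16=True)  # assumes the single 16 GB GPU
Trainer(model=model, args=args, train_dataset=TicketDataset(texts, label_ids)).train()
```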
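For the robustness-to-typos evaluation, a hypothetical perturbation probe: inject one random adjacent-character swap per ticket and measure the accuracy drop. `predict_fn` is any trained classifier wrapped as a one-string callable, and the single-swap noise model is an assumption:

```python
import random

def swap_typo(text, rng):
    # One adjacent-character swap, a crude stand-in for real typo patterns.
    if len(text) < 3:
        return text
    i = rng.randrange(len(text) - 1)
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]

def typo_accuracy_drop(predict_fn, tickets, labels, seed=0):
    # Accuracy on clean text minus accuracy on perturbed text.
    rng = random.Random(seed)
    clean = sum(predict_fn(t) == y for t, y in zip(tickets, labels))
    noisy = sum(predict_fn(swap_typo(t, rng)) == y for t, y in zip(tickets, labels))
    return (clean - noisy) / len(tickets)
```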
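For the unknown-term requirement, a sketch contrasting a frozen word-level vocabulary with subword tokenization on a hypothetical post-launch product name and order code:

```python
from transformers import AutoTokenizer

bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
word_vocab = {"refund", "order", "arrived", "account"}  # stand-in frozen vocabulary

for term in ["glowmug", "ord-9983127"]:  # hypothetical new product name / order code
    # Word-level: any unseen type collapses to a single <unk> symbol.
    word_view = term if term in word_vocab else "<unk>"
    print(f"{term!r}: word-level -> {word_view}, subword -> {bert_tok.tokenize(term)}")
```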