Business Context
LexiDesk, a SaaS customer support platform, is rebuilding its internal NLP pipeline for ticket routing and semantic search. Before training downstream models, the team wants to validate that the tokenization strategy preserves meaning, handles noisy text, and supports modern transformer models.
Data
- Volume: 850,000 historical support tickets and chat transcripts
- Text length: 5-300 words per message, median 42 words
- Language: English only for v1
- Text characteristics: typos, URLs, order IDs, emojis, contractions, product names, and mixed casing (synthetic examples follow this list)
- Labels available for downstream validation: 12 ticket categories with moderate class imbalance (largest class 24%, smallest 3%)
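For concreteness, a few invented ticket texts exhibiting these characteristics; none are real LexiDesk data, and the SKU and URL formats are assumptions.

```python
# Hypothetical tickets illustrating the noise profile above: typos, URLs,
# order IDs, emojis, contractions, product names, and mixed casing.
# All strings, IDs, and URLs are invented for illustration.
SAMPLE_TICKETS = [
    "my order LX-48213 still hasnt shipped?? its been 2 weeks 😤",
    "Cant log in at https://app.lexidesk.io/portal - the reset email never arrives",
    "Hi, I'd like to upgrade from StarterPlan to PRO plan. whats the price diff?",
    "SKU LD-RT-0094 arrived damaged, pls advise on returns",
]
```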
Success Criteria
A good solution should explain what tokenization is and why it matters for NLP, then implement a practical tokenizer pipeline that improves downstream classification quality without violating the latency constraints below. The final approach should support transformer fine-tuning and keep preprocessing consistent between training and inference.
Constraints
- Inference latency for preprocessing + model scoring must stay under 120 ms per ticket
- Solution must run on a single CPU service for preprocessing and one T4 GPU for model inference
- Avoid brittle rule-based token splitting that breaks product SKUs or URLs (see the sketch after this list, which also times tokenization against the latency budget)
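To make the last constraint concrete, here is a minimal sketch contrasting naive regex splitting, which shreds SKUs and URLs, with subword tokenization, plus a rough timing loop against the 120 ms budget. The SKU, the URL, and the `distilbert-base-uncased` checkpoint are illustrative assumptions, not mandated choices.

```python
import re
import time

from transformers import AutoTokenizer

text = "Order LX-48213 failed at https://app.lexidesk.io/checkout"

# Naive \w+ splitting breaks the SKU and URL into disconnected fragments:
# ['Order', 'LX', '48213', 'failed', 'at', 'https', 'app', 'lexidesk', 'io', 'checkout']
print(re.findall(r"\w+", text))

# A subword tokenizer keeps every character recoverable: unfamiliar strings
# decompose into known subword pieces instead of being mangled or dropped.
tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")
print(tok.tokenize(text))

# Rough check of tokenization cost against the 120 ms preprocessing budget.
n = 1_000
start = time.perf_counter()
for _ in range(n):
    tok(text, truncation=True, max_length=128)
elapsed_ms = (time.perf_counter() - start) * 1_000
print(f"~{elapsed_ms / n:.3f} ms per ticket (tokenization only)")
```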
Requirements
- Define tokenization clearly and explain why it is important in NLP systems.
- Compare at least two tokenization approaches, such as whitespace or regex splitting versus subword tokenization (first sketch after this list).
- Build a preprocessing pipeline for noisy support text using modern Python tooling (second sketch after this list).
- Fine-tune a lightweight transformer classifier using the selected tokenizer (third sketch after this list).
- Evaluate how tokenization choices affect downstream performance, vocabulary coverage, and failure cases (fourth sketch after this list).
- Describe trade-offs around OOV handling, sequence length, latency, and maintainability.
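First sketch: the comparison called for in the first two requirements. A fixed whitespace vocabulary collapses typos into a single unknown symbol, while a subword tokenizer decomposes them into known pieces. The toy vocabulary, the misspelled ticket, and the `distilbert-base-uncased` checkpoint are illustrative assumptions.

```python
from transformers import AutoTokenizer

ticket = "please cancel my subscirption asap"

# Whitespace splitting against a fixed vocabulary: the typo maps to <unk>,
# erasing any signal that "subscription" was intended.
vocab = {"please", "cancel", "my", "subscription", "asap"}
print([t if t in vocab else "<unk>" for t in ticket.split()])

# Subword tokenization: the typo decomposes into smaller known units, so a
# downstream model still receives partially meaningful input.
tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")
print(tok.tokenize(ticket))
```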
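Second sketch: one plausible normalization pass for the preprocessing pipeline. The placeholder tokens and the `ORDER_ID_RE` pattern are assumptions about LexiDesk's data and should be tuned against real tickets; running the same function at training and serving time is what keeps preprocessing consistent.

```python
import re

URL_RE = re.compile(r"https?://\S+")
ORDER_ID_RE = re.compile(r"\b(?:LX|LD)(?:-[A-Z0-9]+)+\b")  # assumed ID/SKU shape
WS_RE = re.compile(r"\s+")

def normalize(text: str) -> str:
    """Replace high-cardinality spans with placeholders the model can learn."""
    text = URL_RE.sub("<url>", text)  # URLs matter as a category, not verbatim
    text = ORDER_ID_RE.sub("<order_id>", text)
    return WS_RE.sub(" ", text).strip()

print(normalize("order LX-48213 stuck, see  https://app.lexidesk.io/t/991"))
# order <order_id> stuck, see <url>
```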
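Third sketch: fine-tuning a lightweight transformer with the Hugging Face Trainer. The two-row in-memory dataset is a stand-in for the normalized 850,000-ticket corpus, and the checkpoint and hyperparameters are starting-point assumptions, not tuned values.

```python
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
)

checkpoint = "distilbert-base-uncased"  # assumed lightweight baseline for a T4
tok = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=12)

# Placeholder rows; real training streams the normalized ticket corpus.
raw = Dataset.from_dict({
    "text": ["cant login at <url>", "refund for <order_id> please"],
    "label": [3, 7],
})

def encode(batch):
    # max_length=128 bounds sequence length (and latency); the median ticket
    # is 42 words, so truncation rarely loses content.
    return tok(batch["text"], truncation=True, max_length=128)

train_ds = raw.map(encode, batched=True)

args = TrainingArguments(
    output_dir="ticket-router",
    num_train_epochs=3,
    per_device_train_batch_size=32,
    learning_rate=2e-5,
)

Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    data_collator=DataCollatorWithPadding(tok),  # pads each batch dynamically
).train()
```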
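Fourth sketch: the evaluation axes from the last two requirements. Macro-F1 weights the 3% class as heavily as the 24% class, the UNK rate is a proxy for vocabulary coverage, and the token-length percentiles feed the sequence-length and latency trade-off. The labels below are placeholders for predictions on a held-out split.

```python
import numpy as np
from sklearn.metrics import f1_score
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")  # assumed checkpoint

def unk_rate(texts):
    """Fraction of token ids the tokenizer mapped to [UNK] (OOV proxy)."""
    ids = [i for t in texts for i in tok(t)["input_ids"]]
    return sum(i == tok.unk_token_id for i in ids) / max(len(ids), 1)

def length_percentiles(texts):
    """Median and tail token lengths, which drive padding cost and latency."""
    lengths = [len(tok(t)["input_ids"]) for t in texts]
    return np.percentile(lengths, [50, 95, 99])

# Predictions would come from the fine-tuned classifier on a held-out split;
# these labels are placeholders to show the metric call.
y_true, y_pred = [0, 1, 2, 1], [0, 1, 1, 1]
print(f1_score(y_true, y_pred, average="macro"))
```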