Business Context
Rang Technologies is building an NLP pipeline for classifying user queries submitted through Rang Search and internal support surfaces. Before training downstream models, the team needs a robust tokenization strategy that preserves meaning in short, noisy text while remaining compatible with modern transformer models.
Data Characteristics
- Volume: ~2.5M historical queries and support utterances
- Text length: 2-80 words, median 11 words
- Language: Primarily English, with ~7% mixed-language or code-switched text
- Input quality: Misspellings, emojis, URLs, product names, IDs, punctuation-heavy text
- Labels: Downstream intent classes are moderately imbalanced, with the top 5 intents covering ~68% of traffic
Success Criteria
A good solution should produce tokenized text that improves downstream intent classification quality, handles domain-specific vocabulary such as Rang product names and ticket IDs, and keeps inference latency under 50 ms per query. The approach should also be easy to maintain as the vocabulary evolves.
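One way to preserve domain-specific vocabulary is to mask it before any lossy normalization. The sketch below protects ticket IDs with placeholder tokens; the `RNG-\d+` pattern is an assumed format for illustration, and the real pattern would come from Rang's ticketing system:

```python
import re

# Assumed ticket-ID format (e.g. "RNG-88231") -- a placeholder pattern,
# not the real Rang schema.
TICKET_ID = re.compile(r"\bRNG-\d+\b", re.IGNORECASE)

def protect_ticket_ids(text: str):
    """Replace each ticket ID with a placeholder and return (masked_text, mapping).

    The mapping lets the pipeline restore or specially embed the IDs later,
    so normalization and tokenization cannot fragment or delete them.
    """
    mapping = {}

    def _sub(match):
        key = f"__TICKET{len(mapping)}__"
        mapping[key] = match.group(0)
        return key

    return TICKET_ID.sub(_sub, text), mapping

masked, ids = protect_ticket_ids("Status of RNG-88231 after upgrade?")
```

After masking, `masked` contains `__TICKET0__` in place of the raw ID and `ids` maps the placeholder back to `RNG-88231`, so downstream lowercasing or subword splitting cannot destroy the entity signal.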
Constraints
- Must run in Rang Technologies' Python-based inference stack
- Should support transformer fine-tuning without custom C++ dependencies
- Must avoid destructive normalization that removes useful product or entity signals
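As a minimal sketch of the non-destructive normalization constraint, the function below lowercases and collapses whitespace but replaces URLs with a `[URL]` marker rather than deleting them, and deliberately leaves punctuation intact since `?` and `!` can carry intent signal in short queries. The `[URL]` marker name is an assumption, not a fixed convention:

```python
import re

URL = re.compile(r"https?://\S+")

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace, but keep URL positions as a
    single [URL] marker instead of stripping them (non-destructive).

    Punctuation is intentionally preserved: in short queries it often
    carries signal the downstream intent model can use.
    """
    text = re.sub(r"\s+", " ", text).strip().lower()
    return URL.sub("[URL]", text)
```

For example, `normalize("Check  https://rang.example/docs NOW")` yields `"check [URL] now"`: the URL's presence survives as a token even though its raw characters are gone, and the mapping could be retained separately if the raw URL is needed later.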
Requirements
- Explain how you would tokenize short user text for an NLP task in Rang Search.
- Compare basic word tokenization with subword tokenization and state when each is appropriate.
- Build a preprocessing pipeline for noisy query text, including handling URLs, casing, emojis, and domain-specific terms.
- Implement tokenization in Python using a modern library compatible with transformer models.
- Show how tokenized outputs feed into a downstream intent classification model.
- Describe how you would validate that the tokenization strategy is helping rather than hurting model performance.
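The core of the requirements above can be sketched in pure Python: a greedy longest-match-first subword tokenizer (WordPiece-style) over a toy vocabulary, contrasted with plain word splitting, with token IDs as the input a downstream intent classifier would consume. The vocabulary and ID assignments here are illustrative assumptions; in practice a library such as Hugging Face `tokenizers` would replace this sketch:

```python
# Toy subword vocabulary (assumed for illustration); "##" marks
# word-internal continuation pieces, as in WordPiece.
VOCAB = {"[UNK]": 0, "rang": 1, "##search": 2, "token": 3,
         "##ize": 4, "##ization": 5}

def wordpiece(word: str, vocab: dict, unk: str = "[UNK]") -> list:
    """Greedy longest-match-first subword split of a single word."""
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub  # continuation piece
            if sub in vocab:
                piece = sub
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: whole word is unknown
        tokens.append(piece)
        start = end
    return tokens

def encode(tokens: list, vocab: dict) -> list:
    """Map tokens to the integer IDs a classifier's embedding layer consumes."""
    return [vocab.get(t, vocab["[UNK]"]) for t in tokens]

# Word tokenization would leave "rangsearch" as one out-of-vocabulary token;
# subword tokenization decomposes it into known pieces instead.
pieces = wordpiece("rangsearch", VOCAB)   # ["rang", "##search"]
ids = encode(pieces, VOCAB)               # [1, 2]
```

This illustrates the trade-off the comparison requirement asks for: word tokenization is simple and fast but maps every unseen or misspelled form to `[UNK]`, while subword tokenization recovers partial meaning from novel words, which matters for noisy 2-80-word queries. The `ids` list is exactly what would be padded, batched, and fed into a transformer intent classifier.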