Business Context
ShopFlow, an e-commerce operations platform, wants to improve automatic routing of customer support tickets to the correct team. The NLP team needs a practical comparison of classical word-level tokenization with Word2Vec embeddings versus modern subword tokenization with BERT embeddings for downstream classification.
Data
- Volume: 420,000 historical support tickets collected over 18 months
- Text length: 8-220 words per ticket, median 46 words
- Language: English only
- Labels: 5 routing classes — Billing (22%), Shipping (28%), Returns (18%), Account Access (14%), Product Issue (18%)
- Text characteristics: informal grammar, typos, order IDs, URLs, SKU codes, and repeated template phrases
Success Criteria
A good solution should clearly explain how tokenization affects vocabulary coverage, out-of-vocabulary handling, and downstream embedding quality. The final classifier should achieve macro-F1 >= 0.84, with per-class recall >= 0.80 for Billing and Account Access because misrouting these tickets creates SLA risk.
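A minimal sketch of this acceptance gate, assuming scikit-learn and that `y_true`/`y_pred` are lists of the routing-class names above; any of the candidate classifiers can feed it:

```python
from sklearn.metrics import f1_score, recall_score

LABELS = ["Billing", "Shipping", "Returns", "Account Access", "Product Issue"]
SLA_CLASSES = ["Billing", "Account Access"]  # misrouting these carries SLA risk

def passes_success_criteria(y_true, y_pred):
    # Macro-F1 across all five classes, plus per-class recall for the SLA classes.
    macro_f1 = f1_score(y_true, y_pred, labels=LABELS, average="macro")
    recalls = recall_score(y_true, y_pred, labels=LABELS, average=None)
    per_class = dict(zip(LABELS, recalls))
    ok = macro_f1 >= 0.84 and all(per_class[c] >= 0.80 for c in SLA_CLASSES)
    return ok, macro_f1, per_class
```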
Constraints
- Inference latency must stay below 120 ms per ticket in production (latency probe sketched after this list)
- Training must run on a single GPU with 16 GB VRAM
- The solution must support weekly retraining and easy monitoring of vocabulary drift (drift-rate sketch after this list)
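For the latency constraint, a hypothetical timing probe; `predict_fn` stands in for whichever trained classifier is being profiled, and gating on p95 is an assumption:

```python
import statistics
import time

def latency_profile(predict_fn, tickets, budget_ms=120.0):
    # Wall-clock latency per ticket, reported as p50/p95 against the budget.
    times_ms = []
    for text in tickets:
        start = time.perf_counter()
        predict_fn(text)
        times_ms.append((time.perf_counter() - start) * 1000.0)
    times_ms.sort()
    p95 = times_ms[int(0.95 * (len(times_ms) - 1))]
    print(f"p50={statistics.median(times_ms):.1f} ms  p95={p95:.1f} ms  "
          f"{'OK' if p95 <= budget_ms else 'OVER BUDGET'}")
```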
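For vocabulary-drift monitoring, a sketch that tracks the out-of-vocabulary rate of each weekly batch against the token set from the last retraining; the 5% alert threshold is an assumption to tune against historical weeks:

```python
from collections import Counter

def oov_rate(weekly_tokens, train_vocab):
    # Fraction of this week's token occurrences unseen at the last retraining.
    counts = Counter(weekly_tokens)
    total = sum(counts.values())
    unseen = sum(n for tok, n in counts.items() if tok not in train_vocab)
    return unseen / max(total, 1)

# Hypothetical alert rule: if oov_rate(tokens, vocab) > 0.05, retrain early.
```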
Requirements
- Implement a preprocessing pipeline for noisy support-ticket text (normalization sketch after this list).
- Compare whitespace/word-level tokenization with BERT subword tokenization (side-by-side sketch below).
- Train one baseline model using Word2Vec-based document embeddings (baseline sketch below).
- Fine-tune one transformer classifier using BERT embeddings (fine-tuning sketch below).
- Evaluate both approaches and explain trade-offs in accuracy, robustness to typos, and deployment cost (typo probe below; the metric gate is sketched under Success Criteria).
- Show how tokenization decisions affect unknown terms such as new product names and order codes (unknown-term sketch below).
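For the preprocessing requirement, a minimal normalization sketch. The order-ID and SKU regexes are assumptions about ShopFlow's formats and must be matched to real data, but the idea of mapping high-cardinality strings to placeholder tokens keeps the word-level vocabulary tractable:

```python
import re

URL_RE   = re.compile(r"https?://\S+")
ORDER_RE = re.compile(r"\b(?:ord|order)[-#]?\d{6,}\b", re.IGNORECASE)  # assumed format
SKU_RE   = re.compile(r"\b[A-Z]{2,4}-\d{3,6}\b")                       # assumed format

def normalize_ticket(text: str) -> str:
    # Replace high-cardinality spans with placeholders, then lowercase
    # and collapse whitespace.
    text = URL_RE.sub(" <url> ", text)
    text = ORDER_RE.sub(" <order_id> ", text)
    text = SKU_RE.sub(" <sku> ", text)
    text = text.lower()
    return re.sub(r"\s+", " ", text).strip()

print(normalize_ticket("Order ORD-1234567 for AB-4412 stuck, see https://x.co/t"))
# -> "order <order_id> for <sku> stuck, see <url>"
```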
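For the tokenization comparison, a side-by-side sketch assuming the `transformers` package and the stock `bert-base-uncased` vocabulary (not a ShopFlow-specific checkpoint):

```python
from transformers import AutoTokenizer

bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
ticket = "refund for ord-1234567, tracking says delievered but nothing arrived"

print("word-level:", ticket.split())            # typo and order code stay opaque types
print("subword:   ", bert_tok.tokenize(ticket))  # both break into known pieces
```

Whitespace splitting treats the typo "delievered" and the order code as unseen whole tokens, while WordPiece decomposes them into in-vocabulary pieces; that coverage difference is what the report should quantify.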
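For the Word2Vec baseline, a sketch assuming `gensim` and scikit-learn. The two toy tickets only make it runnable end to end; `min_count` and `vector_size` would need tuning on the real corpus:

```python
import numpy as np
from gensim.models import Word2Vec
from sklearn.linear_model import LogisticRegression

# Stand-ins for the output of the normalization step.
docs = [["refund", "not", "arrived"], ["cannot", "log", "into", "account"]]
labels = ["Billing", "Account Access"]

# min_count=1 only for the toy corpus; raise it on 420k real tickets.
w2v = Word2Vec(sentences=docs, vector_size=100, window=5, min_count=1, epochs=10)

def doc_vector(tokens, model):
    # Mean-pool word vectors into a document embedding; skip OOV tokens.
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

X = np.vstack([doc_vector(d, w2v) for d in docs])
clf = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X, labels)
```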
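For the transformer classifier, a fine-tuning sketch with Hugging Face `transformers`. The toy inputs stand in for the preprocessed corpus, and `fp16=True` assumes the single 16 GB GPU from the constraints:

```python
import torch
from torch.utils.data import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=5)  # five routing classes

# Toy stand-ins; indices follow the label order in the Data section
# (0 = Billing, ..., 3 = Account Access).
texts = ["refund not arrived for ord-1234567", "cannot log into my account"]
label_ids = [0, 3]

class TicketDataset(Dataset):
    def __init__(self, texts, label_ids):
        # Median ticket is 46 words, so 128 subword positions is a safe cap.
        self.enc = tok(texts, truncation=True, max_length=128, padding="max_length")
        self.labels = label_ids
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item

args = TrainingArguments(output_dir="ticket-bert", num_train_epochs=3,
                         per_device_train_batch_size=32,
                         fp16=True)  # assumes the single 16 GB GPU
Trainer(model=model, args=args, train_dataset=TicketDataset(texts, label_ids)).train()
```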
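For the robustness-to-typos evaluation, a hypothetical perturbation probe: inject one random adjacent-character swap per ticket and measure the accuracy drop. `predict_fn` is any trained classifier wrapped as a one-string callable, and the single-swap noise model is an assumption:

```python
import random

def swap_typo(text, rng):
    # One adjacent-character swap, a crude stand-in for real typo patterns.
    if len(text) < 3:
        return text
    i = rng.randrange(len(text) - 1)
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]

def typo_accuracy_drop(predict_fn, tickets, labels, seed=0):
    # Accuracy on clean text minus accuracy on perturbed text.
    rng = random.Random(seed)
    clean = sum(predict_fn(t) == y for t, y in zip(tickets, labels))
    noisy = sum(predict_fn(swap_typo(t, rng)) == y for t, y in zip(tickets, labels))
    return (clean - noisy) / len(tickets)
```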
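For the unknown-term requirement, a sketch contrasting a frozen word-level vocabulary with subword tokenization on a hypothetical post-launch product name and order code:

```python
from transformers import AutoTokenizer

bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
word_vocab = {"refund", "order", "arrived", "account"}  # stand-in frozen vocabulary

for term in ["glowmug", "ord-9983127"]:  # hypothetical new product name / order code
    # Word-level: any unseen type collapses to a single <unk> symbol.
    word_view = term if term in word_vocab else "<unk>"
    print(f"{term!r}: word-level -> {word_view}, subword -> {bert_tok.tokenize(term)}")
```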