Business Context
ShopFlow, an e-commerce platform, receives thousands of customer support tickets each day. The operations team wants a lightweight text classification system that routes tickets to the correct queue without using large transformer models.
Data
You are given 420,000 historical support tickets, each labeled with one of five categories: Refund, Shipping Issue, Account Access, Product Defect, and Other.
- Text source: ticket subject + message body
- Text length: 5 to 350 words (median 42 words)
- Language: English only
- Label distribution: Refund 28%, Shipping Issue 24%, Account Access 18%, Product Defect 15%, Other 15%
- Noise: HTML fragments, order IDs, URLs, repeated punctuation, misspellings, and copied email signatures
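The noise sources listed above could be handled with a small normalization step before vectorization. The following is a minimal sketch using only the Python standard library. The function names, the placeholder tokens (`urltoken`, `orderidtoken`), the order-ID pattern (six or more digits, optionally prefixed with `#`), and the signature markers are all assumptions made for illustration; the real formats would need to be confirmed against the ticket data. Misspellings are not addressed here.

```python
import html
import re

# All patterns below are illustrative assumptions, not confirmed ShopFlow formats.
TAG_RE = re.compile(r"<[^>]+>")                   # HTML fragments such as <p> or <br/>
URL_RE = re.compile(r"(?:https?://|www\.)\S+")    # URLs
ORDER_RE = re.compile(r"#?\b\d{6,}\b")            # assumed order-ID shape: 6+ digits
PUNCT_RE = re.compile(r"([!?.])\1+")              # repeated punctuation like "!!!"
SIGNATURE_RE = re.compile(
    r"^\s*(--\s*$|sent from my\b|best regards\b|kind regards\b)",
    re.IGNORECASE,
)


def strip_signature(body: str) -> str:
    """Drop everything from the first line that looks like a signature marker."""
    kept = []
    for line in body.splitlines():
        if SIGNATURE_RE.match(line):
            break
        kept.append(line)
    return "\n".join(kept)


def clean_ticket(subject: str, body: str) -> str:
    """Combine subject and body, then normalize the noise types listed above."""
    text = subject + "\n" + strip_signature(body)
    text = html.unescape(TAG_RE.sub(" ", text))
    text = URL_RE.sub(" urltoken ", text)          # keep a signal that a URL was present
    text = ORDER_RE.sub(" orderidtoken ", text)    # collapse unique IDs into one token
    text = PUNCT_RE.sub(r"\1", text)
    return re.sub(r"\s+", " ", text).strip().lower()
```

Replacing order IDs and URLs with placeholder tokens, rather than deleting them, keeps the vocabulary small while preserving the fact that the ticket mentioned one, which may itself be a useful routing signal.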
Success Criteria
A production-ready baseline should achieve:
- Macro F1 >= 0.82 on a held-out test set
- Recall >= 0.90 for Account Access tickets
- Inference latency < 20 ms per ticket in batch scoring on CPU
- Clear feature interpretability for support operations review
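The two metric thresholds above can be expressed as an automated check on held-out predictions. The sketch below computes per-class precision, recall, and F1 in plain Python and then applies the macro F1 and Account Access recall thresholds. The function names are hypothetical, and an equivalent check could be built on a library metric implementation instead.

```python
from collections import Counter


def per_class_scores(y_true, y_pred):
    """Return precision, recall, and F1 for every label seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted label gains a false positive
            fn[truth] += 1  # true label gains a false negative
    scores = {}
    for label in labels:
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        pr_sum = precision + recall
        f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
        scores[label] = {"precision": precision, "recall": recall, "f1": f1}
    return scores


def meets_quality_gates(y_true, y_pred, min_macro_f1=0.82,
                        recall_label="Account Access", min_recall=0.90):
    """Check the macro F1 and Account Access recall thresholds from the criteria."""
    scores = per_class_scores(y_true, y_pred)
    macro_f1 = sum(s["f1"] for s in scores.values()) / len(scores)
    recall = scores.get(recall_label, {"recall": 0.0})["recall"]
    return macro_f1 >= min_macro_f1 and recall >= min_recall
```

Because macro F1 averages the per-class F1 values with equal weight, the smaller classes (Product Defect and Other at 15% each) count as much as Refund, so a model cannot meet the threshold by doing well only on the frequent classes.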
Constraints
- Use a TF-IDF-based approach, not embeddings or transformers
- Solution must run in a standard Python service on CPU
- The pipeline should be reproducible and easy to retrain weekly
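One way to satisfy these constraints is to bundle vectorization and classification into a single scikit-learn `Pipeline` with a fixed random seed, so that a weekly retrain is one `fit` call on fresh data. This is a sketch under assumptions: the `build_pipeline` helper is hypothetical, and the parameter values (`min_df=5`, bigrams, `class_weight="balanced"`) are starting points rather than tuned settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


def build_pipeline(min_df: int = 5, seed: int = 42) -> Pipeline:
    """TF-IDF + Logistic Regression baseline; runs on CPU with no GPU dependencies."""
    return Pipeline([
        # Unigrams and bigrams; sublinear_tf dampens terms repeated many times in a ticket.
        ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=min_df, sublinear_tf=True)),
        # class_weight="balanced" is an assumption aimed at the Account Access recall target.
        ("clf", LogisticRegression(max_iter=1000, class_weight="balanced",
                                   random_state=seed)),
    ])


if __name__ == "__main__":
    # Toy data for illustration only; min_df=1 because the corpus is tiny.
    texts = ["refund my money please", "i want a refund now",
             "cannot log in to my account", "password reset link not working"]
    labels = ["Refund", "Refund", "Account Access", "Account Access"]
    model = build_pipeline(min_df=1).fit(texts, labels)
    print(model.predict(["need a refund"]))
```

Keeping everything inside one pipeline object means the same fitted artifact can be serialized (for example with `joblib`) and loaded by the scoring service, which avoids drift between training-time and serving-time preprocessing.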
Requirements
- Build an end-to-end multi-class text classification pipeline using TF-IDF features.
- Define a realistic preprocessing strategy for noisy support text.
- Choose and justify an appropriate classifier (for example, Logistic Regression or Linear SVM).
- Show how you would tune TF-IDF parameters such as n-gram range, min_df, max_df, and sublinear_tf.
- Evaluate the model with class-level metrics and confusion analysis.
- Explain how you would inspect top weighted terms to validate routing behavior and detect spurious correlations.