Business Context
ShopEase wants to build a product-review classifier that routes customer feedback into categories such as shipping issues, product quality, returns, and general praise. Your task is to design and implement a robust text preprocessing pipeline that prepares noisy review text for downstream machine learning.
Data
- Volume: ~850,000 historical reviews collected over 18 months
- Text length: 5-300 words per review, median length 42 words
- Language: Primarily English (96%), with some mixed-language and emoji-heavy content
- Labels: 4 classes with moderate imbalance; "general praise" is ~45% of the dataset
- Noise: HTML fragments, URLs, repeated punctuation, misspellings, contractions, emojis, and duplicate reviews
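The noise types listed above can be handled with a light-touch cleaner built on the standard library. The following is a minimal sketch, not a definitive implementation: the function names `clean_review` and `drop_duplicates` are hypothetical, the regular expressions are assumptions about what the noise looks like, and collapsing repeated punctuation to a single character is one possible choice that trades some intensity signal for a smaller vocabulary. Emojis are deliberately left in place.

```python
import html
import re
import unicodedata

# Assumed patterns; tune them against real samples of the review data.
_TAG_RE = re.compile(r"</?[a-zA-Z][^>]*>")        # HTML tags, but not "<3"
_URL_RE = re.compile(r"(?:https?://|www\.)\S+")   # bare URLs
_REPEAT_PUNCT_RE = re.compile(r"([!?.,])\1+")     # "!!!" -> "!"
_WS_RE = re.compile(r"\s+")


def clean_review(text: str, lowercase: bool = True) -> str:
    """Normalize one raw review without removing emojis or words."""
    text = unicodedata.normalize("NFKC", text)  # unify Unicode variants
    text = html.unescape(text)                  # "&amp;" -> "&"
    text = _TAG_RE.sub(" ", text)               # drop HTML fragments
    text = _URL_RE.sub(" ", text)               # drop URLs
    text = _REPEAT_PUNCT_RE.sub(r"\1", text)    # collapse repeated punctuation
    if lowercase:
        text = text.lower()
    return _WS_RE.sub(" ", text).strip()        # normalize whitespace


def drop_duplicates(reviews):
    """Keep the first occurrence of each review, compared after cleaning."""
    seen = set()
    unique = []
    for raw in reviews:
        key = clean_review(raw)
        if key and key not in seen:
            seen.add(key)
            unique.append(raw)
    return unique
```

Comparing reviews after cleaning means that near-identical copies differing only in markup, casing, or punctuation runs are treated as duplicates. Misspellings and contractions are not addressed in this sketch.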
Success Criteria
A good solution should produce a clean, reproducible preprocessing pipeline that improves downstream classification quality and is suitable for batch retraining. The processed features should support, at minimum, a strong TF-IDF baseline that reaches a macro-F1 of 0.78 or higher on a held-out set.
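Macro-F1 averages the per-class F1 scores with equal weight, so the dominant "general praise" class cannot mask poor performance on the smaller classes. The sketch below computes it in plain Python to make the definition concrete; scikit-learn's `f1_score(..., average="macro")` computes the same quantity. The class proportions other than the ~45% praise share are assumptions made up for the illustration.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1
            fn[truth] += 1
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical label distribution: only the 45% praise share comes from the brief.
y_true = ["praise"] * 45 + ["shipping"] * 20 + ["quality"] * 20 + ["returns"] * 15
y_majority = ["praise"] * 100  # a model that always predicts the majority class

accuracy = sum(t == p for t, p in zip(y_true, y_majority)) / len(y_true)
print(f"accuracy={accuracy:.3f} macro_f1={macro_f1(y_true, y_majority):.3f}")
```

Under these assumed proportions the majority-class predictor scores 0.45 accuracy but only about 0.155 macro-F1, which is why the target is stated in macro-F1 rather than accuracy.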
Constraints
- Pipeline must run on a single CPU machine for daily batch jobs
- Preprocessing must be deterministic and easy to version
- Avoid aggressive cleaning that removes important sentiment or complaint signals
- Solution should be compatible with scikit-learn and transformer-based models
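One way to make preprocessing deterministic and easy to version is to keep every option in an immutable configuration object and derive a stable fingerprint from it. This is a minimal sketch under assumptions: the class `PreprocessConfig`, its field names, and the 12-character fingerprint length are all hypothetical and not part of the brief.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class PreprocessConfig:
    """Hypothetical set of switches controlling the preprocessing steps."""
    lowercase: bool = True
    strip_html: bool = True
    strip_urls: bool = True
    lemmatize: bool = True
    remove_stopwords: bool = False
    schema_version: int = 1

    def fingerprint(self) -> str:
        # Serialize with sorted keys so the hash does not depend on field order.
        payload = json.dumps(asdict(self), sort_keys=True, separators=(",", ":"))
        # hashlib is stable across runs; the built-in hash() of a str is not.
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


default_cfg = PreprocessConfig()
sparse_cfg = PreprocessConfig(remove_stopwords=True)
print(default_cfg.fingerprint(), sparse_cfg.fingerprint())
```

Storing the fingerprint alongside each trained model and each processed dataset makes it possible to tell which preprocessing settings produced a given daily batch.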
Requirements
- Implement a reusable Python preprocessing pipeline for raw review text.
- Handle normalization steps such as casing, HTML/URL removal, punctuation cleanup, and whitespace normalization.
- Apply tokenization and lemmatization, and explain when stopword removal should or should not be used.
- Generate machine-learning-ready features using TF-IDF as a baseline.
- Train and evaluate a simple classifier on the processed text.
- Briefly justify preprocessing choices and note trade-offs for transformer models versus sparse features.
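The requirements above could be wired together roughly as follows. This is a sketch under stated assumptions rather than a reference solution: the helper names `preprocess` and `build_baseline` are hypothetical, the eight toy reviews are invented for illustration only, and lemmatization is omitted because it needs an external resource (for example spaCy or NLTK's `WordNetLemmatizer`), which would plug in through a custom tokenizer. Stopwords are kept so that negations such as "not" and "never" survive.

```python
import html
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.pipeline import Pipeline


def preprocess(text: str) -> str:
    """Lowercase, strip HTML tags and URLs, collapse whitespace."""
    text = html.unescape(text).lower()
    text = re.sub(r"</?[a-z][^>]*>", " ", text)
    text = re.sub(r"(?:https?://|www\.)\S+", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def build_baseline() -> Pipeline:
    """TF-IDF features followed by a linear classifier."""
    return Pipeline([
        ("tfidf", TfidfVectorizer(
            preprocessor=preprocess,  # replaces the built-in lowercasing step
            stop_words=None,          # keep "not", "never", "no"
            ngram_range=(1, 2),       # bigrams such as "never arrived"
            sublinear_tf=True,
        )),
        # class_weight="balanced" offsets the large "general praise" class.
        ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
    ])


# Invented toy data, only to show that the pipeline fits and predicts.
texts = [
    "Package arrived two weeks late",
    "Courier never delivered my order",
    "The zipper broke after one day",
    "Cheap material, stitching came apart",
    "Refund still not processed",
    "Return label was never sent",
    "Love it, works perfectly",
    "Great value, highly recommend",
]
labels = ["shipping", "shipping", "quality", "quality",
          "returns", "returns", "praise", "praise"]

model = build_baseline()
model.fit(texts, labels)
preds = model.predict(texts)
# Scored on the training data here; a real evaluation needs a held-out set.
print("macro-F1 on toy training data:", f1_score(labels, preds, average="macro"))
```

For sparse features, stopword removal mainly shrinks the vocabulary, and it is usually safer to skip it or to use a custom list that excludes negations. For transformer models, a lighter variant of `preprocess` (HTML and URL stripping only) is typically preferable, since the model's own subword tokenizer expects natural text and gains little from lemmatization or stopword removal.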