Business Context
ShopSphere, an online marketplace, receives a large volume of customer reviews and wants to automatically classify review sentiment so product, support, and trust teams can monitor customer satisfaction and detect negative trends quickly.
Data Characteristics
- Volume: 850,000 historical product reviews collected over 18 months
- Text length: 5-300 words per review, median length 42 words
- Language: English only for the first version
- Labels: 3 sentiment classes — Positive (68%), Neutral (14%), Negative (18%)
- Text quality: Includes typos, emojis, repeated punctuation, HTML fragments, product codes, and occasional duplicated reviews
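Given the noise profile above, a first normalization pass might look like the sketch below. The function names are illustrative, not part of the brief, and the exact rules (e.g., collapsing repeated punctuation to two characters) are assumptions a team would tune against real samples.

```python
import html
import re

def clean_review(text: str) -> str:
    """Normalize a raw review: unescape HTML entities, strip HTML
    fragments, collapse repeated punctuation, squeeze whitespace."""
    text = html.unescape(text)                      # "&amp;" -> "&"
    text = re.sub(r"<[^>]+>", " ", text)            # drop HTML fragments
    text = re.sub(r"([!?.])\1{2,}", r"\1\1", text)  # "!!!!!" -> "!!"
    text = re.sub(r"\s+", " ", text).strip()        # squeeze whitespace
    return text

def dedupe(reviews: list[str]) -> list[str]:
    """Drop exact duplicates after cleaning, preserving first occurrence."""
    seen: set[str] = set()
    kept = []
    for review in reviews:
        key = clean_review(review).lower()
        if key not in seen:
            seen.add(key)
            kept.append(review)
    return kept
```

Emojis and product codes are deliberately left in place here: for a transformer baseline they often carry sentiment or identity signal, and removing them is a choice to validate in error analysis rather than a default.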
Success Criteria
A production-ready model should achieve macro-F1 ≥ 0.84 and negative-class recall ≥ 0.90, since missed negative reviews slow the team's response to product issues. Batch inference must fit within the existing Python pipeline and complete each daily review load, and online inference should stay under 120 ms per review.
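These gates can be checked directly with scikit-learn. `meets_success_criteria` is a hypothetical helper, but the 0.84 / 0.90 thresholds are the ones stated above:

```python
from sklearn.metrics import f1_score, recall_score

def meets_success_criteria(y_true, y_pred, neg_label="Negative"):
    """Check the release gates: macro-F1 >= 0.84 and recall >= 0.90
    on the negative class. Returns (passes, macro_f1, negative_recall)."""
    macro_f1 = float(f1_score(y_true, y_pred, average="macro"))
    # Restricting `labels` to one class yields that class's recall.
    neg_recall = float(
        recall_score(y_true, y_pred, labels=[neg_label], average="macro")
    )
    passes = bool(macro_f1 >= 0.84 and neg_recall >= 0.90)
    return passes, macro_f1, neg_recall
```

Gating on negative-class recall separately from macro-F1 matters here because the 68% positive majority lets a model score a respectable average while still missing the negatives the trust team cares about.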
Constraints
- Training must run on a single GPU; inference must be CPU-compatible for deployment
- Model size should remain practical for a containerized service
- The solution should be explainable enough to support error review by non-ML stakeholders
Requirements
- Build an NLP pipeline for sentiment analysis as a text classification task.
- Define preprocessing for noisy user-generated review text.
- Implement a modern Python solution using a transformer baseline and a lightweight benchmark model.
- Handle class imbalance and justify the training setup.
- Evaluate the model with appropriate classification metrics and error analysis.
- Describe how you would deploy and monitor the model for drift, especially around new products and seasonal language changes.
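For the class-imbalance requirement, one common setup is inverse-frequency class weights applied to the training loss. The helper below is a sketch using the class shares stated above (68/14/18); the resulting weights can be passed to, e.g., `torch.nn.CrossEntropyLoss(weight=...)` when fine-tuning the transformer.

```python
def inverse_frequency_weights(counts: dict[str, int]) -> dict[str, float]:
    """Weight each class by total / (num_classes * class_count), so
    rarer classes (here Neutral, then Negative) get larger weights."""
    total = sum(counts.values())
    k = len(counts)
    return {cls: total / (k * n) for cls, n in counts.items()}

# Class shares from the brief, treated as counts out of 100:
weights = inverse_frequency_weights({"Positive": 68, "Neutral": 14, "Negative": 18})
```

Weighting the loss is only one option; oversampling negatives or focal loss are alternatives, and whichever is chosen should be justified against the negative-recall gate rather than assumed.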
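For drift monitoring, one simple label-free signal is the Population Stability Index (PSI) of the predicted class distribution against a fixed baseline (e.g., validation-set predictions). A sketch, with the caveat that the 0.1 / 0.25 thresholds are conventional rules of thumb, not values from this brief:

```python
import math

def psi(expected: dict[str, float], observed: dict[str, float],
        eps: float = 1e-6) -> float:
    """Population Stability Index between a baseline class distribution
    and a recent window of model predictions. 0 means identical;
    common rule of thumb: < 0.1 stable, 0.1-0.25 watch, > 0.25 investigate."""
    score = 0.0
    for cls in expected:
        e = max(expected[cls], eps)
        o = max(observed.get(cls, 0.0), eps)
        score += (o - e) * math.log(o / e)
    return score

# Baseline from the brief's historical label shares.
baseline = {"Positive": 0.68, "Neutral": 0.14, "Negative": 0.18}
```

Computed daily per product category, a PSI alert would flag exactly the cases the brief worries about: new products with unfamiliar vocabulary and seasonal language shifts, both of which tend to move the prediction distribution before labeled feedback arrives.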