Business Context
ShopSphere wants to build a baseline NLP pipeline for classifying customer product reviews into sentiment categories before investing in larger transformer models. Your task is to design the text preprocessing workflow and show how it affects downstream model quality.
Data
You are given 180,000 English product reviews collected over 12 months from the ShopSphere marketplace.
- Task: classify each review as positive, neutral, or negative
- Text length: 5-300 words, median 42 words
- Language: English only, but reviews contain typos, emojis, HTML fragments, URLs, repeated punctuation, and inconsistent casing
- Label distribution: Positive 62%, Neutral 18%, Negative 20%
- Data quality issues: duplicate reviews, missing text in ~1.5% of rows, and noisy user-generated formatting
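The duplicate and missing-text issues listed above can be handled before any text normalization. The following is a minimal sketch, assuming each review is a dict with hypothetical "text" and "label" keys and that duplicates are defined as reviews whose text matches after case folding and whitespace collapsing. Both the field names and the duplicate definition are assumptions, not part of the dataset description.

```python
import re

def clean_rows(rows):
    """Drop rows with missing text and exact duplicates (first occurrence wins)."""
    seen = set()
    kept = []
    for row in rows:
        text = row.get("text")
        if text is None or not text.strip():
            continue  # missing or whitespace-only text
        # Dedup key: case-folded, whitespace-collapsed text (an assumption;
        # stricter or fuzzier keys are possible).
        key = re.sub(r"\s+", " ", text).strip().casefold()
        if key in seen:
            continue
        seen.add(key)
        kept.append(row)
    return kept

# Invented example rows, for illustration only.
rows = [
    {"text": "Great phone!!!", "label": "positive"},
    {"text": "great  phone!!!", "label": "positive"},
    {"text": None, "label": "neutral"},
    {"text": "   ", "label": "negative"},
    {"text": "Not worth it.", "label": "negative"},
]
cleaned = clean_rows(rows)
print(len(rows), "->", len(cleaned))
```

Running deduplication before the train/test split is usually the safer order, since a review that appears in both splits can inflate evaluation scores. This sketch keeps the first occurrence and does not resolve duplicates that carry conflicting labels.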
Success Criteria
A strong solution should produce a reproducible preprocessing pipeline, justify which cleaning steps are useful or harmful, and achieve macro-F1 >= 0.78 with a lightweight baseline model. The pipeline should be easy to retrain weekly and sustain an inference throughput of at least 50,000 reviews per hour.
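Macro-F1 is the unweighted mean of the per-class F1 scores, so it penalizes a model that ignores the smaller classes. The sketch below computes it from scratch and applies it to a trivial predictor that always outputs "positive" on 100 synthetic labels matching the stated 62/18/20 distribution. The labels are generated for illustration, not drawn from the real data.

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1, with F1 = 2TP / (2TP + FP + FN)."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # A class that is never predicted and never present scores 0 here.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

labels = ["positive", "neutral", "negative"]
# 100 synthetic labels matching the stated 62/18/20 distribution.
y_true = ["positive"] * 62 + ["neutral"] * 18 + ["negative"] * 20
y_pred = ["positive"] * 100  # majority-class predictor

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred, labels)
print(f"accuracy={accuracy:.2f} macro_f1={score:.3f}")
```

The majority-class predictor reaches 0.62 accuracy but only about 0.255 macro-F1, which is why the 0.78 target cannot be met without reasonable performance on the neutral and negative classes. For the throughput target, 50,000 reviews per hour leaves a budget of roughly 72 ms per review for a single sequential worker.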
Constraints
- Use Python NLP tooling suitable for production experimentation
- Keep preprocessing deterministic and versionable
- Avoid overly aggressive normalization that removes sentiment cues such as negation or emphasis
- The baseline should run on CPU-only infrastructure
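One way to keep preprocessing deterministic and versionable is to drive a pure normalization function from a small configuration and to fingerprint that configuration. The sketch below is illustrative only: the CONFIG keys, the "__url__" placeholder token, and the cap of three repeated characters are assumptions chosen for the example. Repeated characters are capped rather than removed so that emphasis such as "!!!" or "sooo" survives, and digits are excluded from the cap so that numbers are not altered. Lowercasing is left configurable because all-caps words can also signal emphasis.

```python
import hashlib
import html
import json
import re

# Hypothetical configuration; keys and values are illustrative assumptions.
CONFIG = {
    "version": "1",
    "lowercase": True,
    "url_token": "__url__",
    "max_char_repeat": 3,
}

TAG_RE = re.compile(r"<[^>]+>")
URL_RE = re.compile(r"https?://\S+|www\.\S+")
WS_RE = re.compile(r"\s+")

def normalize(text, config=CONFIG):
    """Deterministic cleanup: entities, HTML tags, URLs, repeats, case, spaces."""
    text = html.unescape(text)                     # "&amp;" -> "&"
    text = TAG_RE.sub(" ", text)                   # drop HTML fragments
    text = URL_RE.sub(f" {config['url_token']} ", text)
    n = config["max_char_repeat"]
    # Cap runs of the same non-digit, non-space character at n occurrences.
    text = re.sub(r"([^\d\s])\1{%d,}" % n, lambda m: m.group(1) * n, text)
    if config["lowercase"]:
        text = text.lower()
    return WS_RE.sub(" ", text).strip()

def config_fingerprint(config=CONFIG):
    """Short stable hash of the config, usable as a pipeline version tag."""
    payload = json.dumps(config, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]

raw = "LOVE it!!!!!! <br> Not   cheap &amp; sooooo worth it, see https://example.com/p?id=1"
print(normalize(raw))
print(config_fingerprint())
```

Storing the fingerprint alongside each trained model would make it possible to tell which preprocessing settings produced it. Note that the negation "not" passes through untouched, in line with the constraint on preserving sentiment cues.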
Requirements
- Define the key preprocessing steps for noisy review text and explain why each step is needed.
- Implement a modern Python pipeline for preprocessing, feature extraction, model training, and evaluation.
- Compare at least one preprocessing choice (for example, lemmatization vs. no lemmatization, or removing all stopwords vs. retaining negation words) and report its effect on model quality.
- Train a baseline sentiment classifier using TF-IDF features.
- Evaluate the pipeline with appropriate classification metrics and summarize common failure cases.
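The requirements above could be approached with a scikit-learn pipeline along the following lines. This is a sketch under stated assumptions rather than a reference solution: the toy reviews are invented, the hyperparameters (bigrams, sublinear TF, balanced class weights) are starting points, and the NEGATIONS set is a hand-picked example. It compares removing all English stopwords against keeping negation words, scores both with macro-F1, and collects misclassified test reviews for failure analysis.

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Toy data standing in for the real reviews (invented for illustration).
pos = ["love it", "works great", "excellent quality", "very happy with this",
       "good value", "highly recommend", "great product", "really good"]
neu = ["it is okay", "average product", "works as expected", "nothing special",
       "fine for the price", "arrived on time", "does the job", "standard quality"]
neg = ["not good", "does not work", "never buying again", "not worth the money",
       "poor quality", "stopped working", "no good at all", "not happy"]
texts = pos + neu + neg
labels = ["positive"] * 8 + ["neutral"] * 8 + ["negative"] * 8

# Hand-picked negation words to retain (an assumption, not an exhaustive list).
NEGATIONS = {"no", "nor", "not", "never", "cannot"}
VARIANTS = {
    "remove_all_stopwords": sorted(ENGLISH_STOP_WORDS),
    "keep_negations": sorted(ENGLISH_STOP_WORDS - NEGATIONS),
}

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)

def evaluate(stop_words):
    """Fit TF-IDF + logistic regression and return macro-F1 and predictions."""
    model = make_pipeline(
        TfidfVectorizer(stop_words=stop_words, ngram_range=(1, 2), sublinear_tf=True),
        LogisticRegression(max_iter=1000, class_weight="balanced"),
    )
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    return f1_score(y_test, pred, average="macro", zero_division=0), pred

results = {}
errors = {}
for name, stop_words in VARIANTS.items():
    score, pred = evaluate(stop_words)
    results[name] = score
    # Misclassified reviews as (text, true label, predicted label).
    errors[name] = [(t, y, p) for t, y, p in zip(X_test, y_test, pred) if y != p]

for name, score in results.items():
    print(f"{name}: macro-F1={score:.3f}, errors={len(errors[name])}")
```

Scores on 24 toy reviews say nothing about the real dataset. The point is the structure: a fixed random seed, a stratified split, one pipeline object per variant, and an error list that can be inspected for recurring patterns such as negation, sarcasm, or mixed sentiment.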