Business Context
ShopSphere wants to standardize text preprocessing for customer reviews before training downstream sentiment and topic classifiers. Your task is to design and implement a practical preprocessing pipeline using tokenization, stemming, and lemmatization, then justify when each technique should be used.
Data
- Volume: 250,000 English product reviews collected over 18 months
- Text length: 5–300 words per review (median 42 words)
- Language: Primarily English, with minor noise from emojis, URLs, HTML fragments, and misspellings
- Labels: 3 sentiment classes — positive (62%), neutral (18%), negative (20%)
- Text characteristics: Informal language, repeated punctuation, contractions, product codes, and domain terms such as SKU names
Success Criteria
A good solution should produce a reusable preprocessing pipeline that improves consistency of model inputs, preserves sentiment-bearing terms, and supports a baseline classifier with macro-F1 >= 0.80 on a held-out validation set.
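For reference, macro-F1 is the unweighted mean of the per-class F1 scores, so the 18% neutral class counts as much as the 62% positive class. A minimal sketch of the evaluation call with scikit-learn, using placeholder labels rather than real predictions:

```python
from sklearn.metrics import classification_report, f1_score

# Placeholder labels standing in for a real held-out validation split.
y_val  = ["positive", "neutral", "negative", "positive", "negative"]
y_pred = ["positive", "negative", "negative", "positive", "neutral"]

# Macro averaging treats all three classes equally despite the
# 62/18/20 class imbalance, which is why it is the success metric here.
print("macro-F1:", f1_score(y_val, y_pred, average="macro"))
print(classification_report(y_val, y_pred))
```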
Constraints
- Pipeline must run on CPU for batch preprocessing of daily review uploads
- Preprocessing should be reproducible and easy to swap between stemming and lemmatization modes (see the sketch after this list)
- Avoid aggressive normalization that removes important sentiment cues such as negation
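One way to meet the last two constraints in a single place is a small preprocessor with an explicit mode switch that also keeps negation words out of the stopword filter. The class name and the four-word negation set below are illustrative assumptions, not library APIs:

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

for pkg in ("punkt", "punkt_tab", "stopwords", "wordnet"):
    nltk.download(pkg, quiet=True)   # no-op once the data is cached

class ReviewPreprocessor:
    """Illustrative pipeline; mode is 'stem', 'lemma', or 'none'."""

    # Minimal negation set (an assumption; extend for production).
    NEGATIONS = {"no", "not", "nor", "never"}

    def __init__(self, mode="lemma"):
        if mode not in {"stem", "lemma", "none"}:
            raise ValueError(f"unknown mode: {mode}")
        self.mode = mode
        self.stemmer = SnowballStemmer("english")
        self.lemmatizer = WordNetLemmatizer()
        # Keep negation cues out of the stopword list (third constraint).
        self.stopwords = set(stopwords.words("english")) - self.NEGATIONS

    def __call__(self, text):
        tokens = word_tokenize(text.lower())
        # word_tokenize splits "didn't" into "did" + "n't"; keep "n't".
        tokens = [t for t in tokens
                  if t not in self.stopwords
                  and (t.isalpha() or t.endswith("n't"))]
        if self.mode == "stem":
            return [self.stemmer.stem(t) for t in tokens]
        if self.mode == "lemma":
            return [self.lemmatizer.lemmatize(t) for t in tokens]
        return tokens

pre = ReviewPreprocessor(mode="stem")
print(pre("This charger did NOT work - totally useless!!"))
# -> ['charger', 'not', 'work', 'total', 'useless'] (approximately)
```

Because the mode is a constructor argument, switching the daily batch job between stemming and lemmatization is a one-line configuration change, which keeps reprocessing runs reproducible.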
Requirements
- Build a preprocessing pipeline that includes tokenization and compares stemming vs. lemmatization (first sketch below).
- Explain which cleaning steps should happen before and after tokenization (second sketch below).
- Train a simple baseline sentiment classifier using the processed text (third sketch below).
- Show how preprocessing choices affect vocabulary size and classification quality (fourth sketch below).
- Provide Python code using modern NLP libraries such as nltk, spaCy, and scikit-learn.
- Describe trade-offs between stemming, lemmatization, and minimal normalization for production use.
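Reference sketches for the first four requirements follow. First, tokenization with a side-by-side stemming vs. lemmatization comparison; this assumes en_core_web_sm has been installed via python -m spacy download en_core_web_sm, and the sample sentence is illustrative:

```python
import spacy
from nltk.stem import PorterStemmer, SnowballStemmer

# Disable components the comparison does not need; the tagger stays
# because the small English model's lemmatizer relies on POS tags.
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])
porter, snowball = PorterStemmer(), SnowballStemmer("english")

for tok in nlp("The batteries were dying so the charging stopped"):
    print(f"{tok.text:10} porter={porter.stem(tok.text):9} "
          f"snowball={snowball.stem(tok.text):9} lemma={tok.lemma_}")
```

Stemmers clip suffixes by rule (batteries -> batteri), while lemmatization maps tokens to dictionary forms (batteries -> battery, were -> be), so lemmas stay readable, valid vocabulary entries at the cost of loading a model.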
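Second, one defensible ordering of cleaning steps, offered as an assumption rather than a rule: string-level noise (HTML entities and tags, URLs, stray whitespace, case) comes off before tokenization because it corrupts token boundaries, while token-level filtering runs after the tokenizer has done its work. The regexes are deliberately minimal:

```python
import html
import re

import nltk
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)       # tokenizer data; newer NLTK
nltk.download("punkt_tab", quiet=True)   # versions use punkt_tab

URL_RE  = re.compile(r"https?://\S+|www\.\S+")
HTML_RE = re.compile(r"<[^>]+>")

def clean_before_tokenizing(text: str) -> str:
    """String-level fixes that would confuse the tokenizer if left in."""
    text = html.unescape(text)          # &amp; -> &
    text = HTML_RE.sub(" ", text)       # strip leftover markup tags
    text = URL_RE.sub(" ", text)        # URLs carry no sentiment here
    text = re.sub(r"\s+", " ", text)    # collapse whitespace
    return text.strip().lower()

def clean_after_tokenizing(tokens):
    """Token-level fixes: drop punctuation runs, keep negation clitics."""
    return [t for t in tokens if t.isalnum() or t.endswith("n't")]

raw = 'Great phone!!! see <b>specs</b> at https://example.com &amp; more'
tokens = clean_after_tokenizing(word_tokenize(clean_before_tokenizing(raw)))
print(tokens)   # -> ['great', 'phone', 'see', 'specs', 'at', 'more']
```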
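Third, a TF-IDF plus logistic regression baseline; the toy texts and labels are placeholders for the real 250,000-review load, and class_weight="balanced" is one way to offset the 62/18/20 skew:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Placeholder data; replace with the preprocessed review corpus.
texts  = ["love it", "meh, it is fine", "broke after a day"] * 50
labels = ["positive", "neutral", "negative"] * 50

X_tr, X_val, y_tr, y_val = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42)

clf = Pipeline([
    # min_df=2 trims hapax noise such as one-off misspellings.
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=2)),
    ("model", LogisticRegression(max_iter=1000, class_weight="balanced")),
])
clf.fit(X_tr, y_tr)
print("macro-F1:", f1_score(y_val, clf.predict(X_val), average="macro"))
```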
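Fourth, a quick way to measure how stemming shrinks the vocabulary; the three-document corpus is a stand-in for the real data:

```python
import nltk
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer

nltk.download("punkt", quiet=True)
nltk.download("punkt_tab", quiet=True)

stemmer = SnowballStemmer("english")

def stem_tokenizer(text):
    # Tokenize, drop non-alphabetic tokens, then stem what remains.
    return [stemmer.stem(t) for t in word_tokenize(text) if t.isalpha()]

corpus = [
    "The batteries died quickly",
    "Battery dies too quick",
    "Charging works, charger charges fast",
]

for name, vec in [
    ("raw", CountVectorizer()),
    ("stemmed", CountVectorizer(tokenizer=stem_tokenizer, token_pattern=None)),
]:
    vec.fit(corpus)
    print(f"{name:8} vocabulary size = {len(vec.vocabulary_)}")
```

Dropping either vectorizer into the baseline pipeline above then yields the matching macro-F1 comparison, covering the classification-quality half of this requirement.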