Business Context
LexiSearch, an e-commerce search platform, is rebuilding its English-language text preprocessing pipeline for product search and review classification. The search relevance team wants to decide when stemming is sufficient and when lemmatization is worth the added complexity and latency.
Data
You are given a corpus of 850,000 English product reviews and search queries collected over 12 months.
- Text length: 2-120 tokens per record (median: 14)
- Language: English only
- Domains: apparel, electronics, home goods, beauty
- Labels for downstream task: 3-class sentiment (negative, neutral, positive) with distribution 18% / 22% / 60%
- Noise: misspellings, repeated characters, punctuation-heavy queries, SKU-like tokens
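The noise types above usually call for a light normalization pass before either stemming or lemmatization. A minimal sketch of such a pass, assuming regex rules that are illustrative only (the `normalize` helper and its patterns are not part of the spec):

```python
import re

def normalize(text: str) -> str:
    """Toy cleanup for noisy review/query text (illustrative only)."""
    text = text.lower()
    # Collapse runs of 3+ repeated characters: "soooo" -> "soo"
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)
    # Drop SKU-like tokens: 6+ char alphanumeric codes containing a digit
    text = re.sub(r"\b(?=\w*\d)[a-z0-9-]{6,}\b", " ", text)
    # Replace punctuation runs with spaces, keeping word-internal apostrophes
    text = re.sub(r"[^\w\s']+", " ", text)
    return re.sub(r"\s+", " ", text).strip()

print(normalize("Soooo good!!! Bought item B07XJ8C8F5 for my mom's kitchen"))
```

Whether to strip or keep SKU-like tokens depends on the downstream task: they are noise for sentiment, but may be exact-match signals for search.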
The goal is to compare stemming and lemmatization as preprocessing choices in a realistic NLP pipeline, then measure their impact on a downstream text classification task.
Success Criteria
A good solution should clearly explain the linguistic difference between stemming and lemmatization, implement both approaches in Python, and show which preprocessing choice performs better for sentiment classification without exceeding batch preprocessing SLAs.
Constraints
- Preprocessing must run on a single CPU worker for offline batch jobs
- Average preprocessing time should stay under 20 ms per document
- Solution must preserve enough lexical meaning for model interpretability
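The 20 ms/document budget is easy to verify with a single-worker timing harness; `preprocess` below is a hypothetical stand-in for whichever pipeline (stemming or lemmatization) is being measured:

```python
import time

def preprocess(doc: str) -> list[str]:
    """Stand-in for a real stemming/lemmatization pipeline."""
    return doc.lower().split()

def per_doc_latency_ms(docs, fn, repeats: int = 3) -> float:
    """Best-of-N wall-clock preprocessing time per document, in ms."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for d in docs:
            fn(d)
        # Keep the fastest run to reduce scheduler/warm-up noise
        best = min(best, time.perf_counter() - start)
    return best / len(docs) * 1000

docs = ["great phone, battery lasts forever"] * 1000
print(f"{per_doc_latency_ms(docs, preprocess):.4f} ms/doc")
```

Measuring best-of-N on a batch, rather than timing single documents, avoids timer-resolution artifacts and better matches the offline batch-job setting.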
Requirements
- Explain the difference between stemming and lemmatization with concrete examples.
- Build two preprocessing pipelines: one using NLTK PorterStemmer, one using spaCy lemmatization.
- Train and compare a TF-IDF + Logistic Regression sentiment classifier for each pipeline.
- Report trade-offs in vocabulary size, speed, and classification quality.
- Recommend which approach to use for search/query normalization vs sentiment modeling.
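To make the stemming/lemmatization contrast concrete before wiring in NLTK and spaCy, a dependency-free toy illustrates it: stemming strips suffixes by rule without consulting a dictionary (so it can emit non-words like "studi"), while lemmatization maps inflected forms to dictionary lemmas (and can handle irregular forms like "better" -> "good"). The suffix rules and lemma table below are simplified stand-ins, not the actual PorterStemmer or spaCy behavior:

```python
def toy_stem(word: str) -> str:
    """Crude suffix stripping in the spirit of Porter (illustrative only)."""
    for suffix in ("ies", "ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            stem = word[: -len(suffix)]
            # Porter-style quirk: "-ies" becomes "-i", e.g. "studies" -> "studi"
            return stem + "i" if suffix == "ies" else stem
    return word

# Tiny lemma table standing in for a real lexicon (WordNet / spaCy's lookups)
LEMMAS = {"studies": "study", "better": "good", "was": "be", "running": "run"}

def toy_lemmatize(word: str) -> str:
    """Dictionary lookup: inflected form -> canonical lemma."""
    return LEMMAS.get(word, word)

for w in ("studies", "better", "running"):
    print(f"{w:8s} stem={toy_stem(w):7s} lemma={toy_lemmatize(w)}")
```

The same contrast drives the trade-offs the requirements ask about: rule-based stemming is fast and shrinks vocabulary aggressively but hurts interpretability, while lemmatization preserves real words at the cost of lexicon lookups and (with spaCy) full pipeline overhead.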