Preprocess Product Reviews for Classification

Business Context

ShopSphere wants to build a baseline NLP pipeline for classifying customer product reviews into sentiment categories before investing in larger transformer models. Your task is to design the text preprocessing workflow and show how it affects downstream model quality.

Data

You are given 180,000 English product reviews collected over 12 months from the ShopSphere marketplace.

Task: classify each review as positive, neutral, or negative
Text length: 5-300 words, median 42 words
Language: English only, but reviews contain typos, emojis, HTML fragments, URLs, repeated punctuation, and inconsistent casing
Label distribution: Positive 62%, Neutral 18%, Negative 20%
Data quality issues: duplicate reviews, missing text in ~1.5% of rows, and noisy user-generated formatting
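
The data quality issues above (missing text and duplicate reviews) are usually handled before any text normalization and before the train/test split, so that copies of one review cannot land on both sides of the split. A minimal sketch using only the standard library follows; the function name, the (text, label) row shape, and the choice to match duplicates on a case- and whitespace-insensitive key are assumptions made for illustration, not part of the ShopSphere data specification.

```python
import re


def drop_missing_and_duplicates(rows):
    """Return rows with non-empty text, keeping the first copy of each duplicate.

    `rows` is an iterable of (text, label) pairs; text may be None.
    Duplicates are detected on a case- and whitespace-insensitive key
    (an assumption; exact-match deduplication is also a valid choice).
    """
    seen = set()
    kept = []
    for text, label in rows:
        if text is None or not text.strip():
            continue  # missing or blank review text
        key = re.sub(r"\s+", " ", text).strip().casefold()
        if key in seen:
            continue  # duplicate of an earlier review
        seen.add(key)
        kept.append((text, label))
    return kept


# Hypothetical sample rows, not taken from the real dataset.
sample = [
    ("Great phone!", "positive"),
    ("great   phone! ", "positive"),
    (None, "neutral"),
    ("   ", "negative"),
    ("Battery died in a week.", "negative"),
]
cleaned = drop_missing_and_duplicates(sample)
```

Keeping the first occurrence is an arbitrary but deterministic rule. Duplicates that carry conflicting labels may deserve a separate check, which this sketch does not attempt.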

Success Criteria

A strong solution should produce a reproducible preprocessing pipeline, justify which cleaning steps are useful or harmful, and achieve macro-F1 >= 0.78 with a lightweight baseline model. The pipeline should be easy to retrain weekly and support inference on 50,000 reviews per hour.
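
To make the success criteria concrete, the sketch below computes macro-F1 by hand and applies it to a classifier that always predicts the majority class at the stated label distribution. It also converts the throughput target into reviews per second. The helper name `macro_f1` and the 100-review toy sample are assumptions for illustration; in practice a library implementation of the metric would normally be used.

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); define it as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(labels)


labels = ["positive", "neutral", "negative"]

# 100 reviews at the stated label distribution: 62 / 18 / 20.
y_true = ["positive"] * 62 + ["neutral"] * 18 + ["negative"] * 20
y_majority = ["positive"] * 100  # always predict the majority class

accuracy = sum(t == p for t, p in zip(y_true, y_majority)) / len(y_true)
score = macro_f1(y_true, y_majority, labels)

# Throughput target of 50,000 reviews per hour, expressed per second.
reviews_per_second = 50_000 / 3600
```

The majority-class predictor reaches 62% accuracy but a macro-F1 of only about 0.26, because the neutral and negative classes each contribute an F1 of zero. That gap is why macro-F1 is the more informative target for this label distribution. The throughput target works out to roughly 14 reviews per second.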

Constraints

Use Python NLP tooling suitable for production experimentation
Keep preprocessing deterministic and versionable
Avoid overly aggressive normalization that removes sentiment cues such as negation or emphasis
The baseline should run on CPU-only infrastructure
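
One way to satisfy the determinism and "not too aggressive" constraints is a small, pure function built from compiled regular expressions and tagged with a version string. The sketch below is an assumption about what such a function could look like, not a prescribed design: the name `normalize_review`, the `PREPROCESS_VERSION` tag, the `__url__` placeholder, and the rule of capping repeated characters at two are all illustrative choices. It keeps apostrophes (so "don't" survives), keeps emojis, and shortens "!!!!" to "!!" rather than deleting it.

```python
import html
import re
import unicodedata

PREPROCESS_VERSION = "v1"  # hypothetical tag to store alongside each trained model

_TAG_RE = re.compile(r"</?[a-zA-Z][^>]*>")              # HTML tags such as <br> or </p>
_URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
_REPEAT_RE = re.compile(r"([^\d\s])\1{2,}")             # 3+ repeats of a non-digit, non-space char
_SPACE_RE = re.compile(r"\s+")


def normalize_review(text):
    """Deterministically clean one review while keeping sentiment cues."""
    text = html.unescape(text)                  # &amp; -> &
    text = unicodedata.normalize("NFKC", text)  # unify compatibility forms
    text = _TAG_RE.sub(" ", text)               # drop HTML fragments
    text = _URL_RE.sub(" __url__ ", text)       # replace URLs with a placeholder token
    text = text.lower()                         # fix inconsistent casing
    text = _REPEAT_RE.sub(r"\1\1", text)        # "!!!!" -> "!!", "sooooo" -> "soo"
    return _SPACE_RE.sub(" ", text).strip()


# Hypothetical example review, not taken from the real dataset.
raw = "LOVED it!!!! <br> Sooooo good 😍 see https://example.com/x?id=1 &amp; don't miss"
clean = normalize_review(raw)
```

Lowercasing discards all-caps emphasis such as "LOVED", so it is a trade-off rather than a free win; whether to keep a casing signal is a reasonable candidate for the preprocessing comparison. Digits are excluded from the repeat rule so that a number like 1000 is not shortened.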

Requirements

Define the key preprocessing steps for noisy review text and explain why each step is needed.
Implement a modern Python pipeline for preprocessing, feature extraction, model training, and evaluation.
Compare at least one preprocessing choice (for example, lemmatization vs no lemmatization, stopword removal vs retaining negations).
Train a baseline sentiment classifier using TF-IDF features.
Evaluate the pipeline with appropriate classification metrics and summarize common failure cases.
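
The requirements above can be sketched with scikit-learn as a TF-IDF plus logistic regression pipeline, evaluated with macro-F1 under three stopword settings: no removal, the built-in English list, and the same list with negation words retained. Everything below is an illustrative assumption rather than a reference solution: the hyperparameters, the `NEGATIONS` set, the `build_pipeline` helper, and especially the 18 hand-written toy reviews, which exist only to make the block runnable. Scores on the toy data say nothing about the real 180,000-review dataset.

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Negation words to keep even when stopwords are removed (illustrative choice).
NEGATIONS = {"no", "not", "nor", "never", "cannot"}
STOP_KEEP_NEGATIONS = sorted(ENGLISH_STOP_WORDS - NEGATIONS)


def build_pipeline(stop_words=None):
    """TF-IDF (unigrams + bigrams) feeding a class-weighted logistic regression."""
    return make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True, stop_words=stop_words),
        LogisticRegression(max_iter=1000, class_weight="balanced"),
    )


# Toy placeholder data: 6 hand-written reviews per class.
texts = [
    "works great and arrived early", "love the battery life",
    "excellent quality for the price", "not bad at all, very happy",
    "fast shipping and sturdy build", "would buy this again",
    "it is okay, nothing special", "average product, does the job",
    "arrived on time, looks as pictured", "decent but not remarkable",
    "fine for occasional use", "neither good nor bad",
    "stopped working after a week", "not worth the money",
    "terrible quality, broke quickly", "never buying from this seller again",
    "do not recommend this item", "cheap plastic and bad fit",
]
labels = ["positive"] * 6 + ["neutral"] * 6 + ["negative"] * 6

# Stratified split with a fixed seed so the comparison is reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=6, stratify=labels, random_state=0
)

variants = {
    "no stopword removal": None,
    "english stopwords removed": "english",
    "stopwords removed, negations kept": STOP_KEEP_NEGATIONS,
}

results = {}
for name, stop_words in variants.items():
    model = build_pipeline(stop_words).fit(X_train, y_train)
    predictions = model.predict(X_test)
    results[name] = f1_score(y_test, predictions, average="macro", zero_division=0)

for name, score in results.items():
    print(f"{name}: macro-F1 = {score:.2f}")
```

The built-in English stopword list contains "not", "no", and "never", so removing it wholesale turns "not worth the money" into "worth money". Running the same split through each variant is one way to measure how much that matters on real data. On the full dataset, a per-class report and a confusion matrix would be the natural next step for finding common failure cases.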
