Preprocess Product Reviews for Classification

Difficulty: Easy
Topic: NLP
Tags: Tokenization, Stemming, Lemmatization
Asked at 3 companies: J.D. Power, Victoria's Secret, Thrive Market

Problem

Business Context

ShopSphere wants to build a baseline NLP pipeline for classifying customer product reviews into sentiment categories before investing in larger transformer models. Your task is to design the text preprocessing workflow and show how it affects downstream model quality.

Data

You are given 180,000 English product reviews collected over 12 months from the ShopSphere marketplace.

  • Task: classify each review as positive, neutral, or negative
  • Text length: 5-300 words, median 42 words
  • Language: English only, but reviews contain typos, emojis, HTML fragments, URLs, repeated punctuation, and inconsistent casing
  • Label distribution: Positive 62%, Neutral 18%, Negative 20%
  • Data quality issues: duplicate reviews, missing text in ~1.5% of rows, and noisy user-generated formatting
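
The data quality issues listed above (missing text, duplicate reviews) can be handled before any tokenization. The following is a minimal sketch, assuming each row is a `(text, label)` pair; the function name `clean_rows` and the choice to treat reviews that differ only in case or whitespace as duplicates are illustrative assumptions, not part of the problem statement.

```python
import re


def clean_rows(rows):
    """Drop rows with missing text and duplicate reviews.

    `rows` is an iterable of (text, label) pairs (assumed layout).
    Two reviews count as duplicates when they match after whitespace
    collapsing and case folding (an assumption; exact matching is stricter).
    """
    seen = set()
    kept = []
    for text, label in rows:
        if text is None or not text.strip():
            continue  # missing or empty review text
        key = re.sub(r"\s+", " ", text).strip().casefold()
        if key in seen:
            continue  # duplicate of an earlier review
        seen.add(key)
        kept.append((text, label))
    return kept


rows = [
    ("Great product!", "positive"),
    ("great   product!", "positive"),   # duplicate up to case/whitespace
    (None, "neutral"),                  # missing text
    ("   ", "negative"),                # empty text
    ("Broke after a week.", "negative"),
]
cleaned = clean_rows(rows)
print(len(cleaned), "rows kept out of", len(rows))
```

Running deduplication before the train/test split is usually the safer order, since a duplicate that lands on both sides of the split would inflate the evaluation score.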

Success Criteria

A strong solution should produce a reproducible preprocessing pipeline, justify which cleaning steps are useful or harmful, and achieve macro-F1 >= 0.78 with a lightweight baseline model. The pipeline should be easy to retrain weekly and support inference on 50,000 reviews per hour.
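
Macro-F1 averages the per-class F1 scores with equal weight, so the Neutral class (18% of labels) counts as much as the Positive class (62%). The sketch below computes it from scratch on made-up labels to show how a model can look good on accuracy while missing a whole class; it also converts the throughput target into a per-second rate. The helper name `macro_f1` and the toy labels are illustrative only.

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


labels = ["positive", "neutral", "negative"]
y_true = ["positive", "positive", "positive", "neutral", "negative"]
y_pred = ["positive", "positive", "positive", "positive", "negative"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred, labels)
print(f"accuracy={accuracy:.2f} macro_f1={score:.3f}")

# Throughput target from the success criteria: 50,000 reviews per hour.
required_rate = 50_000 / 3600  # reviews per second
print(f"required throughput: {required_rate:.1f} reviews/s")
```

In this toy case the single neutral review is misclassified, which leaves accuracy at 0.80 but pulls macro-F1 down to about 0.62, well under the 0.78 target.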

Constraints

  • Use Python NLP tooling suitable for production experimentation
  • Keep preprocessing deterministic and versionable
  • Avoid overly aggressive normalization that removes sentiment cues such as negation or emphasis
  • The baseline should run on CPU-only infrastructure
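
One way to satisfy the constraints above is a small, pure-function normalizer built only on the standard library. This is a sketch under stated assumptions: the name `normalize_review`, the `urltoken` placeholder, the `PREPROCESS_VERSION` tag, and the specific choices (cap repeated punctuation and elongated letters at two instead of removing them, leave contractions and emojis untouched) are illustrative, not a prescribed solution.

```python
import html
import re

# Hypothetical version tag, bumped whenever a rule below changes.
PREPROCESS_VERSION = "v1"

_TAG_RE = re.compile(r"</?[a-zA-Z][^>]*>")            # HTML fragments such as <br>
_URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
_REPEAT_PUNCT_RE = re.compile(r"([!?.])\1+")          # "!!!!" -> "!!"
_ELONGATED_RE = re.compile(r"([a-z])\1{2,}")          # "sooooo" -> "soo"
_WS_RE = re.compile(r"\s+")


def normalize_review(text: str, lowercase: bool = True) -> str:
    """Deterministic cleanup that keeps negation and emphasis cues."""
    text = html.unescape(text)                 # "&amp;" -> "&"
    text = _TAG_RE.sub(" ", text)              # drop markup, keep the words
    text = _URL_RE.sub(" urltoken ", text)     # URLs carry little sentiment
    if lowercase:
        # Note: this discards ALL-CAPS emphasis; pass lowercase=False to keep it.
        text = text.lower()
    text = _REPEAT_PUNCT_RE.sub(r"\1\1", text)  # keep emphasis, cap its length
    text = _ELONGATED_RE.sub(r"\1\1", text)     # only matches lowercase letters
    return _WS_RE.sub(" ", text).strip()


raw = "LOVE it!!!! <br> Sooooo good &amp; cheap, see http://x.co/abc  Don't miss"
print(normalize_review(raw))
```

Because the function has no hidden state and no randomness, the same input always yields the same output, which makes it straightforward to pin to a version and reuse unchanged at inference time.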

Requirements

  1. Define the key preprocessing steps for noisy review text and explain why each step is needed.
  2. Implement a modern Python pipeline for preprocessing, feature extraction, model training, and evaluation.
  3. Compare at least one preprocessing choice (for example, lemmatization vs no lemmatization, stopword removal vs retaining negations).
  4. Train a baseline sentiment classifier using TF-IDF features.
  5. Evaluate the pipeline with appropriate classification metrics and summarize common failure cases.
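
Requirements 2 to 5 can be sketched with scikit-learn, which is assumed here as the tooling (the problem statement does not name a library). The example compares two stopword settings: scikit-learn's built-in English list, which drops words like "not", and the same list with negations retained. The toy reviews, the `NEGATIONS` set, and the helper names are illustrative; real scores would come from the 180,000-review dataset, not from this sample.

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Assumed set of negation words to keep out of the stopword list.
NEGATIONS = {"no", "not", "nor", "never", "cannot"}
STOP_KEEP_NEGATION = sorted(ENGLISH_STOP_WORDS - NEGATIONS)


def build_pipeline(stop_words=None):
    """TF-IDF (unigrams + bigrams) feeding a CPU-friendly linear classifier."""
    return Pipeline([
        ("tfidf", TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True,
                                  stop_words=stop_words)),
        ("clf", LogisticRegression(max_iter=1000, class_weight="balanced")),
    ])


def fit_and_score(texts, labels, stop_words, seed=42):
    """Stratified hold-out split, then macro-F1 on the held-out part."""
    x_train, x_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.25, stratify=labels, random_state=seed)
    pipe = build_pipeline(stop_words).fit(x_train, y_train)
    score = f1_score(y_test, pipe.predict(x_test), average="macro",
                     zero_division=0)
    return score, pipe


# Toy stand-in for the real dataset (made-up reviews, 4 per class).
texts = [
    "love this blender works great", "excellent quality and fast shipping",
    "very happy with the purchase", "great value would buy again",
    "it is okay nothing special", "average product does the job",
    "arrived on time as described", "fine for the price not amazing",
    "not good broke after a week", "terrible quality do not buy",
    "never worked waste of money", "did not fit very disappointed",
]
labels = ["positive"] * 4 + ["neutral"] * 4 + ["negative"] * 4

score_all, pipe_all = fit_and_score(texts, labels, "english")
score_neg, pipe_neg = fit_and_score(texts, labels, STOP_KEEP_NEGATION)

vocab_all = pipe_all.named_steps["tfidf"].vocabulary_
vocab_neg = pipe_neg.named_steps["tfidf"].vocabulary_
print("'not' kept with full stopword list:", "not" in vocab_all)
print("'not' kept with negations retained:", "not" in vocab_neg)
print(f"macro-F1 (toy data): all={score_all:.2f} keep_negation={score_neg:.2f}")
```

With the full stopword list, "not good" and "good" collapse to the same feature, which is the kind of overly aggressive normalization the constraints warn against. On the real data, the same two calls to `fit_and_score` would give the comparison asked for in requirement 3, and a per-class report or confusion matrix on the held-out set would help surface common failure cases.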
