Business Context
ShopSphere wants to standardize text preprocessing for customer reviews before training downstream sentiment and topic classifiers. Your task is to design and implement a practical preprocessing pipeline using tokenization, stemming, and lemmatization, then justify when each technique should be used.
Data
- Volume: 250,000 English product reviews collected over 18 months
- Text length: 5–300 words per review (median 42 words)
- Language: Primarily English, with minor noise from emojis, URLs, HTML fragments, and misspellings
- Labels: 3 sentiment classes — positive (62%), neutral (18%), negative (20%)
- Text characteristics: Informal language, repeated punctuation, contractions, product codes, and domain terms such as SKU names
Success Criteria
A good solution should produce a reusable preprocessing pipeline that improves consistency of model inputs, preserves sentiment-bearing terms, and supports a baseline classifier with macro-F1 >= 0.80 on a held-out validation set.
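For reference, macro-F1 is the unweighted mean of the per-class F1 scores, so the 18% neutral class counts as much as the 62% positive class. A minimal sketch of the evaluation call with scikit-learn, using placeholder labels rather than real predictions:

```python
from sklearn.metrics import classification_report, f1_score

# Placeholder labels standing in for a real held-out validation split.
y_val  = ["positive", "neutral", "negative", "positive", "negative"]
y_pred = ["positive", "negative", "negative", "positive", "neutral"]

# Macro averaging treats all three classes equally despite the
# 62/18/20 class imbalance, which is why it is the success metric here.
print("macro-F1:", f1_score(y_val, y_pred, average="macro"))
print(classification_report(y_val, y_pred))
```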
Constraints
- Pipeline must run on CPU for batch preprocessing of daily review uploads
- Preprocessing should be reproducible and easy to swap between stemming and lemmatization modes (see the sketch after this list)
- Avoid aggressive normalization that removes important sentiment cues such as negation
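One way to meet the last two constraints in a single place is a small preprocessor with an explicit mode switch that also keeps negation words out of the stopword filter. The class name and the four-word negation set below are illustrative assumptions, not library APIs:

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

for pkg in ("punkt", "punkt_tab", "stopwords", "wordnet"):
    nltk.download(pkg, quiet=True)   # no-op once the data is cached

class ReviewPreprocessor:
    """Illustrative pipeline; mode is 'stem', 'lemma', or 'none'."""

    # Minimal negation set (an assumption; extend for production).
    NEGATIONS = {"no", "not", "nor", "never"}

    def __init__(self, mode="lemma"):
        if mode not in {"stem", "lemma", "none"}:
            raise ValueError(f"unknown mode: {mode}")
        self.mode = mode
        self.stemmer = SnowballStemmer("english")
        self.lemmatizer = WordNetLemmatizer()
        # Keep negation cues out of the stopword list (third constraint).
        self.stopwords = set(stopwords.words("english")) - self.NEGATIONS

    def __call__(self, text):
        tokens = word_tokenize(text.lower())
        # word_tokenize splits "didn't" into "did" + "n't"; keep "n't".
        tokens = [t for t in tokens
                  if t not in self.stopwords
                  and (t.isalpha() or t.endswith("n't"))]
        if self.mode == "stem":
            return [self.stemmer.stem(t) for t in tokens]
        if self.mode == "lemma":
            return [self.lemmatizer.lemmatize(t) for t in tokens]
        return tokens

pre = ReviewPreprocessor(mode="stem")
print(pre("This charger did NOT work - totally useless!!"))
# -> ['charger', 'not', 'work', 'total', 'useless'] (approximately)
```

Because the mode is a constructor argument, switching the daily batch job between stemming and lemmatization is a one-line configuration change, which keeps reprocessing runs reproducible.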
Requirements
- Build a preprocessing pipeline that includes tokenization and compares stemming vs. lemmatization (first sketch below).
- Explain which cleaning steps should happen before and after tokenization (second sketch below).
- Train a simple baseline sentiment classifier using the processed text (third sketch below).
- Show how preprocessing choices affect vocabulary size and classification quality (fourth sketch below).
- Provide Python code using modern NLP libraries such as nltk, spaCy, and scikit-learn.
- Describe trade-offs between stemming, lemmatization, and minimal normalization for production use.
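Reference sketches for the first four requirements follow. First, tokenization with a side-by-side stemming vs. lemmatization comparison; this assumes en_core_web_sm has been installed via python -m spacy download en_core_web_sm, and the sample sentence is illustrative:

```python
import spacy
from nltk.stem import PorterStemmer, SnowballStemmer

# Disable components the comparison does not need; the tagger stays
# because the small English model's lemmatizer relies on POS tags.
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])
porter, snowball = PorterStemmer(), SnowballStemmer("english")

for tok in nlp("The batteries were dying so the charging stopped"):
    print(f"{tok.text:10} porter={porter.stem(tok.text):9} "
          f"snowball={snowball.stem(tok.text):9} lemma={tok.lemma_}")
```

Stemmers clip suffixes by rule (batteries -> batteri), while lemmatization maps tokens to dictionary forms (batteries -> battery, were -> be), so lemmas stay readable, valid vocabulary entries at the cost of loading a model.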
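Second, one defensible ordering of cleaning steps, offered as an assumption rather than a rule: string-level noise (HTML entities and tags, URLs, stray whitespace, case) comes off before tokenization because it corrupts token boundaries, while token-level filtering runs after the tokenizer has done its work. The regexes are deliberately minimal:

```python
import html
import re

import nltk
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)       # tokenizer data; newer NLTK
nltk.download("punkt_tab", quiet=True)   # versions use punkt_tab

URL_RE  = re.compile(r"https?://\S+|www\.\S+")
HTML_RE = re.compile(r"<[^>]+>")

def clean_before_tokenizing(text: str) -> str:
    """String-level fixes that would confuse the tokenizer if left in."""
    text = html.unescape(text)          # &amp; -> &
    text = HTML_RE.sub(" ", text)       # strip leftover markup tags
    text = URL_RE.sub(" ", text)        # URLs carry no sentiment here
    text = re.sub(r"\s+", " ", text)    # collapse whitespace
    return text.strip().lower()

def clean_after_tokenizing(tokens):
    """Token-level fixes: drop punctuation runs, keep negation clitics."""
    return [t for t in tokens if t.isalnum() or t.endswith("n't")]

raw = 'Great phone!!! see <b>specs</b> at https://example.com &amp; more'
tokens = clean_after_tokenizing(word_tokenize(clean_before_tokenizing(raw)))
print(tokens)   # -> ['great', 'phone', 'see', 'specs', 'at', 'more']
```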
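Third, a TF-IDF plus logistic regression baseline; the toy texts and labels are placeholders for the real 250,000-review load, and class_weight="balanced" is one way to offset the 62/18/20 skew:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Placeholder data; replace with the preprocessed review corpus.
texts  = ["love it", "meh, it is fine", "broke after a day"] * 50
labels = ["positive", "neutral", "negative"] * 50

X_tr, X_val, y_tr, y_val = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42)

clf = Pipeline([
    # min_df=2 trims hapax noise such as one-off misspellings.
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=2)),
    ("model", LogisticRegression(max_iter=1000, class_weight="balanced")),
])
clf.fit(X_tr, y_tr)
print("macro-F1:", f1_score(y_val, clf.predict(X_val), average="macro"))
```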
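Fourth, a quick way to measure how stemming shrinks the vocabulary; the three-document corpus is a stand-in for the real data:

```python
import nltk
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer

nltk.download("punkt", quiet=True)
nltk.download("punkt_tab", quiet=True)

stemmer = SnowballStemmer("english")

def stem_tokenizer(text):
    # Tokenize, drop non-alphabetic tokens, then stem what remains.
    return [stemmer.stem(t) for t in word_tokenize(text) if t.isalpha()]

corpus = [
    "The batteries died quickly",
    "Battery dies too quick",
    "Charging works, charger charges fast",
]

for name, vec in [
    ("raw", CountVectorizer()),
    ("stemmed", CountVectorizer(tokenizer=stem_tokenizer, token_pattern=None)),
]:
    vec.fit(corpus)
    print(f"{name:8} vocabulary size = {len(vec.vocabulary_)}")
```

Dropping either vectorizer into the baseline pipeline above then yields the matching macro-F1 comparison, covering the classification-quality half of this requirement.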