ShopLane, an online marketplace, wants to improve how search queries are routed for downstream intent classification. The search team needs a preprocessing pipeline that applies tokenization plus either stemming or lemmatization, chosen appropriately, before model training.
You are given 180,000 English search queries collected over 6 months. Queries range from 2 to 25 tokens (median: 5) and are noisy: mixed casing, punctuation, misspellings, SKU-like strings, and abbreviations such as "w/", "xl", and "refurb". Each query is labeled with one of 4 intents: product_search (62%), support (14%), returns (9%), and store_policy (15%). Roughly 6% of records are near-duplicates, and 3% contain only alphanumeric product codes.
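The noise patterns above (mixed casing, punctuation, abbreviations like "w/", "xl", "refurb", and SKU-like codes) suggest a normalization pass before any stemming or lemmatization. A minimal sketch using only the standard library follows; the abbreviation map and the SKU heuristic are illustrative assumptions, not part of the brief, and a real list would be curated from the query logs:

```python
import re

# Illustrative abbreviation map; the production list would be curated from logs.
ABBREV = {"w/": "with", "xl": "extra large", "refurb": "refurbished"}

# Crude "looks like a product code" check: lowercase alphanumerics/hyphens
# containing at least one digit. Real SKU detection would be catalog-driven.
SKU_RE = re.compile(r"^(?=.*\d)[a-z0-9-]+$")

def normalize(query: str) -> list[str]:
    """Lowercase, tokenize, expand known abbreviations; keep SKU-like tokens intact."""
    tokens = re.findall(r"[a-z0-9/-]+", query.lower())
    out = []
    for tok in tokens:
        if SKU_RE.match(tok):
            out.append(tok)                      # preserve product codes verbatim
            continue
        expanded = ABBREV.get(tok, tok)          # "w/" -> "with", "xl" -> "extra large"
        out.extend(expanded.split())
    return out

print(normalize("Refurb iPhone-12 w/ XL case!"))
```

Keeping SKU-like tokens verbatim matters here because roughly 3% of records contain only product codes, and stemming or lowercasing variants of a code would destroy the only signal those queries carry.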
A good solution should improve normalization consistency without destroying intent-bearing signal. The pipeline should reach macro-F1 >= 0.82 on a held-out test set, and the write-up should clearly justify when stemming is preferable to lemmatization (or vice versa). The pipeline should also be reproducible and easy to maintain.
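The stemming-versus-lemmatization trade-off can be made concrete with a toy sketch: a crude suffix stripper standing in for a Porter-style stemmer, and a small lookup table standing in for a dictionary-based lemmatizer. Both are deliberately simplified illustrations, not production tools, and the example words are assumptions chosen to mirror the returns/support intents in the brief:

```python
def crude_stem(word: str) -> str:
    """Toy suffix stripper: fast and vocabulary-free, but can mangle words."""
    for suffix in ("ies", "ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Toy lemma table; a real lemmatizer consults a dictionary plus POS tags.
LEMMAS = {"returned": "return", "returns": "return", "batteries": "battery"}

def crude_lemmatize(word: str) -> str:
    return LEMMAS.get(word, word)

# Both collapse the intent-bearing variants "returns"/"returned" to "return",
# which helps the returns intent. But the stemmer clips real content words
# ("batteries" -> "batter", "shipping" -> "shipp"), while the lemmatizer
# returns dictionary forms at the cost of needing vocabulary coverage.
for w in ("returns", "returned", "batteries", "shipping"):
    print(w, "->", crude_stem(w), "|", crude_lemmatize(w))
```

On short, noisy queries like these (median 5 tokens), the usual argument is that stemming is cheap and robust to out-of-vocabulary misspellings, while lemmatization preserves readable surface forms that may matter for intents keyed to specific nouns; the justification the brief asks for should weigh exactly this trade-off against the measured macro-F1.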