ShopLane, an online marketplace, wants to improve how search queries are routed for downstream intent classification. The search team needs a preprocessing pipeline that applies tokenization plus either stemming or lemmatization, chosen appropriately, before model training.
You are given 180,000 English search queries collected over 6 months. Queries range from 2 to 25 tokens (median: 5) and are noisy: mixed casing, punctuation, misspellings, SKU-like strings, and abbreviations such as "w/", "xl", and "refurb". Each query is labeled with one of 4 intents: product_search (62%), support (14%), returns (9%), and store_policy (15%). Roughly 6% of records are near-duplicates, and 3% contain only alphanumeric product codes.
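The noise patterns above (mixed casing, punctuation, abbreviations like "w/", "xl", "refurb", and SKU-like codes) suggest a normalization pass before any stemming or lemmatization. A minimal sketch using only the standard library follows; the abbreviation map and the SKU heuristic are illustrative assumptions, not part of the brief, and a real list would be curated from the query logs:

```python
import re

# Illustrative abbreviation map; the production list would be curated from logs.
ABBREV = {"w/": "with", "xl": "extra large", "refurb": "refurbished"}

# Crude "looks like a product code" check: lowercase alphanumerics/hyphens
# containing at least one digit. Real SKU detection would be catalog-driven.
SKU_RE = re.compile(r"^(?=.*\d)[a-z0-9-]+$")

def normalize(query: str) -> list[str]:
    """Lowercase, tokenize, expand known abbreviations; keep SKU-like tokens intact."""
    tokens = re.findall(r"[a-z0-9/-]+", query.lower())
    out = []
    for tok in tokens:
        if SKU_RE.match(tok):
            out.append(tok)                      # preserve product codes verbatim
            continue
        expanded = ABBREV.get(tok, tok)          # "w/" -> "with", "xl" -> "extra large"
        out.extend(expanded.split())
    return out

print(normalize("Refurb iPhone-12 w/ XL case!"))
```

Keeping SKU-like tokens verbatim matters here because roughly 3% of records contain only product codes, and stemming or lowercasing variants of a code would destroy the only signal those queries carry.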
A good solution should improve normalization consistency without destroying intent-bearing signal. The pipeline should reach macro-F1 >= 0.82 on a held-out test set, and the write-up should clearly justify when stemming is preferable to lemmatization (or vice versa). The pipeline should also be reproducible and easy to maintain.
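The stemming-versus-lemmatization trade-off can be made concrete with a toy sketch: a crude suffix stripper standing in for a Porter-style stemmer, and a small lookup table standing in for a dictionary-based lemmatizer. Both are deliberately simplified illustrations, not production tools, and the example words are assumptions chosen to mirror the returns/support intents in the brief:

```python
def crude_stem(word: str) -> str:
    """Toy suffix stripper: fast and vocabulary-free, but can mangle words."""
    for suffix in ("ies", "ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Toy lemma table; a real lemmatizer consults a dictionary plus POS tags.
LEMMAS = {"returned": "return", "returns": "return", "batteries": "battery"}

def crude_lemmatize(word: str) -> str:
    return LEMMAS.get(word, word)

# Both collapse the intent-bearing variants "returns"/"returned" to "return",
# which helps the returns intent. But the stemmer clips real content words
# ("batteries" -> "batter", "shipping" -> "shipp"), while the lemmatizer
# returns dictionary forms at the cost of needing vocabulary coverage.
for w in ("returns", "returned", "batteries", "shipping"):
    print(w, "->", crude_stem(w), "|", crude_lemmatize(w))
```

On short, noisy queries like these (median 5 tokens), the usual argument is that stemming is cheap and robust to out-of-vocabulary misspellings, while lemmatization preserves readable surface forms that may matter for intents keyed to specific nouns; the justification the brief asks for should weigh exactly this trade-off against the measured macro-F1.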