Business Context
LexiSearch is upgrading its query understanding stack for e-commerce search. The team wants a clear, technically grounded explanation of why Transformer-based models outperform earlier recurrent sequence models, such as vanilla RNNs and LSTMs, on intent classification and semantic relevance tasks.
Data
You are given a corpus of 2.4M search queries paired with product-category intent labels and click-derived relevance judgments.
- Volume: 2.4M labeled queries, 180K held-out examples
- Text length: 2-40 tokens per query; some training examples include query rewrites up to 120 tokens
- Language: English only
- Label distribution: 12 intent classes, moderately imbalanced; top 3 classes account for 61% of traffic
- Noise: Misspellings, abbreviations, SKU codes, brand names, and short telegraphic text
Success Criteria
A strong answer should explain the performance gap in terms of parallelization, long-range dependency modeling, self-attention, contextual token representations, and transfer learning from large-scale pretraining. It should also connect those ideas to measurable gains on downstream tasks such as the intent classification and relevance problems described above.
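To make the self-attention and parallelization points concrete for both audiences, here is a minimal sketch, assuming PyTorch is available. The learned query/key/value projections are omitted for brevity, so this is illustrative rather than a production layer; shapes and dimensions are arbitrary placeholders.

```python
# Minimal sketch of scaled dot-product self-attention (illustrative only).
import math
import torch

def self_attention(x: torch.Tensor) -> torch.Tensor:
    """x: (batch, seq_len, d_model) token embeddings -> contextual token representations."""
    d_model = x.size(-1)
    # A real layer derives Q, K, V from learned projections; identity is used here for brevity.
    q, k, v = x, x, x
    # Every query attends to every key in a single matrix multiply, so all positions
    # are processed in parallel -- no recurrence over time steps as in an LSTM.
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_model)   # (batch, seq, seq)
    weights = torch.softmax(scores, dim=-1)                  # attention distribution per token
    return weights @ v                                       # each token becomes a weighted mix of all tokens

# Toy usage: a batch of 2 queries, 6 tokens each, 32-dim embeddings.
x = torch.randn(2, 6, 32)
ctx = self_attention(x)   # (2, 6, 32) context-aware token vectors
```

Because the whole sequence is handled in one pass, long-range token interactions cost one attention step rather than many recurrent steps, which is the core of the parallelization and long-range dependency arguments above.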
Constraints
- The explanation must be understandable to both ML engineers and product stakeholders
- Support the explanation with a modern Python implementation, not only theory
- Inference should remain practical for batch scoring on a single A10 GPU (see the batch-scoring sketch after this list)
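One hedged sketch of what "practical batch scoring" could look like on a single CUDA device such as an A10. The `model` and `tokenizer` objects are placeholders for whatever trained classifier is ultimately chosen; a Hugging Face style interface is assumed here, and the batch size and sequence length are illustrative, not tuned values.

```python
# Batch scoring sketch under the single-GPU constraint (assumptions noted above).
import torch

@torch.inference_mode()
def score_batches(queries, model, tokenizer, batch_size=256, device="cuda"):
    model.to(device).eval()
    preds = []
    for i in range(0, len(queries), batch_size):
        batch = queries[i:i + batch_size]
        enc = tokenizer(batch, padding=True, truncation=True,
                        max_length=64, return_tensors="pt").to(device)
        # fp16 autocast roughly halves activation memory and speeds up matmuls on an A10.
        with torch.autocast(device_type="cuda", dtype=torch.float16):
            logits = model(**enc).logits
        preds.extend(logits.argmax(dim=-1).cpu().tolist())
    return preds
```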
Requirements
- Explain why Transformers outperform prior sequence models on short and medium-length text tasks.
- Build a baseline LSTM classifier and a Transformer classifier for the same dataset (skeletons for both are sketched after this list).
- Include a realistic preprocessing pipeline for noisy search queries; a minimal normalization step appears in the same sketch.
- Compare architectures, training behavior, and expected error patterns.
- Define how you would evaluate whether the Transformer advantage is real and operationally useful.
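A minimal sketch of the preprocessing step and the two classifiers the requirements call for, assuming PyTorch. Vocabulary size, embedding width, layer counts, and the whitespace/regex tokenization are illustrative assumptions, not prescribed values; a production system would likely use a subword tokenizer instead.

```python
# Sketches of preprocessing plus LSTM and Transformer classifiers (assumed hyperparameters).
import re
import torch
import torch.nn as nn

def normalize_query(q: str) -> list[str]:
    """Light normalization for noisy e-commerce queries: lowercase, keep
    alphanumerics and hyphens (so SKU codes and brand names survive), split on whitespace."""
    q = q.lower()
    q = re.sub(r"[^a-z0-9\s\-]", " ", q)
    return q.split()

NUM_CLASSES = 12       # from the label description above
VOCAB_SIZE = 50_000    # assumed; depends on the tokenizer actually chosen
D_MODEL = 256          # assumed embedding width, shared by both models

class LSTMClassifier(nn.Module):
    """Baseline: tokens are processed sequentially; the final hidden states summarize the query."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, D_MODEL, padding_idx=0)
        self.lstm = nn.LSTM(D_MODEL, D_MODEL, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * D_MODEL, NUM_CLASSES)

    def forward(self, token_ids):                        # (batch, seq_len)
        x = self.embed(token_ids)
        _, (h, _) = self.lstm(x)                         # h: (2, batch, D_MODEL), one state per direction
        summary = torch.cat([h[0], h[1]], dim=-1)        # concatenate forward and backward states
        return self.head(summary)

class TransformerClassifier(nn.Module):
    """Transformer encoder: every token attends to every other token in parallel."""
    def __init__(self, num_layers=4, num_heads=8, max_len=128):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, D_MODEL, padding_idx=0)
        self.pos = nn.Embedding(max_len, D_MODEL)        # learned positional embeddings (an assumed choice)
        layer = nn.TransformerEncoderLayer(D_MODEL, num_heads,
                                           dim_feedforward=4 * D_MODEL, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Linear(D_MODEL, NUM_CLASSES)

    def forward(self, token_ids):                        # (batch, seq_len)
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        x = self.embed(token_ids) + self.pos(positions)
        x = self.encoder(x, src_key_padding_mask=token_ids.eq(0))
        mask = token_ids.ne(0).unsqueeze(-1)             # masked mean over non-padding tokens
        pooled = (x * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
        return self.head(pooled)
```

Keeping both models behind the same token-id interface means the evaluation asked for in the last requirement can swap architectures without changing the preprocessing or scoring code.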