Business Context
NewsPulse, a media monitoring platform, needs a text representation strategy for classifying incoming news headlines and short article snippets into editorial topics. The team wants to compare a sparse lexical baseline (TF-IDF) against dense semantic representations (word embeddings) before committing to a production pipeline.
Data
- Volume: 180,000 labeled documents collected over 12 months
- Text length: 8-220 words (median: 34 words)
- Language: English only
- Labels: 6 topic classes — Politics (22%), Business (19%), Sports (18%), Technology (15%), Entertainment (14%), Health (12%)
- Noise: HTML fragments, duplicated wire headlines, inconsistent casing, punctuation-heavy social reposts
Success Criteria
A good solution should clearly explain the practical difference between TF-IDF and word embeddings, implement both approaches in Python, and demonstrate which representation performs better for this short-text classification task. The target is macro-F1 ≥ 0.82 on a held-out test set, together with a clear discussion of the trade-offs in interpretability, latency, and semantic generalization.
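For reference, macro-F1 can be computed with scikit-learn as in the sketch below; the labels here are toy placeholders standing in for the held-out split and a fitted model's predictions.

```python
from sklearn.metrics import classification_report, f1_score

# Toy labels purely to illustrate the metric; in practice y_true and
# y_pred come from the held-out test split and the fitted classifier.
y_true = ["politics", "sports", "health", "sports", "business"]
y_pred = ["politics", "sports", "sports", "sports", "business"]

# Macro-F1 averages per-class F1 equally, so minority classes such as
# Health (12%) count as much as Politics (22%).
print(f"macro-F1: {f1_score(y_true, y_pred, average='macro'):.3f}")
print(classification_report(y_true, y_pred, digits=3, zero_division=0))
```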
Constraints
- Training must run on a single CPU or one small GPU
- Inference latency should stay under 50ms per document for batch scoring (a quick timing check is sketched after this list)
- The solution should be easy for an editorial analytics team to maintain
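A simple way to sanity-check the latency budget is to time batch scoring directly. The sketch below is a rough check rather than a benchmark harness; the function name and arguments are illustrative, and any fitted model exposing a scikit-learn-style .predict() on raw text would work.

```python
import time

def per_doc_latency_ms(model, docs, repeats=5):
    """Best-of-N per-document latency for batch scoring.

    model: any fitted object with .predict() accepting raw text,
    e.g. a scikit-learn Pipeline; docs: a list of documents.
    """
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        model.predict(docs)
        best = min(best, time.perf_counter() - start)
    return 1000.0 * best / len(docs)  # target: < 50 ms per document
```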
Requirements
- Build a baseline classifier using TF-IDF features and a linear model (see the baseline sketch below).
- Build a second classifier using word embeddings aggregated at the document level, or a lightweight transformer embedding model (see the embedding sketch below).
- Define a realistic preprocessing pipeline for noisy news text (see the cleaning sketch below).
- Compare both methods on performance, memory usage, and interpretability (see the interpretability sketch below).
- Explain when TF-IDF is preferable and when embeddings are preferable in production.
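A minimal cleaning pass for the noise sources listed under Data might look like the following; the exact rules, and whether near-duplicates rather than only exact duplicates should be dropped, would be tuned against real samples.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")          # leftover HTML fragments
WS_RE = re.compile(r"\s+")

def clean_text(text: str) -> str:
    """Normalize a raw headline or snippet before vectorization."""
    text = html.unescape(text)           # &amp; -> &, etc.
    text = TAG_RE.sub(" ", text)         # strip HTML tags
    text = text.lower()                  # neutralize inconsistent casing
    text = WS_RE.sub(" ", text).strip()  # collapse whitespace
    return text

def deduplicate(docs, labels):
    """Drop exact duplicates (e.g. repeated wire headlines) after cleaning."""
    seen, kept_docs, kept_labels = set(), [], []
    for doc, label in zip(docs, labels):
        cleaned = clean_text(doc)
        if cleaned not in seen:
            seen.add(cleaned)
            kept_docs.append(cleaned)
            kept_labels.append(label)
    return kept_docs, kept_labels
```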
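A baseline sketch using scikit-learn, assuming `texts` and `labels` are the cleaned, deduplicated corpus produced above; the vectorizer settings are common starting points for short news text, not tuned values.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# texts, labels: the cleaned, deduplicated corpus from the sketch above.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)

baseline = Pipeline([
    # Word 1-2 grams; sublinear TF damps terms repeated within a snippet.
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=3,
                              sublinear_tf=True, max_features=200_000)),
    # A linear model keeps training CPU-friendly and leaves the per-class,
    # per-n-gram coefficients directly inspectable.
    ("clf", LogisticRegression(max_iter=1000)),
])
baseline.fit(X_train, y_train)
pred = baseline.predict(X_test)
print(f"TF-IDF macro-F1: {f1_score(y_test, pred, average='macro'):.3f}")
```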
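One lightweight option for the embedding classifier, assuming the sentence-transformers package and its all-MiniLM-L6-v2 model are acceptable dependencies; averaging pretrained word vectors (e.g. GloVe) over tokens is the non-transformer alternative.

```python
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# all-MiniLM-L6-v2 is a small sentence encoder (one 384-dim vector per
# document) that is practical on CPU or a small GPU.
encoder = SentenceTransformer("all-MiniLM-L6-v2")

# Reuse the exact splits from the TF-IDF baseline for a fair comparison.
E_train = encoder.encode(X_train, batch_size=64)
E_test = encoder.encode(X_test, batch_size=64)

emb_clf = LogisticRegression(max_iter=1000).fit(E_train, y_train)
pred = emb_clf.predict(E_test)
print(f"embedding macro-F1: {f1_score(y_test, pred, average='macro'):.3f}")
```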
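Interpretability is where the sparse baseline has a structural edge: each class score decomposes into per-n-gram weights. A quick inspection helper, assuming scikit-learn ≥ 1.0 and the `baseline` pipeline above:

```python
import numpy as np

def top_ngrams(pipeline, k=10):
    """Show the k highest-weight n-grams per class for the TF-IDF baseline."""
    vocab = np.array(pipeline.named_steps["tfidf"].get_feature_names_out())
    clf = pipeline.named_steps["clf"]
    for label, coefs in zip(clf.classes_, clf.coef_):
        top = vocab[np.argsort(coefs)[::-1][:k]]
        print(f"{label}: {', '.join(top)}")

top_ngrams(baseline)
```

Memory can be compared along similar lines, e.g. the footprint of the fitted TF-IDF vocabulary and coefficient matrix versus the encoder's parameter count plus the dense document vectors.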