Business Context
HelpHive, a SaaS customer support platform, needs a lightweight text retrieval system that suggests relevant help-center articles for incoming support queries before they are escalated to an agent. The team has chosen TF-IDF as a strong baseline because it is fast, interpretable, and easy to deploy.
Data
- Corpus: 120,000 English help-center articles, FAQs, and troubleshooting notes
- Queries: 1.8M historical search and support queries with clicked article IDs
- Text length: queries are 3-25 words; articles are 40-1,200 words
- Language: English only
- Label signal: sparse implicit relevance from clicks; many queries have no clicks
- Vocabulary: product names, error codes, billing terms, and UI phrases
Success Criteria
A good solution should return at least one clicked or manually judged relevant article in the top 5 results for at least 80% of evaluation queries. It should also remain interpretable, so that support operations can see which terms caused an article to match a query.
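The top-5 criterion above can be read as a hit rate at k = 5. The following is a minimal sketch of that metric, assuming rankings and relevance labels are available as plain dictionaries keyed by query ID. The function name `hit_rate_at_k`, the data layout, and the choice to exclude queries with no click or judgment are illustrative assumptions, not part of the brief.

```python
from typing import Dict, List, Set


def hit_rate_at_k(
    ranked: Dict[str, List[str]],
    relevant: Dict[str, Set[str]],
    k: int = 5,
) -> float:
    """Fraction of evaluated queries whose top-k results contain a relevant article."""
    # Assumption: queries with no clicked or judged article are excluded,
    # because a miss cannot be distinguished from a missing label.
    evaluated = [q for q, rel in relevant.items() if rel]
    if not evaluated:
        return 0.0
    hits = sum(
        1
        for q in evaluated
        if any(doc_id in relevant[q] for doc_id in ranked.get(q, [])[:k])
    )
    return hits / len(evaluated)


# Toy data with hypothetical query and article IDs.
ranked = {
    "q1": ["a3", "a7", "a1", "a9", "a2", "a5"],
    "q2": ["a4", "a8", "a6", "a2", "a0", "a1"],
    "q3": ["a5", "a1"],
}
relevant = {
    "q1": {"a1"},  # relevant article at rank 3: a hit
    "q2": {"a1"},  # relevant article at rank 6: a miss at k=5
    "q3": set(),   # no click or judgment: excluded from the denominator
}

score = hit_rate_at_k(ranked, relevant, k=5)
meets_target = score >= 0.80  # target taken from the success criterion
```

Whether unlabeled queries are dropped or counted as misses changes the reported number considerably when clicks are sparse, so the choice should be stated alongside the result.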
Constraints
- Inference latency under 50 ms per query for top-k retrieval
- Must run on a single CPU service instance
- No external API calls or large embedding models in production
- Pipeline should be easy to retrain weekly as articles change
Requirements
- Explain what TF-IDF is and why it is useful for text retrieval and simple text classification.
- Build a preprocessing pipeline for queries and articles using modern Python tools.
- Implement TF-IDF vectorization and cosine-similarity ranking for article retrieval.
- Show how the same representation could support a simple downstream classifier or topic clustering baseline.
- Define evaluation metrics, describe failure modes, and state when TF-IDF should be replaced by dense semantic models.
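The retrieval requirements above (preprocessing, TF-IDF vectorization, cosine-similarity ranking, and interpretability) can be illustrated with a small standard-library sketch. It is not a production implementation: at the scale of 120,000 articles, a sparse-matrix library such as scikit-learn's `TfidfVectorizer` would be the more usual choice. The tokenizer pattern, the smoothed IDF formula ln((1 + N) / (1 + df)) + 1, the helper names, and the three sample articles are all assumptions made for illustration.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple

# Assumption: keep hyphenated or underscored codes such as "err-402" as one token.
TOKEN_RE = re.compile(r"[a-z0-9]+(?:[-_][a-z0-9]+)*")


def tokenize(text: str) -> List[str]:
    """Lowercase the text and extract alphanumeric tokens."""
    return TOKEN_RE.findall(text.lower())


def fit_idf(docs_tokens: List[List[str]]) -> Dict[str, float]:
    """Smoothed inverse document frequency: ln((1 + N) / (1 + df)) + 1."""
    n_docs = len(docs_tokens)
    df = Counter(term for tokens in docs_tokens for term in set(tokens))
    return {t: math.log((1 + n_docs) / (1 + d)) + 1.0 for t, d in df.items()}


def vectorize(tokens: List[str], idf: Dict[str, float]) -> Dict[str, float]:
    """Sparse, L2-normalized TF-IDF vector; out-of-vocabulary terms are dropped."""
    tf = Counter(t for t in tokens if t in idf)
    vec = {t: count * idf[t] for t, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Dot product of two unit-length sparse vectors, i.e. cosine similarity."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def rank(
    query: str, doc_vecs: Dict[str, Dict[str, float]], idf: Dict[str, float], k: int = 5
) -> List[Tuple[str, float]]:
    """Return the k highest-scoring (article_id, score) pairs for a query."""
    q_vec = vectorize(tokenize(query), idf)
    scored = [(doc_id, cosine(q_vec, vec)) for doc_id, vec in doc_vecs.items()]
    # Sort by descending score; break ties by article ID for determinism.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:k]


def explain(
    query: str, doc_vec: Dict[str, float], idf: Dict[str, float]
) -> List[Tuple[str, float]]:
    """Per-term contributions to the cosine score, largest first."""
    q_vec = vectorize(tokenize(query), idf)
    contrib = [(t, w * doc_vec[t]) for t, w in q_vec.items() if t in doc_vec]
    return sorted(contrib, key=lambda pair: -pair[1])


# Hypothetical sample articles standing in for the help-center corpus.
articles = {
    "billing-refund": "How to request a refund for a duplicate billing charge",
    "login-reset": "Reset your password when login fails with error err-402",
    "export-csv": "Export ticket reports to CSV from the dashboard",
}
doc_tokens = {doc_id: tokenize(text) for doc_id, text in articles.items()}
idf = fit_idf(list(doc_tokens.values()))
doc_vecs = {doc_id: vectorize(toks, idf) for doc_id, toks in doc_tokens.items()}

query = "password reset error err-402"
results = rank(query, doc_vecs, idf, k=2)
top_id, top_score = results[0]
why = explain(query, doc_vecs[top_id], idf)
```

The per-term contributions returned by `explain` sum to the cosine score, which is what makes the match easy to audit. Note that `rank` can return articles with a score of 0.0 when fewer than k articles share any term with the query; a real service would likely filter these out. The same normalized vectors could also be passed to a linear classifier or a clustering algorithm as a baseline for the downstream tasks listed above.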