Business Context
InsightLoop, a product research platform, collects tens of thousands of user interview notes and on-site search queries each week. The research team wants an NLP system that groups this text into interpretable topics, such as pricing confusion, onboarding friction, feature requests, and trust concerns.
Data
- Sources: user research notes, session transcripts, and search queries
- Volume: ~180,000 historical documents; ~25,000 new items per week
- Text length: search queries are 2-12 tokens; research notes are 30-400 words
- Language: 94% clean English; the remaining 6% is English mixed with product names, typos, and shorthand
- Labels: mostly unlabeled; only ~4,000 notes have analyst-assigned themes for offline validation
- Distribution: highly skewed, with many rare or emerging themes
Success Criteria
A good solution should produce coherent, stable topics that analysts can name quickly, score well on topic coherence measures over the unlabeled data, and recover at least 80% of the manually tagged themes in the ~4,000-note labeled subset. The system should also support weekly reruns and surface emerging topics without retraining from scratch on the full history.
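The 80% recovery target needs an operational definition before it can be checked. One possible reading, offered here as an assumption rather than the team's agreed metric, is that a theme counts as recovered when it is the majority analyst label of at least one discovered topic, subject to a minimum purity and a minimum number of labeled notes. The function name `theme_recovery`, the thresholds, and the use of -1 as an outlier topic id are all illustrative choices.

```python
from collections import Counter, defaultdict


def theme_recovery(topic_ids, themes, min_purity=0.5, min_support=3):
    """Return (recovery_rate, recovered_themes) for the labeled subset.

    topic_ids: discovered topic id per labeled note (-1 = outlier, by assumption)
    themes:    analyst-assigned theme per labeled note, same order
    """
    if len(topic_ids) != len(themes):
        raise ValueError("topic_ids and themes must have the same length")
    if not themes:
        raise ValueError("need at least one labeled note")

    # Count analyst themes inside each discovered topic, skipping outliers.
    by_topic = defaultdict(Counter)
    for topic, theme in zip(topic_ids, themes):
        if topic != -1:
            by_topic[topic][theme] += 1

    recovered = set()
    for counts in by_topic.values():
        theme, n = counts.most_common(1)[0]   # majority theme in this topic
        purity = n / sum(counts.values())
        if n >= min_support and purity >= min_purity:
            recovered.add(theme)

    all_themes = set(themes)
    return len(recovered) / len(all_themes), recovered
```

A run would pass the criterion under this definition when the returned rate is at least 0.80. Stricter variants (for example, requiring that most notes of a theme land in its matching topic) are equally defensible and should be agreed with the analysts.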
Constraints
- Inference and clustering must run on a single CPU machine or one small GPU
- Analysts need interpretable topic keywords and representative examples
- The pipeline must handle short queries and longer notes in the same system
- Personally identifiable information should be removed before modeling
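As a rough illustration of the PII constraint, the sketch below masks email addresses and North American style phone numbers with placeholder tokens before any text reaches the embedding step. The patterns, the placeholder strings, and the function name `scrub_pii` are assumptions for illustration. A regex pass like this is only a baseline: personal names, addresses, and account identifiers would need a dedicated PII or named-entity detector.

```python
import re

# Illustrative patterns only; real data will need broader coverage.
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
PHONE_RE = re.compile(
    r"(?<!\w)(?:\+?\d{1,3}[\s.-]?)?"      # optional country code
    r"(?:\(\d{3}\)|\d{3})[\s.-]?"         # area code
    r"\d{3}[\s.-]?\d{4}(?!\w)"            # local number
)


def scrub_pii(text):
    """Replace emails and phone numbers with fixed placeholder tokens."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text
```

Using fixed placeholders rather than deleting the match keeps sentence structure intact, which tends to matter for short queries where every token carries weight.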
Requirements
- Build a topic discovery pipeline for mixed-length text.
- Define preprocessing for noisy notes and short queries.
- Implement a modern Python solution using embeddings plus clustering or topic modeling.
- Return topic labels, top keywords, and representative documents.
- Explain how you would evaluate topic quality, stability, and usefulness to researchers.
- Describe how you would detect new or drifting topics over time.
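For the last requirement, one lightweight approach is to keep a centroid per existing topic and, each week, assign new items to the nearest centroid by cosine similarity. Items that fall below a similarity threshold become candidates for emerging topics and can be clustered on their own, which avoids refitting on all history. The sketch below assumes embeddings have already been computed upstream by some sentence-embedding model. The function names, the 0.6 threshold, and the share-based drift signal are illustrative assumptions, not a prescribed design.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def assign_new_items(new_vectors, centroids, min_similarity=0.6):
    """Assign each new embedding to its nearest topic centroid.

    Returns (assigned, candidates): a dict of item index -> topic id, and a
    list of item indices too far from every centroid to be assigned.
    """
    assigned, candidates = {}, []
    for i, vec in enumerate(new_vectors):
        best_topic, best_sim = None, float("-inf")
        for topic, centroid in centroids.items():
            sim = cosine(vec, centroid)
            if sim > best_sim:
                best_topic, best_sim = topic, sim
        if best_topic is not None and best_sim >= min_similarity:
            assigned[i] = best_topic
        else:
            candidates.append(i)   # possible emerging theme
    return assigned, candidates


def share_shift(prev_counts, curr_counts):
    """Change in each topic's share of weekly volume (current minus previous)."""
    prev_total = sum(prev_counts.values()) or 1
    curr_total = sum(curr_counts.values()) or 1
    topics = set(prev_counts) | set(curr_counts)
    return {
        t: curr_counts.get(t, 0) / curr_total - prev_counts.get(t, 0) / prev_total
        for t in topics
    }
```

In a weekly run, a rising fraction of unassigned candidates or a large share shift for one topic would be a signal to review. The threshold should be tuned against the labeled subset, since short queries often sit farther from centroids than full notes do.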