Identify Topics in Research Notes

Business Context

InsightLoop, a product research platform, stores thousands of user interview notes and on-site search queries each week. The research team wants an NLP system that groups this text into interpretable topics such as pricing confusion, onboarding friction, feature requests, and trust concerns.

Data

Sources: user research notes, session transcripts, and search queries
Volume: ~180,000 historical documents; ~25,000 new items per week
Text length: search queries are 2-12 tokens; research notes are 30-400 words
Language: 94% English, 6% mixed English with product names, typos, and shorthand
Labels: mostly unlabeled; only ~4,000 notes have analyst-assigned themes for offline validation
Distribution: highly skewed, with many rare or emerging themes

Success Criteria

A good solution should produce coherent, stable topics that analysts can name quickly, achieve strong topic coherence on unlabeled data, and recover at least 80% of manually tagged themes in the labeled subset. The system should support weekly reruns and surface emerging topics without retraining from scratch on all history.
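The 80% recovery target can only be checked once "recover" is defined. The sketch below is one possible reading, not a definition given in the brief: a theme counts as recovered when at least one cluster of labeled notes has that theme as its majority label. The function name `theme_recovery`, the `min_purity` and `min_size` thresholds, and the use of `-1` as an outlier id are all illustrative assumptions.

```python
from collections import Counter, defaultdict


def theme_recovery(cluster_ids, themes, min_purity=0.5, min_size=5):
    """Return (recovered_themes, recovery_rate) over the labeled notes.

    Assumption of this sketch: a theme is "recovered" when some cluster with
    at least `min_size` labeled notes has that theme as its majority label,
    with a share of at least `min_purity`.
    """
    # Count analyst themes inside each cluster, skipping outliers (-1).
    by_cluster = defaultdict(Counter)
    for cid, theme in zip(cluster_ids, themes):
        if cid == -1:
            continue
        by_cluster[cid][theme] += 1

    recovered = set()
    for counts in by_cluster.values():
        size = sum(counts.values())
        if size < min_size:
            continue
        theme, n = counts.most_common(1)[0]
        if n / size >= min_purity:
            recovered.add(theme)

    all_themes = set(themes)
    rate = len(recovered) / len(all_themes) if all_themes else 0.0
    return recovered, rate
```

Run on the ~4,000 analyst-tagged notes, a `rate` of 0.8 or higher would meet the stated target under this particular definition. Rare themes are the likely failures given the skewed distribution, so inspecting the set of themes that were not recovered is probably more informative than the rate alone.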

Constraints

Inference and clustering must run on a single CPU machine or one small GPU
Analysts need interpretable topic keywords and representative examples
The pipeline must handle short queries and longer notes in the same system
Personally identifiable information should be removed before modeling
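For the PII constraint, a minimal redaction pass could run before any text reaches the embedding model. The sketch below covers only email addresses and phone-like digit runs using the standard `re` module. The patterns, the placeholder tokens `[EMAIL]` and `[PHONE]`, and the function name `redact_pii` are assumptions for illustration; personal names and addresses would need a named-entity step that is not shown here.

```python
import re

# Illustrative patterns only; real notes will need broader coverage.
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
# A digit, then at least 7 digits/separators, then a final digit.
PHONE = re.compile(r"(?<!\w)\+?\d[\d\s().-]{7,}\d(?!\w)")


def redact_pii(text):
    """Replace emails and phone-like numbers with placeholder tokens."""
    text = EMAIL.sub("[EMAIL]", text)
    text = PHONE.sub("[PHONE]", text)
    return text
```

Replacing PII with a fixed token rather than deleting it keeps sentence structure intact, which tends to matter for embedding quality. The placeholder tokens would then need to be excluded from topic keywords so they do not surface as a topic of their own.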

Requirements

Build a topic discovery pipeline for mixed-length text.
Define preprocessing for noisy notes and short queries.
Implement a modern Python solution using embeddings plus clustering or topic modeling.
Return topic labels, top keywords, and representative documents.
Explain how you would evaluate topic quality, stability, and usefulness to researchers.
Describe how you would detect new or drifting topics over time.
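One way to approach the last requirement without retraining on all history is to keep one centroid per existing topic and assign each week's new embeddings to the nearest centroid, setting aside items that are not close to any of them. The sketch below assumes embeddings and centroids are already computed (for example by a sentence-embedding model, not shown) and uses plain Python lists so it runs without dependencies. The names `assign_or_flag` and `weekly_report` and the `min_similarity` threshold are hypothetical choices, not part of the brief.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def assign_or_flag(embedding, centroids, min_similarity=0.6):
    """Return (topic_id, similarity); topic_id is None if nothing is close."""
    best_id, best_sim = None, -1.0
    for topic_id, centroid in centroids.items():
        sim = cosine(embedding, centroid)
        if sim > best_sim:
            best_id, best_sim = topic_id, sim
    if best_sim < min_similarity:
        return None, best_sim
    return best_id, best_sim


def weekly_report(new_embeddings, centroids, min_similarity=0.6):
    """Count assignments per topic and collect indices of unassigned items."""
    counts = {topic_id: 0 for topic_id in centroids}
    unassigned = []
    for i, emb in enumerate(new_embeddings):
        topic_id, _ = assign_or_flag(emb, centroids, min_similarity)
        if topic_id is None:
            unassigned.append(i)
        else:
            counts[topic_id] += 1
    novelty_rate = len(unassigned) / len(new_embeddings) if new_embeddings else 0.0
    return counts, unassigned, novelty_rate
```

Under this design, the unassigned pool is what gets re-clustered each week to propose candidate new topics. A rising `novelty_rate`, or a sharp change in a topic's share of `counts`, is a possible drift signal worth sending to analysts. The threshold would need tuning against the labeled subset, and short queries may need a different value than longer notes.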
