Business Context
PulseWire, a media intelligence platform, ingests hundreds of thousands of news articles, blog posts, and press releases each month. The editorial analytics team wants an unsupervised topic modeling system to surface emerging themes, cluster related content, and track topic drift over time without relying on manual labels.
Data
- Volume: 8M English documents collected over 18 months
- Text length: 80-2,500 words per document (median: 620)
- Language: Primarily English, with ~4% noisy non-English or mixed-language content
- Content mix: News, opinion pieces, syndicated content, and duplicate wire copies
- Label availability: No reliable topic labels; only metadata such as source, publish date, and section
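Because the corpus contains duplicate wire copies, any solution will need a deduplication pass early in the pipeline. As one illustrative sketch (the function names here are hypothetical, and the quadratic near-duplicate scan would be replaced by MinHash/LSH at 8M-document scale):

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially reformatted
    wire copies hash to the same value."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def shingles(text: str, n: int = 5) -> set:
    """Word n-gram shingles used for near-duplicate detection."""
    words = normalize(text).split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}


def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0


def dedupe(docs, near_dup_threshold: float = 0.85):
    """Drop exact duplicates by content hash, then near-duplicates by
    shingle Jaccard against already-kept docs. This pairwise scan is
    quadratic; in production you would use MinHash/LSH instead."""
    seen_hashes, kept, kept_shingles = set(), [], []
    for doc in docs:
        h = hashlib.sha256(normalize(doc).encode()).hexdigest()
        if h in seen_hashes:
            continue  # exact duplicate (modulo case/whitespace)
        seen_hashes.add(h)
        sh = shingles(doc)
        if any(jaccard(sh, prev) >= near_dup_threshold for prev in kept_shingles):
            continue  # near-duplicate of a kept document
        kept.append(doc)
        kept_shingles.append(sh)
    return kept
```

The same pass is a natural place to drop the ~4% non-English content, e.g. with an off-the-shelf language identifier.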
Success Criteria
A good solution should produce interpretable topics, minimize duplicate or overly generic topics, and support monthly retraining on newly ingested documents. Topic quality should be high enough that editors can assign a meaningful name to at least 80% of the top topics from their representative keywords and sample documents.
Constraints
- Must run on a distributed or batched pipeline using commodity GPU/CPU resources
- Training should complete within 8 hours for a monthly refresh
- Inference should support assigning topics to new documents in near real time
- The system must handle noisy text, duplicates, and evolving vocabulary
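The near-real-time inference constraint usually rules out re-running the topic model per document; a common pattern is to embed the new document and assign it to the nearest stored topic centroid. A minimal sketch, assuming embeddings and centroids are already computed (the function name and the outlier threshold are illustrative):

```python
import numpy as np


def assign_topics(doc_embeddings: np.ndarray,
                  topic_centroids: np.ndarray,
                  min_similarity: float = 0.3):
    """Assign each new document to the nearest topic centroid by cosine
    similarity; documents below min_similarity are sent to an outlier
    bucket (-1) rather than polluting an existing topic."""
    # L2-normalize rows so dot products are cosine similarities
    docs = doc_embeddings / np.linalg.norm(doc_embeddings, axis=1, keepdims=True)
    cents = topic_centroids / np.linalg.norm(topic_centroids, axis=1, keepdims=True)
    sims = docs @ cents.T                      # (n_docs, n_topics)
    best = sims.argmax(axis=1)
    best_sim = sims[np.arange(len(docs)), best]
    best[best_sim < min_similarity] = -1       # low-confidence -> outlier
    return best, best_sim
```

Because this is a single matrix multiply against a small centroid table, it comfortably meets a near-real-time budget even with the embedding step in front of it.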
Requirements
- Design an end-to-end topic modeling pipeline for a large unlabeled corpus.
- Explain preprocessing choices for long-form editorial text, including deduplication and vocabulary filtering.
- Implement a modern Python solution using transformer embeddings plus a scalable topic modeling method.
- Describe how you would choose the number of topics or allow it to emerge from the data.
- Define how you would evaluate topic coherence, diversity, stability, and business usefulness.
- Explain how you would monitor topic drift and update the model over time.
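As a sketch of the core modeling step, the pipeline below clusters document embeddings and extracts representative keywords per cluster. For portability it uses TF-IDF + truncated SVD as a stand-in for transformer sentence embeddings and KMeans in place of a density-based clusterer; in the intended production design you would swap in e.g. sentence-transformer embeddings and HDBSCAN, which also lets the number of topics emerge from the data rather than fixing `n_topics` up front:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans


def topic_model(docs, n_topics=2, n_keywords=5, seed=0):
    """Embed documents, cluster them, and label each cluster with its
    highest-weight terms (a simplified class-based TF-IDF)."""
    vec = TfidfVectorizer(stop_words="english")
    X = vec.fit_transform(docs)                          # (n_docs, vocab)
    # Stand-in embedding: SVD over TF-IDF instead of a transformer
    n_comp = min(50, X.shape[0] - 1, X.shape[1] - 1)
    emb = TruncatedSVD(n_components=n_comp, random_state=seed).fit_transform(X)
    labels = KMeans(n_clusters=n_topics, n_init=10,
                    random_state=seed).fit_predict(emb)
    vocab = np.array(vec.get_feature_names_out())
    topics = {}
    for k in range(n_topics):
        # mean TF-IDF weight of each term within the cluster
        center = np.asarray(X[labels == k].mean(axis=0)).ravel()
        topics[k] = vocab[center.argsort()[::-1][:n_keywords]].tolist()
    return labels, topics
```

The per-cluster keyword lists are what editors would inspect when naming topics against the 80% interpretability bar.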
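For the evaluation requirement, two cheap automatic proxies are topic diversity (overlap between topics' top keywords) and co-occurrence coherence. A minimal sketch, assuming topics are represented as keyword lists (the coherence function is a simplified UMass-style score, not a full NPMI implementation):

```python
import math
from itertools import combinations


def topic_diversity(topics):
    """Fraction of unique words across all topics' top keywords;
    values near 1.0 mean little redundancy between topics."""
    all_words = [w for words in topics for w in words]
    return len(set(all_words)) / len(all_words)


def umass_coherence(topic_words, docs):
    """Simplified UMass-style coherence: average smoothed log
    conditional probability that keyword pairs co-occur in a document.
    Higher is better; coherent topics co-occur often."""
    doc_sets = [set(d.lower().split()) for d in docs]
    score, pairs = 0.0, 0
    for w1, w2 in combinations(topic_words, 2):
        d1 = sum(w1 in s for s in doc_sets)          # docs containing w1
        d12 = sum(w1 in s and w2 in s for s in doc_sets)  # co-occurrence
        if d1:
            score += math.log((d12 + 1) / d1)        # +1 smoothing
            pairs += 1
    return score / pairs if pairs else 0.0
```

Stability would be measured separately, e.g. by retraining on bootstrap samples and comparing keyword overlap between runs, while business usefulness ultimately comes from the editors' naming exercise.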
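For drift monitoring, one simple month-over-month check is to match each current topic to its best previous topic by keyword overlap; topics with no good match are candidates for "emerging theme" review. A sketch under the assumption that topics are stored as `{topic_id: top_keywords}` dictionaries:

```python
def topic_overlap(prev_topics, curr_topics):
    """Match each current topic to its best previous topic by Jaccard
    similarity of top-keyword sets. Low best-match scores flag new or
    drifted topics worth surfacing to editors."""
    def jaccard(a, b):
        a, b = set(a), set(b)
        return len(a & b) / len(a | b) if a | b else 0.0

    report = {}
    for cid, cwords in curr_topics.items():
        best_pid, best = None, 0.0
        for pid, pwords in prev_topics.items():
            s = jaccard(cwords, pwords)
            if s > best:
                best_pid, best = pid, s
        report[cid] = (best_pid, best)  # (matched prev topic, similarity)
    return report
```

Tracking these match scores across monthly refreshes gives a concrete drift signal, and persistently unmatched topics can trigger a retraining or vocabulary-refresh decision.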