Business Context
PulseWire, a media intelligence platform, ingests hundreds of thousands of news articles, blog posts, and press releases each month. The editorial analytics team wants an unsupervised topic modeling system to surface emerging themes, cluster related content, and track topic drift over time without relying on manual labels.
Data
- Volume: 8M English documents collected over 18 months
- Text length: 80-2,500 words per document (median: 620)
- Language: Primarily English, with ~4% noisy non-English or mixed-language content
- Content mix: News, opinion pieces, syndicated content, and duplicate wire copies
- Label availability: No reliable topic labels; only metadata such as source, publish date, and section
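Because the corpus contains duplicate wire copies, any solution will need a deduplication pass early in the pipeline. As one illustrative sketch (the function names here are hypothetical, and the quadratic near-duplicate scan would be replaced by MinHash/LSH at 8M-document scale):

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially reformatted
    wire copies hash to the same value."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def shingles(text: str, n: int = 5) -> set:
    """Word n-gram shingles used for near-duplicate detection."""
    words = normalize(text).split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}


def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0


def dedupe(docs, near_dup_threshold: float = 0.85):
    """Drop exact duplicates by content hash, then near-duplicates by
    shingle Jaccard against already-kept docs. This pairwise scan is
    quadratic; in production you would use MinHash/LSH instead."""
    seen_hashes, kept, kept_shingles = set(), [], []
    for doc in docs:
        h = hashlib.sha256(normalize(doc).encode()).hexdigest()
        if h in seen_hashes:
            continue  # exact duplicate (modulo case/whitespace)
        seen_hashes.add(h)
        sh = shingles(doc)
        if any(jaccard(sh, prev) >= near_dup_threshold for prev in kept_shingles):
            continue  # near-duplicate of a kept document
        kept.append(doc)
        kept_shingles.append(sh)
    return kept
```

The same pass is a natural place to drop the ~4% non-English content, e.g. with an off-the-shelf language identifier.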
Success Criteria
A good solution should produce interpretable topics, minimize duplicate or overly generic topics, and support monthly retraining on newly ingested documents. Topic quality should be high enough that editors can assign a meaningful name to at least 80% of the top topics from their representative keywords and sample documents.
Constraints
- Must run on a distributed or batched pipeline using commodity GPU/CPU resources
- Training should complete within 8 hours for a monthly refresh
- Inference should support assigning topics to new documents in near real time
- The system must handle noisy text, duplicates, and evolving vocabulary
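The near-real-time inference constraint usually rules out re-running the topic model per document; a common pattern is to embed the new document and assign it to the nearest stored topic centroid. A minimal sketch, assuming embeddings and centroids are already computed (the function name and the outlier threshold are illustrative):

```python
import numpy as np


def assign_topics(doc_embeddings: np.ndarray,
                  topic_centroids: np.ndarray,
                  min_similarity: float = 0.3):
    """Assign each new document to the nearest topic centroid by cosine
    similarity; documents below min_similarity are sent to an outlier
    bucket (-1) rather than polluting an existing topic."""
    # L2-normalize rows so dot products are cosine similarities
    docs = doc_embeddings / np.linalg.norm(doc_embeddings, axis=1, keepdims=True)
    cents = topic_centroids / np.linalg.norm(topic_centroids, axis=1, keepdims=True)
    sims = docs @ cents.T                      # (n_docs, n_topics)
    best = sims.argmax(axis=1)
    best_sim = sims[np.arange(len(docs)), best]
    best[best_sim < min_similarity] = -1       # low-confidence -> outlier
    return best, best_sim
```

Because this is a single matrix multiply against a small centroid table, it comfortably meets a near-real-time budget even with the embedding step in front of it.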
Requirements
- Design an end-to-end topic modeling pipeline for a large unlabeled corpus.
- Explain preprocessing choices for long-form editorial text, including deduplication and vocabulary filtering.
- Implement a modern Python solution using transformer embeddings plus a scalable topic modeling method.
- Describe how you would choose the number of topics or allow it to emerge from the data.
- Define how you would evaluate topic coherence, diversity, stability, and business usefulness.
- Explain how you would monitor topic drift and update the model over time.
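As a sketch of the core modeling step, the pipeline below clusters document embeddings and extracts representative keywords per cluster. For portability it uses TF-IDF + truncated SVD as a stand-in for transformer sentence embeddings and KMeans in place of a density-based clusterer; in the intended production design you would swap in e.g. sentence-transformer embeddings and HDBSCAN, which also lets the number of topics emerge from the data rather than fixing `n_topics` up front:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans


def topic_model(docs, n_topics=2, n_keywords=5, seed=0):
    """Embed documents, cluster them, and label each cluster with its
    highest-weight terms (a simplified class-based TF-IDF)."""
    vec = TfidfVectorizer(stop_words="english")
    X = vec.fit_transform(docs)                          # (n_docs, vocab)
    # Stand-in embedding: SVD over TF-IDF instead of a transformer
    n_comp = min(50, X.shape[0] - 1, X.shape[1] - 1)
    emb = TruncatedSVD(n_components=n_comp, random_state=seed).fit_transform(X)
    labels = KMeans(n_clusters=n_topics, n_init=10,
                    random_state=seed).fit_predict(emb)
    vocab = np.array(vec.get_feature_names_out())
    topics = {}
    for k in range(n_topics):
        # mean TF-IDF weight of each term within the cluster
        center = np.asarray(X[labels == k].mean(axis=0)).ravel()
        topics[k] = vocab[center.argsort()[::-1][:n_keywords]].tolist()
    return labels, topics
```

The per-cluster keyword lists are what editors would inspect when naming topics against the 80% interpretability bar.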
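For the evaluation requirement, two cheap automatic proxies are topic diversity (overlap between topics' top keywords) and co-occurrence coherence. A minimal sketch, assuming topics are represented as keyword lists (the coherence function is a simplified UMass-style score, not a full NPMI implementation):

```python
import math
from itertools import combinations


def topic_diversity(topics):
    """Fraction of unique words across all topics' top keywords;
    values near 1.0 mean little redundancy between topics."""
    all_words = [w for words in topics for w in words]
    return len(set(all_words)) / len(all_words)


def umass_coherence(topic_words, docs):
    """Simplified UMass-style coherence: average smoothed log
    conditional probability that keyword pairs co-occur in a document.
    Higher is better; coherent topics co-occur often."""
    doc_sets = [set(d.lower().split()) for d in docs]
    score, pairs = 0.0, 0
    for w1, w2 in combinations(topic_words, 2):
        d1 = sum(w1 in s for s in doc_sets)          # docs containing w1
        d12 = sum(w1 in s and w2 in s for s in doc_sets)  # co-occurrence
        if d1:
            score += math.log((d12 + 1) / d1)        # +1 smoothing
            pairs += 1
    return score / pairs if pairs else 0.0
```

Stability would be measured separately, e.g. by retraining on bootstrap samples and comparing keyword overlap between runs, while business usefulness ultimately comes from the editors' naming exercise.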
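For drift monitoring, one simple month-over-month check is to match each current topic to its best previous topic by keyword overlap; topics with no good match are candidates for "emerging theme" review. A sketch under the assumption that topics are stored as `{topic_id: top_keywords}` dictionaries:

```python
def topic_overlap(prev_topics, curr_topics):
    """Match each current topic to its best previous topic by Jaccard
    similarity of top-keyword sets. Low best-match scores flag new or
    drifted topics worth surfacing to editors."""
    def jaccard(a, b):
        a, b = set(a), set(b)
        return len(a & b) / len(a | b) if a | b else 0.0

    report = {}
    for cid, cwords in curr_topics.items():
        best_pid, best = None, 0.0
        for pid, pwords in prev_topics.items():
            s = jaccard(cwords, pwords)
            if s > best:
                best_pid, best = pid, s
        report[cid] = (best_pid, best)  # (matched prev topic, similarity)
    return report
```

Tracking these match scores across monthly refreshes gives a concrete drift signal, and persistently unmatched topics can trigger a retraining or vocabulary-refresh decision.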