Classify Quantum Computing Research Topics

Business Context

QCompute Insights, a technical content platform serving 2M monthly readers, wants to automatically classify short educational articles into quantum computing topics so editors can route content, improve search, and recommend related material. One high-priority class is quantum entanglement, and the team needs a lightweight supervised model that can distinguish entanglement-related content from other introductory quantum computing topics.

Dataset

You are given a labeled corpus of article excerpts, lecture summaries, and FAQ answers collected from the company knowledge base.

Feature Group	Count	Examples
Text fields	3	title, short_summary, body_excerpt
Metadata	5	source_type, author_tier, publish_year, article_length, reading_level
Engineered text features	6	token_count, unique_token_ratio, entanglement_keyword_count, qubit_keyword_count, tf-idf vectors, bigram indicators
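The count and ratio features in the table can be sketched in a few lines. This is a minimal illustration, not the platform's actual feature code: the keyword sets and the tokenization rule are assumptions, since the source does not say which terms are counted or how text is split.

```python
import re

# Hypothetical keyword lists; the source does not specify which terms are counted.
ENTANGLEMENT_TERMS = {"entanglement", "entangled", "bell", "epr"}
QUBIT_TERMS = {"qubit", "qubits"}


def text_features(text: str) -> dict:
    """Compute simple count/ratio features from one text field."""
    # Assumed tokenization: lowercase alphanumeric runs.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    n = len(tokens)
    return {
        "token_count": n,
        # Guard against empty excerpts, which the dataset is known to contain.
        "unique_token_ratio": len(set(tokens)) / n if n else 0.0,
        "entanglement_keyword_count": sum(t in ENTANGLEMENT_TERMS for t in tokens),
        "qubit_keyword_count": sum(t in QUBIT_TERMS for t in tokens),
    }


example = text_features(
    "Two entangled qubits share a Bell state; each qubit alone looks random."
)
```

Keyword counts like these are easy to explain to editors, but they overlap heavily with the tf-idf terms, so a linear model may split weight between them.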

Size: 48K documents, 14 structured features plus sparse text vectors
Target: Binary — document is primarily about quantum entanglement (1) vs other quantum computing concepts (0)
Class balance: Moderately imbalanced — 22% positive, 78% negative
Missing data: 8% missing in reading_level and 3% missing in author_tier; some documents have empty body excerpts
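One plausible way to handle the missing values and the 22/78 class split is sketched below. The "unknown" category, the body_is_empty flag, and the helper names are illustrative assumptions rather than part of the problem statement. The weighting formula is the common "balanced" heuristic, n_samples / (n_classes * class_count).

```python
from collections import Counter


def fill_missing(doc: dict) -> dict:
    """Return a copy of the document with explicit markers for missing values."""
    out = dict(doc)
    # Assumption: treat missing categorical metadata as its own category.
    for field in ("reading_level", "author_tier"):
        if out.get(field) in (None, ""):
            out[field] = "unknown"
    # Assumption: keep empty excerpts but flag them so the model can use the signal.
    body = (out.get("body_excerpt") or "").strip()
    out["body_excerpt"] = body
    out["body_is_empty"] = int(not body)
    return out


def balanced_class_weights(labels) -> dict:
    """Weight each class inversely to its frequency."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * cnt) for cls, cnt in counts.items()}


# Toy label vector with the stated 22% / 78% split.
weights = balanced_class_weights([1] * 22 + [0] * 78)
```

With this split the positive class gets a weight of roughly 2.27 and the negative class roughly 0.64, which is one option for counteracting the imbalance without resampling.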

Success Criteria

A good solution should achieve strong ranking quality and reliable classification performance for editorial workflows:

F1 score >= 0.82 on the held-out test set
ROC-AUC >= 0.90
Precision >= 0.85 at a threshold suitable for auto-tagging
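The precision floor suggests choosing the decision threshold from validation scores instead of defaulting to 0.5. The sketch below assumes the model outputs a score per document and picks the lowest threshold that meets the floor, which keeps recall as high as possible among qualifying thresholds. The labels and scores are toy values for illustration, not results from the dataset.

```python
def precision_recall_f1(labels, scores, threshold):
    """Compute precision, recall, and F1 when predicting 1 for score >= threshold."""
    tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
    fn = sum(1 for y, s in zip(labels, scores) if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def pick_threshold(labels, scores, min_precision=0.85):
    """Return (threshold, precision, recall, f1) for the lowest threshold
    that meets the precision floor, or None if no threshold does."""
    for t in sorted(set(scores)):
        p, r, f1 = precision_recall_f1(labels, scores, t)
        if p >= min_precision:
            return t, p, r, f1
    return None


# Toy validation data (hypothetical).
labels = [0, 0, 1, 0, 1, 1, 0, 1, 1, 1]
scores = [0.05, 0.20, 0.35, 0.40, 0.55, 0.60, 0.65, 0.80, 0.90, 0.95]
choice = pick_threshold(labels, scores)
```

On this toy data the precision floor is only met at a high threshold, at the cost of recall. If the same happens on real validation data, the F1 and precision targets may need separate thresholds, for example one for auto-tagging and one for editor review.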

Constraints

Inference should complete in under 50 ms per document in batch API serving
Editors need basic interpretability: top terms and feature weights should be explainable
Retraining should be simple enough to run weekly as new content is published
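For the interpretability constraint, a linear model allows a per-document explanation as a list of feature contributions (weight times feature value). The sketch below assumes a logistic-regression-style model. The weights, bias, and feature names are hypothetical placeholders, not fitted values.

```python
import math


def explain(weights: dict, bias: float, features: dict, top_k: int = 3):
    """Return the predicted probability and the top_k contributions by magnitude."""
    contributions = {
        name: weights.get(name, 0.0) * value for name, value in features.items()
    }
    logit = bias + sum(contributions.values())
    probability = 1.0 / (1.0 + math.exp(-logit))
    top = sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return probability, top[:top_k]


# Hypothetical weights and one document's feature values.
weights = {"entangled": 2.0, "bell state": 1.5, "gate": -1.0}
features = {"entangled": 1.0, "gate": 0.5, "bell state": 0.0}
probability, top_terms = explain(weights, bias=-1.0, features=features, top_k=2)
```

Because the explanation is a dictionary lookup and a sort, it adds little to per-document latency, and the same weight table can be regenerated after each weekly retrain.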

Deliverables

Build a binary text classification pipeline for entanglement-related content
Explain preprocessing and feature engineering choices for mixed text + metadata inputs
Compare at least one linear baseline with one stronger model
Select an evaluation threshold based on business needs, not accuracy alone
Describe how you would deploy and monitor the model in production
