Business Context
QCompute Insights, a technical content platform serving 2M monthly readers, wants to automatically classify short educational articles into quantum computing topics so editors can route content, improve search, and recommend related material. One high-priority class is quantum entanglement, and the team needs a lightweight supervised model that can distinguish entanglement-related content from other introductory quantum computing topics.
Dataset
You are given a labeled corpus of article excerpts, lecture summaries, and FAQ answers collected from the company knowledge base.
| Feature Group | Count | Examples |
|---|---|---|
| Text fields | 3 | title, short_summary, body_excerpt |
| Metadata | 5 | source_type, author_tier, publish_year, article_length, reading_level |
| Engineered text features | 6 | token_count, unique_token_ratio, entanglement_keyword_count, qubit_keyword_count, TF-IDF vectors, bigram indicators |
- Size: 48K documents, 14 features across the three groups above (the TF-IDF vectors and bigram indicators are sparse)
- Target: Binary — document is primarily about quantum entanglement (1) vs other quantum computing concepts (0)
- Class balance: Moderately imbalanced — 22% positive, 78% negative
- Missing data: 8% missing in reading_level and 3% missing in author_tier; some documents have empty body excerpts
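One way to handle the gaps listed above before feature extraction is sketched below. It is a minimal illustration, not a prescribed preprocessing step. The field names come from the feature table, while the function name `prepare_document`, the `"unknown"` fill value, and the `*_missing` flag names are assumptions made for this example.

```python
from typing import Any, Dict, Optional

TEXT_FIELDS = ("title", "short_summary", "body_excerpt")
SPARSE_METADATA = ("reading_level", "author_tier")  # fields reported as partly missing


def prepare_document(doc: Dict[str, Optional[Any]]) -> Dict[str, Any]:
    """Return a copy of `doc` with one combined text field and missingness flags.

    Hypothetical helper: the output keys are illustrative, not part of the dataset.
    """
    out = dict(doc)

    # Join the text fields, skipping None or whitespace-only values, so a
    # document with an empty body excerpt falls back to title + summary.
    parts = [str(doc.get(field) or "").strip() for field in TEXT_FIELDS]
    out["text"] = " ".join(part for part in parts if part)
    out["body_excerpt_missing"] = int(not str(doc.get("body_excerpt") or "").strip())

    # Keep missingness as an explicit signal instead of silently imputing it away.
    for field in SPARSE_METADATA:
        value = doc.get(field)
        out[f"{field}_missing"] = int(value is None)
        out[field] = "unknown" if value is None else value  # assumed fill value
    return out
```

Treating "missing" as its own category keeps every document usable and lets a linear model learn whether missingness itself is informative. Whether that is preferable to dropping or imputing rows is a judgment call to validate on held-out data.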
Success Criteria
A good solution should rank documents well (ROC-AUC) and classify them reliably enough for editorial workflows (F1 and precision):
- F1 score >= 0.82 on the held-out test set
- ROC-AUC >= 0.90
- Precision >= 0.85 at a threshold suitable for auto-tagging
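The precision criterion above depends on where the decision threshold is set, so the sketch below shows one possible selection rule: scan candidate thresholds from low to high and keep the first one whose precision reaches the floor. Recall can only fall as the threshold rises, so this is the qualifying threshold with the highest recall. The function names and the choice of the observed scores as candidates are assumptions for illustration. In practice the scan would run on a validation split, not the held-out test set.

```python
from typing import Optional, Sequence, Tuple


def precision_recall_f1(
    y_true: Sequence[int], scores: Sequence[float], threshold: float
) -> Tuple[float, float, float]:
    """Compute precision, recall, and F1 when predicting 1 for score >= threshold."""
    tp = fp = fn = 0
    for label, score in zip(y_true, scores):
        predicted_positive = score >= threshold
        if predicted_positive and label == 1:
            tp += 1
        elif predicted_positive:
            fp += 1
        elif label == 1:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def pick_threshold(
    y_true: Sequence[int], scores: Sequence[float], min_precision: float = 0.85
) -> Optional[Tuple[float, float, float, float]]:
    """Return (threshold, precision, recall, f1) for the lowest qualifying threshold.

    Returns None when no candidate threshold reaches `min_precision`.
    """
    for threshold in sorted(set(scores)):  # candidates: every observed score
        precision, recall, f1 = precision_recall_f1(y_true, scores, threshold)
        if precision >= min_precision:
            return threshold, precision, recall, f1
    return None
```

The F1 returned at the chosen threshold can then be checked against the 0.82 target. If no threshold satisfies both targets, that points to a model or feature problem rather than a thresholding problem.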
Constraints
- Inference should complete in under 50 ms per document in batch API serving
- Editors need basic interpretability: top terms and feature weights should be explainable
- Retraining should be simple enough to run weekly as new content is published
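For the interpretability constraint, a linear model makes the "top terms" view straightforward: each feature has one weight, and its sign shows which class it pushes toward. The sketch below assumes the feature names and fitted weights are already available as two parallel sequences. The function name `top_terms` and the output keys are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def top_terms(
    feature_names: Sequence[str], weights: Sequence[float], k: int = 5
) -> Dict[str, List[Tuple[str, float]]]:
    """List the k strongest features pushing toward each class of a linear model.

    Positive weights favor the entanglement class (1) and negative weights
    favor the other class (0), assuming the usual sign convention.
    """
    if len(feature_names) != len(weights):
        raise ValueError("feature_names and weights must have the same length")

    ranked = sorted(zip(feature_names, weights), key=lambda pair: pair[1], reverse=True)
    # Filter by sign so a short vocabulary never reports zero or wrong-sign weights.
    positive = [(name, w) for name, w in ranked[:k] if w > 0]
    negative = [(name, w) for name, w in ranked[::-1][:k] if w < 0]
    return {"toward_entanglement": positive, "toward_other": negative}
```

Weights are only comparable across features that share a scale, so TF-IDF terms and unscaled metadata such as `article_length` should be standardized or reported in separate lists before editors read them side by side.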
Deliverables
- Build a binary text classification pipeline for entanglement-related content
- Explain preprocessing and feature engineering choices for mixed text + metadata inputs
- Compare at least one linear baseline with one stronger model
- Select an evaluation threshold based on business needs, not accuracy alone
- Describe how you would deploy and monitor the model in production
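For the monitoring deliverable, one lightweight check that fits a weekly retraining cadence is to compare the distribution of predicted scores on new content against a reference window. The sketch below uses the population stability index (PSI) over equal-width score bins. This is one possible drift signal, not a required method. The function names, the ten-bin layout, and the smoothing constant are assumptions, and scores are assumed to be probabilities in [0, 1].

```python
import math
from typing import List, Sequence


def score_histogram(scores: Sequence[float], n_bins: int = 10, eps: float = 1e-6) -> List[float]:
    """Bin scores in [0, 1] into equal-width bins and return smoothed proportions."""
    if not scores:
        raise ValueError("scores must not be empty")
    counts = [0] * n_bins
    for score in scores:
        # Clamp so that 1.0 lands in the last bin and stray negatives in the first.
        index = max(0, min(int(score * n_bins), n_bins - 1))
        counts[index] += 1
    total = len(scores)
    # Floor each proportion at eps so empty bins do not cause log(0) or division by zero.
    return [max(count / total, eps) for count in counts]


def population_stability_index(
    reference: Sequence[float], current: Sequence[float], n_bins: int = 10
) -> float:
    """PSI = sum over bins of (current - reference) * ln(current / reference)."""
    ref = score_histogram(reference, n_bins)
    cur = score_histogram(current, n_bins)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))
```

A PSI near zero means the score distribution is stable. Values above roughly 0.2 are often treated as a signal to investigate, though that cutoff is a rule of thumb and should be tuned against observed week-to-week variation. Score drift alone does not measure accuracy, so it is best paired with periodic editor-labeled samples to track precision on auto-tagged documents.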