Business Context
At NexusAI, recruiting teams review hundreds of candidate responses to open-ended application questions about reinforcement learning (RL) and large language model (LLM) research interests. They want an NLP system that automatically categorizes each response into research themes so recruiters can route candidates to the right interview panel.
Data
You are given 180,000 historical candidate responses collected over 18 months.
- Task: Multi-label classification of research-interest statements
- Labels: alignment_safety, reasoning_agents, rlhf_post_training, multimodal, efficient_training, evaluation, other
- Volume: ~180K responses, each carrying 1-3 labels
- Text length: 20-350 words, median 95 words
- Language: English only
- Distribution: Long-tailed; alignment_safety and reasoning_agents are common, while multimodal and efficient_training are less frequent
- Noise: Some responses contain buzzwords, copied text, or vague statements with no actionable theme
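Since each response carries 1-3 themes, the labels map naturally onto a 7-column binary indicator matrix. A minimal sketch using scikit-learn's MultiLabelBinarizer, with hypothetical example label sets (the responses themselves are omitted):

```python
from sklearn.preprocessing import MultiLabelBinarizer

LABELS = [
    "alignment_safety", "reasoning_agents", "rlhf_post_training",
    "multimodal", "efficient_training", "evaluation", "other",
]

# Hypothetical label sets for three responses (1-3 labels each).
y_raw = [
    ["alignment_safety"],
    ["reasoning_agents", "rlhf_post_training"],
    ["multimodal", "efficient_training", "evaluation"],
]

# Fix the column order up front so it matches LABELS everywhere downstream.
mlb = MultiLabelBinarizer(classes=LABELS)
Y = mlb.fit_transform(y_raw)  # shape (3, 7); one indicator column per theme
```

Fixing `classes=LABELS` keeps the column order stable across train, validation, and serving, which matters once per-label thresholds are tuned.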
Success Criteria
A good solution should achieve macro-F1 >= 0.82, micro-F1 >= 0.88, and precision >= 0.75 on each minority label. Predicted probabilities should be well enough calibrated to support threshold-based routing.
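These criteria can be checked with standard scikit-learn metrics. The sketch below uses synthetic labels and predictions (a 5% random label-flip corruption of the ground truth) purely to illustrate the metric calls:

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score

rng = np.random.default_rng(0)
n_samples, n_labels = 200, 7

# Synthetic multi-label ground truth; flip 5% of entries to mimic model errors.
y_true = rng.integers(0, 2, size=(n_samples, n_labels))
y_pred = y_true.copy()
flip = rng.random((n_samples, n_labels)) < 0.05
y_pred[flip] = 1 - y_pred[flip]

# Macro-F1 averages per-label F1 equally (sensitive to minority labels);
# micro-F1 pools all decisions (dominated by frequent labels).
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)
micro_f1 = f1_score(y_true, y_pred, average="micro", zero_division=0)

# Per-label precision, for the >= 0.75 minority-label check.
per_label_precision = precision_score(y_true, y_pred, average=None, zero_division=0)
```

Reporting macro and micro F1 side by side is what exposes a model that coasts on alignment_safety and reasoning_agents while failing the rare themes.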
Constraints
- Inference latency must stay under 120 ms per response in batch scoring
- Training must fit on a single A10 or T4 GPU
- Recruiters need interpretable outputs, including top predicted themes and confidence scores
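The interpretability requirement amounts to returning the top predicted themes with sigmoid confidences per response. A minimal numpy sketch; the `top_themes` helper and the logits are hypothetical, standing in for the classifier head's output:

```python
import numpy as np

LABELS = ["alignment_safety", "reasoning_agents", "rlhf_post_training",
          "multimodal", "efficient_training", "evaluation", "other"]

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-np.asarray(x, dtype=float)))

def top_themes(logits, top_k=3):
    """Return the top_k (theme, confidence) pairs for one response."""
    probs = sigmoid(logits)
    order = np.argsort(probs)[::-1][:top_k]
    return [(LABELS[i], round(float(probs[i]), 3)) for i in order]

# Made-up logits for a single response.
themes = top_themes([2.1, 0.4, -1.0, -2.5, -0.2, 1.3, -3.0])
```

Independent sigmoids (rather than a softmax) are the natural choice here: a response can plausibly score high on both alignment_safety and evaluation at once, and the per-theme confidences feed directly into threshold-based routing.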
Requirements
- Build a multi-label NLP classifier for candidate research-interest responses.
- Design a preprocessing pipeline for short, technical free-text answers.
- Implement training and evaluation in modern Python using transformers.
- Handle class imbalance and threshold tuning for multi-label outputs.
- Describe how you would analyze vague, overlapping, or emerging research themes.
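The imbalance-handling and threshold-tuning requirements can be sketched independently of the encoder choice. The snippet below uses synthetic validation labels and scores; the `pos_weight` vector is the one that would be passed to `torch.nn.BCEWithLogitsLoss(pos_weight=...)` when fine-tuning the transformer:

```python
import numpy as np
from sklearn.metrics import f1_score

rng = np.random.default_rng(1)
n, k = 500, 7

# Synthetic validation set with a long-tailed label distribution
# (rare columns mimic multimodal / efficient_training).
base_rates = np.array([0.40, 0.35, 0.20, 0.05, 0.05, 0.15, 0.10])
y_true = (rng.random((n, k)) < base_rates).astype(int)

# Synthetic model scores: positives tend higher, with deliberate overlap.
scores = np.clip(0.45 * y_true + 0.65 * rng.random((n, k)), 0.0, 1.0)

# 1) Class imbalance: up-weight rare positives by the per-label neg/pos
#    ratio. This is the vector a BCEWithLogitsLoss would take as pos_weight.
pos = y_true.sum(axis=0)
pos_weight = (n - pos) / np.maximum(pos, 1)

# 2) Per-label decision thresholds: grid-search each label's threshold
#    to maximize its validation F1, instead of a global 0.5 cutoff.
grid = np.linspace(0.1, 0.9, 17)
thresholds = np.array([
    grid[np.argmax([f1_score(y_true[:, j], scores[:, j] >= t, zero_division=0)
                    for t in grid])]
    for j in range(k)
])
y_pred = (scores >= thresholds).astype(int)
macro_f1_tuned = f1_score(y_true, y_pred, average="macro", zero_division=0)
```

Because the grid includes 0.5, the tuned thresholds can only match or beat a flat 0.5 cutoff on validation macro-F1; rare labels typically end up with lower thresholds, trading a little precision for the recall they otherwise lack.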