Business Context
TalentMatch, a recruiting platform for mid-sized employers, wants to automate first-pass resume screening for software engineering roles. Recruiters currently review thousands of resumes manually, so the goal is to rank candidates or classify each one as advance or reject based on resume text.
Data
- Volume: 180,000 labeled resumes collected over 18 months
- Text length: 80-2,500 words per resume (median: 620 words)
- Language: Primarily English (96%), with minor formatting noise from PDF/DOCX extraction
- Labels: Binary, with classes advance (28%) and reject (72%)
- Common issues: OCR artifacts, repeated headers, bullet-heavy formatting, inconsistent section names, and missing education/work history fields
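The noise listed above suggests a light normalization pass before any feature extraction. The following is a minimal sketch using only the standard library; the function name `clean_resume_text`, the `SECTION_ALIASES` map, and the `max_repeats` cutoff are illustrative assumptions, not part of the TalentMatch data specification.

```python
import re
from collections import Counter

CONTROL_RE = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")  # stray control chars from extraction
BULLET_RE = re.compile(r"^[\s•▪●◦·*-]+")                       # leading bullet glyphs
SPACE_RE = re.compile(r"\s+")

# Hypothetical alias map for inconsistent section names; extend from real data.
SECTION_ALIASES = {
    "work history": "EXPERIENCE",
    "professional experience": "EXPERIENCE",
    "employment": "EXPERIENCE",
    "academic background": "EDUCATION",
}


def clean_resume_text(raw: str, max_repeats: int = 2) -> str:
    """Normalize extracted resume text (assumed heuristics, tune on real samples)."""
    lines = []
    for line in raw.splitlines():
        line = CONTROL_RE.sub(" ", line)
        line = BULLET_RE.sub("", line)
        line = SPACE_RE.sub(" ", line).strip()
        if not line:
            continue
        # Map known section-name variants to one canonical heading.
        heading = line.rstrip(":").lower()
        lines.append(SECTION_ALIASES.get(heading, line))

    # Lines repeated more than max_repeats times are treated as page
    # headers/footers: keep the first occurrence, drop the rest.
    counts = Counter(line.lower() for line in lines)
    seen, kept = set(), []
    for line in lines:
        key = line.lower()
        if counts[key] > max_repeats:
            if key in seen:
                continue
            seen.add(key)
        kept.append(line)
    return "\n".join(kept)
```

Missing education or work-history fields are deliberately left alone here; one option is to expose them to the model as explicit "section absent" indicators rather than imputing text.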
Success Criteria
A good solution should achieve F1 ≥ 0.82 and precision ≥ 0.80 on the minority advance class (high precision keeps recruiters from being flooded with weak candidates), with inference latency under 150 ms per resume in batch scoring.
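One way to read these criteria together is as a constrained threshold search: among decision thresholds whose precision on the advance class is at least 0.80, pick the one with the highest F1. The sketch below assumes labels encoded as 1 for advance and 0 for reject and a list of model scores from a validation set; the function names and the toy data are hypothetical.

```python
from typing import List, Optional, Tuple


def precision_recall_f1(y_true: List[int], y_pred: List[int]) -> Tuple[float, float, float]:
    """Metrics for the positive (advance = 1) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def pick_threshold(
    y_true: List[int], scores: List[float], min_precision: float = 0.80
) -> Optional[Tuple[float, float, float, float]]:
    """Return (threshold, f1, precision, recall) maximizing F1 subject to
    precision >= min_precision, or None if no threshold qualifies."""
    best = None
    for thr in sorted(set(scores)):
        y_pred = [1 if s >= thr else 0 for s in scores]
        precision, recall, f1 = precision_recall_f1(y_true, y_pred)
        if precision >= min_precision and (best is None or f1 > best[1]):
            best = (thr, f1, precision, recall)
    return best


# Toy validation data (made up for illustration only).
y_val = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
s_val = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10, 0.05]
print(pick_threshold(y_val, s_val))
```

The threshold should be chosen on a held-out validation split and then frozen before measuring the final F1 and precision on the test split; otherwise the reported numbers are optimistic.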
Constraints
- The model must run in a secure environment; resumes cannot be sent to third-party APIs
- Recruiters need basic explainability for why a resume was advanced
- The system should be easy to retrain weekly as new hiring outcomes arrive
- Bias-sensitive attributes such as name, gendered terms, age indicators, and full addresses should not drive predictions
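The last constraint can be partly addressed by masking sensitive spans before the text reaches any model, which also runs entirely inside the secure environment. The sketch below is an assumed regex-based approach; the patterns and placeholder tokens are illustrative heuristics, will miss cases, and do not cover personal names, which would typically need a locally hosted NER model or the structured name field from the resume parser.

```python
import re

# (pattern, placeholder) pairs, applied in order. All patterns are heuristic
# assumptions and should be validated against real resumes.
REDACTIONS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"), "[EMAIL]"),
    (re.compile(r"\+?\(?\d{1,4}\)?[\s.-]\(?\d{2,4}\)?[\s.-]\d{3,4}(?:[\s.-]\d{3,4})?"), "[PHONE]"),
    (re.compile(
        r"\b\d{1,5}\s+(?:\w+\s){1,3}"
        r"(?:street|st|avenue|ave|road|rd|boulevard|blvd|lane|ln|drive|dr)\b",
        re.I), "[ADDRESS]"),
    (re.compile(r"\b(?:date of birth|dob|born|age)\b\s*:?\s*\d+(?:[/.-]\d+)*", re.I), "[AGE]"),
    (re.compile(r"\b(?:he|she|him|her|his|hers|mr|mrs|ms|miss)\b\.?", re.I), "[GENDERED]"),
]


def redact(text: str) -> str:
    """Replace bias-sensitive spans with neutral placeholder tokens."""
    for pattern, token in REDACTIONS:
        text = pattern.sub(token, text)
    return text


sample = ("Contact: jane.doe@example.com, 555-123-4567, 42 Elm Street. "
          "Age: 34. She led her team in Python.")
print(redact(sample))
```

Masking with placeholders rather than deleting keeps sentence structure intact for a transformer tokenizer. Redaction alone does not remove proxy signals (for example graduation years or club memberships), so it is best paired with per-group error analysis on the validation set.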
Requirements
- Build a binary text classification pipeline for resume screening
- Define a realistic preprocessing workflow for noisy resume text
- Implement a strong baseline and one transformer-based model in Python
- Address class imbalance and threshold selection
- Evaluate the model with recruiter-relevant metrics and error analysis
- Describe how you would reduce bias and prevent leakage from non-job-related signals
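For the class-imbalance requirement, one common starting point is inverse-frequency class weighting, computed as n_samples / (n_classes * class_count), which is the same heuristic scikit-learn applies for `class_weight="balanced"`. The sketch below works through that arithmetic for the 28% / 72% split stated in the Data section; the helper name is hypothetical, and weighting is only one option alongside resampling or threshold tuning.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence


def balanced_class_weights(labels: Sequence[Hashable]) -> Dict[Hashable, float]:
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# 100 labels mirroring the stated 28% advance / 72% reject split.
labels = ["advance"] * 28 + ["reject"] * 72
weights = balanced_class_weights(labels)
print(weights)
```

With these weights each class contributes the same total weight to the loss (28 × 1.786 ≈ 72 × 0.694 ≈ 50), so the minority advance class is not drowned out during training. Weighted training shifts the score distribution, so the decision threshold still needs to be selected on validation data afterwards.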