Business Context
TalentMatch, a recruiting platform, wants to automatically extract candidate skills from uploaded resumes so recruiters can search profiles consistently and match applicants to job requirements. The current keyword-based parser misses synonyms, abbreviations, and skills embedded in project descriptions.
Data
- Volume: 180,000 historical resumes, with 22,000 manually annotated for skill spans
- Text length: 150-2,500 words per resume (median: 680 words)
- Language: English only for the first release
- Format: PDF/DOCX converted to plain text; OCR noise appears in ~12% of files
- Label distribution: 1-40 skill mentions per resume; common labels include programming languages, frameworks, cloud tools, analytics tools, and soft skills
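Because the text comes from PDF/DOCX conversion and roughly 12% of files carry OCR noise, a light cleaning pass before annotation and inference is a reasonable starting point. The sketch below is one possible approach using only the standard library; the function name `clean_resume_text` and the bullet-glyph set are assumptions for illustration, not part of the TalentMatch system.

```python
import re
import unicodedata

# Bullet glyphs often emitted by PDF/DOCX converters (assumed, non-exhaustive set).
_BULLETS = "•●▪◦‣∙·"


def clean_resume_text(raw: str) -> str:
    """Normalize common converter/OCR artifacts while keeping line structure."""
    # Fold ligatures, non-breaking spaces, and full-width characters.
    text = unicodedata.normalize("NFKC", raw)
    # Use "\n" as the only line ending.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Heuristic: re-join words hyphenated across a line break ("pipe-\nlines").
    # This can also merge genuine hyphenated compounds, so review it on real data.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Map bullet glyphs at the start of a line to a plain "- " marker.
    text = re.sub(rf"^[ \t]*[{re.escape(_BULLETS)}][ \t]*", "- ", text, flags=re.M)
    # Collapse runs of spaces/tabs and trim spaces around newlines.
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r" *\n *", "\n", text)
    # Allow at most one blank line in a row.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

Cleaning shifts character offsets, so skill-span annotations should either be made on the cleaned text or be remapped after cleaning.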
Success Criteria
A good solution should achieve entity-level F1 >= 0.88 on skill extraction and recall >= 0.90 for high-value technical skills, and it should support batch inference at <300 ms per resume page. Extracted skills should also map to a normalized skill taxonomy with high precision.
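To make the F1 and recall targets concrete, entity-level scores can be computed by comparing predicted and gold spans per resume. The sketch below assumes exact-match scoring (same start, end, and label) with micro-averaging across resumes; the brief does not specify the matching rule, so treat both choices, along with the `Span` representation and the `entity_prf` name, as assumptions.

```python
from typing import Iterable, Set, Tuple

# Assumed span representation: (start_char, end_char, label).
Span = Tuple[int, int, str]


def entity_prf(gold: Iterable[Set[Span]], pred: Iterable[Set[Span]]) -> Tuple[float, float, float]:
    """Micro-averaged exact-match precision, recall, and F1 over documents."""
    tp = fp = fn = 0
    for gold_spans, pred_spans in zip(gold, pred):
        tp += len(gold_spans & pred_spans)   # spans found with exact boundaries and label
        fp += len(pred_spans - gold_spans)   # predicted spans with no gold match
        fn += len(gold_spans - pred_spans)   # gold spans the model missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

Recall for high-value technical skills could be obtained with the same function by filtering both span sets to the relevant labels before scoring.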
Constraints
- Resume text is highly unstructured, with bullets, tables, headers, and inconsistent capitalization
- The model must run in a secure environment with no external API calls
- Recruiters need normalized outputs such as "PyTorch", "SQL", and "Project Management", not just raw text fragments
Requirements
- Design an NLP pipeline to extract skill entities from unstructured resume text.
- Explain how you would preprocess noisy resume text from PDF/DOCX sources.
- Build a modern Python implementation using a transformer-based NER model.
- Add a normalization step to map extracted spans to a canonical skill dictionary.
- Describe how you would evaluate span extraction, normalization quality, and failure cases.
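As a starting point for the extraction and normalization requirements, the sketch below shows one possible post-processing step: merging per-token BIO tags (the usual output shape of a transformer token classifier) into character spans, then mapping each span to a canonical skill. The model itself is omitted; the tag scheme, the `ALIASES` table, the helper names, and the 0.85 similarity cutoff are all hypothetical. The Hugging Face `transformers` token-classification pipeline offers an `aggregation_strategy` option that performs a comparable grouping.

```python
import difflib
import re
from typing import List, Optional, Tuple

# Hypothetical alias table: lowercase surface form -> canonical skill name.
ALIASES = {
    "pytorch": "PyTorch",
    "torch": "PyTorch",
    "sql": "SQL",
    "structured query language": "SQL",
    "project management": "Project Management",
    "project mgmt": "Project Management",
}


def whitespace_offsets(text: str) -> List[Tuple[int, int]]:
    """Character offsets of whitespace-separated tokens (stand-in for a real tokenizer)."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def bio_to_spans(offsets: List[Tuple[int, int]], tags: List[str]) -> List[Tuple[int, int, str]]:
    """Merge per-token BIO tags into (start_char, end_char, label) spans."""
    spans = []
    current = None  # [start, end, label] of the span being built
    for (start, end), tag in zip(offsets, tags):
        prefix, _, label = tag.partition("-")
        if prefix == "I" and current and current[2] == label:
            current[1] = end              # extend the open span
            continue
        if current:
            spans.append(tuple(current))  # close the open span
            current = None
        if prefix in ("B", "I"):          # a stray I- tag starts a new span
            current = [start, end, label]
    if current:
        spans.append(tuple(current))
    return spans


def normalize_skill(span: str, cutoff: float = 0.85) -> Optional[str]:
    """Map a raw span to a canonical skill, or None when no confident match exists."""
    key = " ".join(span.lower().split())  # case-fold and collapse whitespace
    if key in ALIASES:
        return ALIASES[key]
    close = difflib.get_close_matches(key, list(ALIASES), n=1, cutoff=cutoff)
    return ALIASES[close[0]] if close else None
```

Returning `None` below the cutoff favors precision over coverage in the taxonomy mapping; unmatched spans can be logged for failure analysis and dictionary expansion. Very short skill names (for example single-letter languages) are probably safer with exact matching only.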