Extract Resume Skills from CVs

Business Context

TalentMatch, a recruiting platform, wants to automatically extract candidate skills from uploaded resumes so recruiters can search profiles consistently and match applicants to job requirements. The current keyword-based parser misses synonyms, abbreviations, and skills embedded in project descriptions.

Data

Volume: 180,000 historical resumes, with 22,000 manually annotated for skill spans
Text length: 150-2,500 words per resume (median: 680 words)
Language: English only for the first release
Format: PDF/DOCX converted to plain text; OCR noise appears in ~12% of files
Label distribution: 1-40 skill mentions per resume; common labels include programming languages, frameworks, cloud tools, analytics tools, and soft skills
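Resumes at the upper end of this length range will usually not fit in a single pass of a BERT-style encoder, many of which accept at most 512 subword tokens. One possible way to handle this, sketched below, is to split each resume into overlapping word windows and run extraction per window. The function name `window_words` and the default `size` and `overlap` values are illustrative assumptions, not part of the problem statement.

```python
from typing import List, Tuple


def window_words(
    words: List[str], size: int = 200, overlap: int = 50
) -> List[Tuple[int, List[str]]]:
    """Split a word list into overlapping windows.

    Returns (start_index, window_words) pairs so that spans predicted
    inside a window can be mapped back to positions in the full resume.
    The size and overlap defaults are assumptions chosen for illustration.
    """
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    if not words:
        return []

    step = size - overlap
    windows = []
    start = 0
    while True:
        windows.append((start, words[start:start + size]))
        # Stop once the current window reaches the end of the document.
        if start + size >= len(words):
            break
        start += step
    return windows
```

Because neighbouring windows overlap, a skill mention near a boundary can be predicted twice, so predictions would need to be deduplicated by their offsets in the full document.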

Success Criteria

A good solution should achieve entity-level F1 >= 0.88 on skill extraction, recall >= 0.90 for high-value technical skills, and support batch inference at <300 ms per resume page. Extracted skills should also map to a normalized skill taxonomy with high precision.
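To make the entity-level F1 target concrete, here is a minimal sketch of strict-match scoring: a predicted skill counts as correct only if its start offset, end offset, and label all equal a gold annotation. The `Span` layout and the function name `entity_prf` are assumptions for illustration; a real evaluation might also report partial-overlap scores.

```python
from typing import List, Set, Tuple

# (start_char, end_char, label); this layout is an assumption.
Span = Tuple[int, int, str]


def entity_prf(
    gold: List[Set[Span]], pred: List[Set[Span]]
) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall, and F1 over exact span matches.

    gold and pred each hold one set of spans per resume, in the same order.
    """
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)   # spans predicted and annotated
        fp += len(p - g)   # spans predicted but not annotated
        fn += len(g - p)   # spans annotated but missed

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return precision, recall, f1
```

Restricting `gold` and `pred` to the subset of labels treated as high-value technical skills would give the recall figure for that subset with the same function.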

Constraints

Resume text is highly unstructured, with bullets, tables, headers, and inconsistent capitalization
The model must run in a secure environment with no external API calls
Recruiters need normalized outputs such as "PyTorch", "SQL", and "Project Management", not just raw text fragments
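One simple way to produce normalized outputs of this kind is an alias lookup with a fuzzy-match fallback for OCR typos. The sketch below uses only the standard library. The `ALIASES` table, the function name `normalize_skill`, and the 0.85 cutoff are hypothetical; a production taxonomy would be much larger and might use embedding similarity instead.

```python
import difflib
import re
from typing import Optional

# Hypothetical alias table mapping lowercase surface forms to canonical names.
ALIASES = {
    "pytorch": "PyTorch",
    "torch": "PyTorch",
    "sql": "SQL",
    "project management": "Project Management",
    "project mgmt": "Project Management",
}


def _key(span: str) -> str:
    """Lowercase, trim, and collapse internal whitespace."""
    return re.sub(r"\s+", " ", span.strip().lower())


def normalize_skill(span: str, cutoff: float = 0.85) -> Optional[str]:
    """Map an extracted span to a canonical skill name, or None if unknown."""
    key = _key(span)
    if key in ALIASES:
        return ALIASES[key]
    # Fall back to the closest alias by character similarity.
    close = difflib.get_close_matches(key, list(ALIASES), n=1, cutoff=cutoff)
    return ALIASES[close[0]] if close else None
```

Returning `None` rather than a best guess keeps precision high: unmapped spans can be queued for taxonomy review instead of being forced onto the wrong skill. Very short aliases are easy to confuse under fuzzy matching, so the cutoff is deliberately strict.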

Requirements

Design an NLP pipeline to extract skill entities from unstructured resume text.
Explain how you would preprocess noisy resume text from PDF/DOCX sources.
Build a modern Python implementation using a transformer-based NER model.
Add a normalization step to map extracted spans to a canonical skill dictionary.
Describe how you would evaluate span extraction, normalization quality, and failure cases.
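As a starting point for the preprocessing requirement, the sketch below shows a few cleanup steps that commonly help with text converted from PDF/DOCX: Unicode normalization, rejoining words hyphenated across line breaks, unifying bullet glyphs, and collapsing whitespace. The function name `clean_resume_text` and the `BULLETS` set are assumptions, and the rules are heuristics rather than a complete OCR repair.

```python
import re
import unicodedata

# Bullet glyphs often seen in converted resumes (an assumed, incomplete set).
BULLETS = "•▪●◦■‣∙"


def clean_resume_text(raw: str) -> str:
    """Apply light, mostly lossless cleanup to converted resume text."""
    # NFKC folds ligatures (e.g. "ﬁ" to "fi") and non-breaking spaces.
    text = unicodedata.normalize("NFKC", raw)
    # Rejoin words split by a hyphen at a line break. This is a heuristic
    # and can also merge genuinely hyphenated compounds.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Replace any leading bullet glyph with a uniform "- " marker.
    text = re.sub(
        rf"^[ \t]*[{re.escape(BULLETS)}*\-][ \t]+",
        "- ",
        text,
        flags=re.MULTILINE,
    )
    # Collapse runs of spaces and tabs, then runs of blank lines.
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

Line breaks are kept on purpose, since bullets and section headers carry layout cues that a downstream NER model can use. If gold annotations are character offsets on the raw text, any cleanup that changes string length also requires remapping those offsets.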
