Extract Legal Entities from Contracts

Business Context

LexiCore, a contract analytics company, wants to automatically identify key entities in commercial agreements so legal teams can search and review contracts faster. Build a named entity recognition system that extracts structured entities from raw contract text.

Data

You have 180,000 annotated contract clauses collected from SaaS, procurement, and partnership agreements.

Text length: 40-900 tokens per clause, median 180
Language: English only
Entity types: PARTY, EFFECTIVE_DATE, TERM_DATE, GOVERNING_LAW, PAYMENT_TERM, NOTICE_PERIOD
Label format: token-level BIO tags
Class distribution: PARTY and dates are common; GOVERNING_LAW and NOTICE_PERIOD are relatively sparse
Noise: OCR artifacts, inconsistent capitalization, section numbering, and legal boilerplate
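
To make the label format concrete, the following is a minimal sketch of one clause annotated with token-level BIO tags. The clause text, the regex tokenizer, and the tag sequence are all illustrative assumptions; the dataset's actual tokenization scheme is not specified above.

```python
import re

def tokenize_with_offsets(text):
    """Split into word and punctuation tokens, keeping (token, start, end) character offsets."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\w+|[^\w\s]", text)]

# Hypothetical clause and annotation, for illustration only.
clause = "This Agreement is governed by the laws of New York."
tokens = tokenize_with_offsets(clause)

# One tag per token: B- opens an entity, I- continues it, O is outside any entity.
tags = ["O"] * 8 + ["B-GOVERNING_LAW", "I-GOVERNING_LAW", "O"]

for (tok, start, end), tag in zip(tokens, tags):
    print(f"{tok:<12}{start:>3}{end:>4}  {tag}")
```

Because each token carries its start and end position in the original string, the tagged entity can be mapped back to the raw text by slicing from the first tagged token's start to the last tagged token's end.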

Success Criteria

A production-ready solution should achieve entity-level macro F1 >= 0.88, with recall >= 0.93 for PARTY and EFFECTIVE_DATE. The model should support batch processing of uploaded contracts and return spans with character offsets.
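
As a rough illustration of the metric named above, entity-level scoring can be sketched as exact matching on (clause, start, end, label) tuples, with F1 computed per entity type and then averaged. The function name, the tuple layout, and the choice to average only over types that appear in the gold or predicted sets are assumptions made for this sketch, not part of the problem statement.

```python
def entity_level_scores(gold, pred):
    """Score predictions by exact span match.

    gold, pred: sets of (clause_id, start, end, label) tuples.
    Returns per-label precision/recall/F1 and the macro-averaged F1.
    """
    labels = sorted({e[3] for e in gold | pred})
    scores = {}
    for label in labels:
        g = {e for e in gold if e[3] == label}
        p = {e for e in pred if e[3] == label}
        tp = len(g & p)  # a prediction counts only if boundaries and label both match
        precision = tp / len(p) if p else 0.0
        recall = tp / len(g) if g else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = {"precision": precision, "recall": recall, "f1": f1}
    macro_f1 = sum(s["f1"] for s in scores.values()) / len(scores) if scores else 0.0
    return scores, macro_f1

# Hypothetical spans: the second PARTY prediction has a wrong end offset.
gold = {(0, 0, 9, "PARTY"), (0, 20, 35, "EFFECTIVE_DATE"), (1, 5, 13, "PARTY")}
pred = {(0, 0, 9, "PARTY"), (0, 20, 35, "EFFECTIVE_DATE"), (1, 5, 10, "PARTY")}
scores, macro_f1 = entity_level_scores(gold, pred)
```

In this toy case PARTY scores 0.5 on both precision and recall because one boundary is off, which shows why entity-level F1 is stricter than token-level accuracy.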

Constraints

Inference should stay under 150 ms per clause on a single T4 GPU
The solution must preserve original text offsets for downstream highlighting in the UI
Training must fit within 16 GB VRAM
No external API calls are allowed due to customer confidentiality

Requirements

Define named entity recognition and explain how it applies to contract analysis.
Build a token-level NER pipeline using a modern Python stack.
Describe preprocessing for OCR noise, section markers, and offset preservation.
Fine-tune a transformer-based model and justify your architecture choice.
Evaluate the system using entity-level metrics and propose an error analysis plan.
Show how predictions are converted from BIO tags into structured entity spans.
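
For the last requirement, one possible shape of the BIO-to-span conversion is sketched below. It assumes each token comes with (start, end) character offsets into the original clause; the whitespace tokenizer, the function name `bio_to_spans`, and the lenient handling of an I- tag without a preceding B- (treated as the start of a new entity) are illustrative choices, not a prescribed solution.

```python
import re

def bio_to_spans(text, offsets, tags):
    """Merge token-level BIO tags into entity spans with character offsets."""
    spans = []
    current = None  # [label, start, end] of the entity being built
    for (start, end), tag in zip(offsets, tags):
        prefix, _, label = tag.partition("-")
        starts_new = prefix == "B" or (
            prefix == "I" and (current is None or current[0] != label)
        )
        if starts_new:
            if current:
                spans.append(current)
            current = [label, start, end]
        elif prefix == "I":
            current[2] = end  # extend the open entity to this token's end
        else:  # "O" closes any open entity
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return [
        {"label": label, "start": s, "end": e, "text": text[s:e]}
        for label, s, e in spans
    ]

# Hypothetical clause and predicted tags, for illustration only.
text = "Acme Corp. shall give 30 days notice."
offsets = [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]
tags = ["B-PARTY", "I-PARTY", "O", "O", "B-NOTICE_PERIOD", "I-NOTICE_PERIOD", "O"]
spans = bio_to_spans(text, offsets, tags)
```

Since the spans are built from offsets into the untouched input string, `text[start:end]` always reproduces the highlighted text. With a subword tokenizer, the same function could be fed word-level offsets recovered from the tokenizer's offset mapping.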
