Business Context
LexiCore, a contract analytics company, wants to automatically identify key entities in commercial agreements so legal teams can search and review contracts faster. Build a named entity recognition system that extracts structured entities from raw contract text.
Data
You have 180,000 annotated contract clauses collected from SaaS, procurement, and partnership agreements.
- Text length: 40-900 tokens per clause, median 180
- Language: English only
- Entity types: PARTY, EFFECTIVE_DATE, TERM_DATE, GOVERNING_LAW, PAYMENT_TERM, NOTICE_PERIOD
- Label format: token-level BIO tags
- Class distribution: PARTY and the date types (EFFECTIVE_DATE, TERM_DATE) are common; GOVERNING_LAW and NOTICE_PERIOD are relatively sparse
- Noise: OCR artifacts, inconsistent capitalization, section numbering, and legal boilerplate
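To make the token-level BIO label format concrete, here is a minimal sketch of one annotated record. The clause text, the whitespace tokenizer, and the helper name `tokenize_with_offsets` are illustrative assumptions; the actual dataset's tokenization and annotation guidelines (for example, whether "the laws of" belongs inside a GOVERNING_LAW span) are not specified in this brief.

```python
import re

def tokenize_with_offsets(text):
    """Split on whitespace, keeping each token's (token, start, end) character offsets."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]

# Invented example clause, not taken from the dataset.
clause = "This Agreement is governed by the laws of New York ."
tokens = tokenize_with_offsets(clause)

# One BIO tag per token: B- opens an entity, I- continues it, O is outside any entity.
tags = ["O", "O", "O", "O", "O", "O", "O", "O",
        "B-GOVERNING_LAW", "I-GOVERNING_LAW", "O"]

for (token, start, end), tag in zip(tokens, tags):
    print(f"{token:<10} {start:>3} {end:>3}  {tag}")
```

Keeping the character offsets next to each token from the start makes it easier to return spans against the original text later.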
Success Criteria
A production-ready solution should achieve entity-level macro F1 >= 0.88, with recall >= 0.93 for PARTY and EFFECTIVE_DATE. The model should support batch processing of uploaded contracts and return spans with character offsets.
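The success criteria above can be checked with a small scoring routine. The sketch below assumes exact-match scoring, where a predicted entity counts as correct only if its clause, type, start, and end all match a gold entity, and it averages F1 over every type seen in gold or predictions. The brief does not state the matching rule, so both choices are assumptions, as are the function names `entity_prf` and `meets_success_criteria`.

```python
def entity_prf(gold, pred):
    """Per-type precision/recall/F1 and macro F1 under exact span matching.

    gold, pred: iterables of (clause_id, entity_type, start, end) tuples.
    """
    gold, pred = set(gold), set(pred)
    types = sorted({g[1] for g in gold} | {p[1] for p in pred})
    scores = {}
    for t in types:
        g = {x for x in gold if x[1] == t}
        p = {x for x in pred if x[1] == t}
        tp = len(g & p)  # exact matches only
        precision = tp / len(p) if p else 0.0
        recall = tp / len(g) if g else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[t] = {"precision": precision, "recall": recall, "f1": f1}
    macro_f1 = sum(s["f1"] for s in scores.values()) / len(scores) if scores else 0.0
    return scores, macro_f1

def meets_success_criteria(scores, macro_f1):
    """Thresholds from the brief: macro F1 >= 0.88, recall >= 0.93 for two types."""
    high_recall = all(
        scores.get(t, {}).get("recall", 0.0) >= 0.93
        for t in ("PARTY", "EFFECTIVE_DATE")
    )
    return macro_f1 >= 0.88 and high_recall

# Toy data for illustration only.
gold = [(0, "PARTY", 0, 8), (0, "PARTY", 20, 30), (0, "EFFECTIVE_DATE", 40, 50)]
pred = [(0, "PARTY", 0, 8), (0, "EFFECTIVE_DATE", 40, 50), (0, "EFFECTIVE_DATE", 60, 70)]
scores, macro_f1 = entity_prf(gold, pred)
print(scores, round(macro_f1, 3), meets_success_criteria(scores, macro_f1))
```

Including `clause_id` in each tuple keeps identical offsets in different clauses from being counted as the same entity.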
Constraints
- Inference should stay under 150 ms per clause on a single T4 GPU
- The solution must preserve original text offsets for downstream highlighting in the UI
- Training must fit within 16 GB VRAM
- No external API calls are allowed due to customer confidentiality
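One way to satisfy the offset-preservation constraint while still cleaning noisy input is to record, for every character of the cleaned text, its index in the original text. The sketch below handles only whitespace normalization, as a stand-in for broader OCR cleanup; the function names `normalize_with_offsets` and `to_original_span` are hypothetical, and a real pipeline would likely need further rules built on the same mapping idea.

```python
def normalize_with_offsets(text):
    """Collapse whitespace runs to single spaces and trim the ends.

    Returns (cleaned, mapping) where mapping[i] is the index in `text`
    of the character that produced cleaned[i].
    """
    cleaned, mapping = [], []
    prev_space = True  # starting True drops leading whitespace
    for i, ch in enumerate(text):
        if ch.isspace():
            if not prev_space:
                cleaned.append(" ")
                mapping.append(i)
            prev_space = True
        else:
            cleaned.append(ch)
            mapping.append(i)
            prev_space = False
    if cleaned and cleaned[-1] == " ":  # drop trailing whitespace
        cleaned.pop()
        mapping.pop()
    return "".join(cleaned), mapping

def to_original_span(mapping, start, end):
    """Map a non-empty [start, end) span in cleaned text back to the original text."""
    return mapping[start], mapping[end - 1] + 1

# Invented noisy input for illustration.
raw = "  12.1   Governing  Law:\tNew\n York  "
cleaned, mapping = normalize_with_offsets(raw)
s = cleaned.index("New York")
orig_start, orig_end = to_original_span(mapping, s, s + len("New York"))
print(repr(cleaned), repr(raw[orig_start:orig_end]))
```

The model sees the cleaned text, while the UI receives offsets that index into the text the customer uploaded.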
Requirements
- Define named entity recognition and explain how it applies to contract analysis.
- Build a token-level NER pipeline using a modern Python stack.
- Describe preprocessing for OCR noise, section markers, and offset preservation.
- Fine-tune a transformer-based model and justify your architecture choice.
- Evaluate the system using entity-level metrics and propose an error analysis plan.
- Show how predictions are converted from BIO tags into structured entity spans.
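For the last requirement, a minimal sketch of BIO decoding could look like the following. It assumes each token already carries its (start, end) character offsets into the original clause, and it treats a stray I- tag (one with no matching open entity) as the start of a new entity. That lenient policy is an assumption; a stricter decoder might discard such tags instead. The function name `bio_to_spans` and the example clause are illustrative.

```python
import re

def bio_to_spans(tags, offsets, text):
    """Convert per-token BIO tags into entity spans with character offsets.

    tags: list of strings such as "B-PARTY", "I-PARTY", "O".
    offsets: list of (start, end) character offsets, one per token.
    """
    spans = []
    current = None  # [entity_type, start, end] of the entity being built
    for tag, (start, end) in zip(tags, offsets):
        if tag == "O":
            if current:
                spans.append(current)
            current = None
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[2] = end  # extend the open entity
        else:
            # B- tag, or an I- tag that does not continue the open entity
            if current:
                spans.append(current)
            current = [tag[2:], start, end]
    if current:
        spans.append(current)
    return [
        {"type": t, "start": s, "end": e, "text": text[s:e]}
        for t, s, e in spans
    ]

# Invented example clause with whitespace tokens.
text = "Acme Corp shall pay within 30 days ."
offsets = [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]
tags = ["B-PARTY", "I-PARTY", "O", "O",
        "B-PAYMENT_TERM", "I-PAYMENT_TERM", "I-PAYMENT_TERM", "O"]
entities = bio_to_spans(tags, offsets, text)
print(entities)
```

With a subword tokenizer, the same function applies once tags have been reduced to one per word and the offsets come from the tokenizer's offset mapping.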