LexCore, a contract analytics platform, wants to automatically extract key entities from commercial agreements so legal ops teams can review obligations faster. You need to fine-tune a Hugging Face transformer model for a custom NER pipeline that identifies contract-specific entities.
The training corpus contains 48,000 annotated clauses from MSAs, NDAs, DPAs, and vendor agreements. Documents are in English, with occasional OCR noise from scanned PDFs. Clause length ranges from 20 to 900 tokens (median: 140). Labels follow the BIO tagging scheme over seven entity types: PARTY, EFFECTIVE_DATE, TERM, RENEWAL_NOTICE, GOVERNING_LAW, PAYMENT_TERM, and LIABILITY_CAP, with O marking tokens outside any entity. The label distribution is imbalanced: O dominates, while entities such as LIABILITY_CAP and RENEWAL_NOTICE appear in fewer than 4% of clauses.
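To make the label scheme concrete, here is a minimal sketch of how the BIO tag set could be enumerated and how word-level tags could be aligned to subword tokens. It assumes the annotations are stored as one tag per whitespace-level word, and that the tokenizer is a Hugging Face fast tokenizer, whose encodings expose a per-token `word_ids()` list with `None` for special tokens. The names `LABELS`, `LABEL2ID`, and `align_labels` are illustrative, and labeling only the first subword of each word is one common convention rather than a requirement of the task.

```python
from typing import Dict, List, Optional

# The seven entity types named in the corpus description.
ENTITY_TYPES = [
    "PARTY", "EFFECTIVE_DATE", "TERM", "RENEWAL_NOTICE",
    "GOVERNING_LAW", "PAYMENT_TERM", "LIABILITY_CAP",
]

# "O" plus a B- and an I- tag per entity type: 1 + 2 * 7 = 15 labels.
LABELS: List[str] = ["O"] + [f"{p}-{t}" for t in ENTITY_TYPES for p in ("B", "I")]
LABEL2ID: Dict[str, int] = {label: i for i, label in enumerate(LABELS)}
ID2LABEL: Dict[int, str] = {i: label for label, i in LABEL2ID.items()}

# -100 is the default ignore_index of PyTorch's cross-entropy loss,
# so positions carrying it do not contribute to the training loss.
IGNORE_INDEX = -100


def align_labels(
    word_ids: List[Optional[int]],
    word_labels: List[str],
    label2id: Dict[str, int] = LABEL2ID,
) -> List[int]:
    """Map word-level BIO tags onto subword tokens.

    The first subword of each word receives the word's label id; special
    tokens (word id None) and continuation subwords receive IGNORE_INDEX.
    """
    aligned: List[int] = []
    previous: Optional[int] = None
    for wid in word_ids:
        if wid is None:
            aligned.append(IGNORE_INDEX)          # special token such as [CLS]
        elif wid != previous:
            aligned.append(label2id[word_labels[wid]])  # first subword of a word
        else:
            aligned.append(IGNORE_INDEX)          # continuation subword
        previous = wid
    return aligned


# Hypothetical example: "Acme Corp shall indemnify", where "Acme" and
# "indemnify" are each split into two subwords by the tokenizer.
example_word_ids = [None, 0, 0, 1, 2, 3, 3, None]
example_labels = ["B-PARTY", "I-PARTY", "O", "O"]
example_aligned = align_labels(example_word_ids, example_labels)
```

In a real pipeline, `word_ids` would come from the tokenizer output for each clause, and `ID2LABEL` and `LABEL2ID` would typically be passed to the model configuration so predictions decode back to tag names. Clauses near the 900-token upper bound may exceed the context window of many encoder models, so some truncation or windowing strategy would also be needed; that is outside the scope of this sketch.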
A strong solution should achieve entity-level macro F1 >= 0.84, with F1 >= 0.90 on PARTY and EFFECTIVE_DATE, and stable performance on long clauses with nested legal phrasing.
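Because the targets are stated at the entity level, token accuracy is not the right yardstick: a predicted entity should count as correct only when both its span and its type match the gold annotation exactly. The sketch below shows one way to compute per-type and macro F1 from BIO sequences in plain Python. The function names are illustrative, a stray I- tag with no matching open span is treated as starting a new entity (an assumption, since the brief does not say how malformed sequences are scored), and the macro average here runs over the types that occur in the gold or predicted data, whereas a full evaluation would average over all seven types.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, entity_type)


def extract_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from one BIO-tagged sequence."""
    spans: Set[Span] = set()
    start, etype = None, None
    # A trailing "O" sentinel flushes a span that runs to the end.
    for i, tag in enumerate(list(tags) + ["O"]):
        prefix, _, label = tag.partition("-")
        continues = prefix == "I" and label == etype
        if not continues:
            if etype is not None:
                spans.add((start, i, etype))      # close the open span
            if prefix in ("B", "I"):
                start, etype = i, label           # open a new span
            else:
                start, etype = None, None
    return spans


def entity_f1(gold_seqs: List[List[str]], pred_seqs: List[List[str]]) -> Dict[str, float]:
    """Per-type F1, counting a prediction as correct only on exact span and type match."""
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for gold, pred in zip(gold_seqs, pred_seqs):
        g, p = extract_spans(gold), extract_spans(pred)
        for _, _, t in g & p:
            tp[t] += 1
        for _, _, t in p - g:
            fp[t] += 1
        for _, _, t in g - p:
            fn[t] += 1
    scores: Dict[str, float] = {}
    for t in sorted(set(tp) | set(fp) | set(fn)):
        denom = 2 * tp[t] + fp[t] + fn[t]
        scores[t] = 2 * tp[t] / denom if denom else 0.0
    return scores


def macro_f1(scores: Dict[str, float]) -> float:
    """Unweighted mean of the per-type F1 scores."""
    return sum(scores.values()) / len(scores) if scores else 0.0


def meets_targets(scores: Dict[str, float]) -> bool:
    """Check the thresholds stated in the brief: macro >= 0.84, key types >= 0.90."""
    key_types_ok = all(scores.get(t, 0.0) >= 0.90 for t in ("PARTY", "EFFECTIVE_DATE"))
    return macro_f1(scores) >= 0.84 and key_types_ok


# Toy example: the TERM span is predicted one token too short, so it
# counts as both a false positive and a false negative.
gold = [["B-PARTY", "I-PARTY", "O", "B-TERM", "I-TERM"]]
pred = [["B-PARTY", "I-PARTY", "O", "B-TERM", "O"]]
toy_scores = entity_f1(gold, pred)
```

Macro averaging gives rare types such as LIABILITY_CAP and RENEWAL_NOTICE the same weight as frequent ones, which is why the 0.84 target cannot be met by doing well on PARTY alone. Reporting the same scores separately for long clauses would be one way to check the stability requirement.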