Business Context
DocuShield is a legal-tech platform used by mid-market fintechs to review vendor and customer contracts before signature. The product processes ~120,000 contracts/month and powers an “AI redline assistant” that flags risky clauses (e.g., unlimited liability, non-standard indemnity, data residency conflicts). Today, reviewers complain that the model misses risks when key definitions appear early in the contract but the operative clause appears much later (e.g., a definition of “Confidential Information” in Section 1, but a carve-out in Section 14). These misses can lead to material financial exposure and regulatory non-compliance (GDPR/UK GDPR, SOC2 commitments).
The team is deciding whether to keep an existing BiLSTM-based classifier (trained on clause windows) or migrate to a Transformer-based approach that can better capture long-context dependencies.
Data Characteristics
- Volume: 2.4M labeled clause-level examples from 18 months of contracts.
- Task framing: Given a contract (or long excerpt) and a target clause span, predict a risk label for that clause.
- Labels (5-way): OK, Needs Review, High Risk, Missing Clause, Not Applicable.
- Length:
- Full contracts: 1,500–25,000 tokens (median ~6,200 tokens).
- Target clause spans: 40–400 tokens (median ~120 tokens).
- Long-range references: 20% of examples require resolving references >1,000 tokens away (definitions, exhibits, cross-references).
- Language: 97% English, 3% bilingual (English + short EU local-language annexes).
- Vocabulary: legal boilerplate, defined terms in ALL CAPS, section numbering, citations (e.g., “Section 12.3(b)”).
- Noise: OCR artifacts in ~8% of PDFs (hyphenation, broken headers/footers).
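The OCR noise and "Section 12.3(b)"-style numbering above invite a light normalization pass before tokenization. A heuristic sketch follows; the regexes (hyphenation repair, bare page-number footers, a `Section N.`-heading splitter) are illustrative assumptions and would need hardening on real contract PDFs:

```python
import re

def clean_ocr_text(text: str) -> str:
    """Heuristic cleanup for common OCR artifacts in contract PDFs.

    Assumes hyphenation appears as 'word-\\nrest' and that footers are
    short standalone lines such as page numbers.
    """
    # Re-join words hyphenated across line breaks ("Confiden-\ntial" -> "Confidential")
    text = re.sub(r"(\w+)-\n(\w+)", r"\1\2", text)
    # Drop bare page-number lines (a common footer artifact)
    text = re.sub(r"^\s*(?:Page\s+)?\d+\s*$", "", text, flags=re.MULTILINE)
    # Collapse the blank-line runs left behind
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text

def split_by_section(text: str) -> list[tuple[str, str]]:
    """Split a contract into (section_heading, body) pairs using the
    'Section 12.3(b)'-style numbering described above."""
    parts = re.split(r"(?m)^(Section\s+\d+(?:\.\d+)*(?:\([a-z]\))?\.?)", text)
    sections = []
    # re.split keeps the captured headings at odd indices
    for i in range(1, len(parts) - 1, 2):
        sections.append((parts[i].strip(), parts[i + 1].strip()))
    return sections
```

Keeping the section structure explicit at this stage is what later enables section-aware chunking and span alignment rather than blind fixed-size windows.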
Success Criteria
- Reduce “missed high-risk” findings by 30% in production audits.
- Offline: High Risk recall ≥ 0.92 and macro-F1 ≥ 0.80 on a temporally held-out test set.
- Maintain reviewer trust: provide evidence spans (what earlier text influenced the decision).
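The two offline gates above can be wired into a CI check. A dependency-free sketch, assuming the 5 label strings from the data section; the exact pass/fail logic is one plausible way to operationalize the criteria, not a prescribed one:

```python
LABELS = ["OK", "Needs Review", "High Risk", "Missing Clause", "Not Applicable"]

def evaluate(y_true, y_pred):
    """Per-class recall and macro-F1 over the 5 labels, plus the offline
    gates: High Risk recall >= 0.92 and macro-F1 >= 0.80."""
    stats = {}
    for label in LABELS:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        stats[label] = {"recall": recall, "f1": f1}
    macro_f1 = sum(s["f1"] for s in stats.values()) / len(LABELS)
    return {
        "macro_f1": macro_f1,
        "high_risk_recall": stats["High Risk"]["recall"],
        "pass": macro_f1 >= 0.80 and stats["High Risk"]["recall"] >= 0.92,
    }
```

Macro-F1 averages F1 over classes with equal weight, so rare labels like Missing Clause count as much as OK; this matters because the label distribution is almost certainly skewed toward OK.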
Constraints
- Latency: p95 < 250 ms per clause request (interactive redlining).
- Compute: single A10G GPU per service replica; batch size varies with traffic.
- Privacy: contracts cannot leave the VPC; no third-party APIs.
- Long context: must handle up to 8k tokens (stretch goal 16k) without catastrophic truncation.
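The 8k-token requirement interacts directly with the single-A10G budget: full self-attention materializes scores quadratic in sequence length, while sliding-window variants (Longformer-style) are linear. A back-of-envelope sketch; the head count, fp16, and per-layer framing are assumptions, and real memory use adds activations, projections, and optimizer state on top:

```python
def attention_memory_bytes(seq_len, num_heads=12, bytes_per_el=2, window=None):
    """Rough size of the attention-score tensor for one layer and one sequence.

    Full self-attention stores a (heads x seq x seq) score matrix; a
    sliding-window variant stores roughly (heads x seq x window).
    fp16 assumed (2 bytes/element).
    """
    cols = window if window is not None else seq_len
    return num_heads * seq_len * cols * bytes_per_el

# Full attention at 8k tokens vs a 512-token sliding window, per layer:
full_8k = attention_memory_bytes(8192)                  # ~1.6 GB
windowed_8k = attention_memory_bytes(8192, window=512)  # ~0.1 GB
```

Multiplied across 12+ layers and an interactive batch, the quadratic variant alone is untenable on a 24 GB A10G at 8k tokens, which is the concrete argument for sparse/windowed attention or hierarchical chunking in the proposal.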
Requirements (Deliverables)
- Explain, at an architectural level, how Transformers handle long-context dependencies vs LSTMs, including:
- path length / gradient flow,
- attention vs recurrence,
- compute/memory scaling,
- practical failure modes (truncation, attention dilution).
- Propose an end-to-end modeling approach for this task that can use long context (e.g., Longformer/BigBird, hierarchical Transformer, retrieval-augmented chunking).
- Define a preprocessing strategy for legal documents (tokenization, section-aware chunking, OCR cleanup, span alignment).
- Provide a training + evaluation plan that reflects temporal drift (new templates, new regulations).
- Include a minimal but realistic implementation sketch in Python using Hugging Face transformers (and optionally spaCy) that fine-tunes a long-context model and reports the key metrics.
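As a starting point for the final deliverable, a minimal fine-tuning sketch. Everything not fixed by the brief is an assumption: the allenai/longformer-base-4096 checkpoint (4,096-token context; the 8k/16k goal would need a different checkpoint or position-embedding extension), the hyperparameters, and the `load_clause_dataset` helper, which is hypothetical:

```python
import numpy as np

LABELS = ["OK", "Needs Review", "High Risk", "Missing Clause", "Not Applicable"]
label2id = {name: i for i, name in enumerate(LABELS)}
id2label = {i: name for name, i in label2id.items()}
MAX_LEN = 4096

def global_attention_mask(seq_len, span_start, span_end):
    """Longformer-style mask: global attention on position 0 ([CLS]) and on
    the target clause span; sliding-window attention everywhere else."""
    mask = [0] * seq_len
    mask[0] = 1
    for i in range(span_start, min(span_end, seq_len)):
        mask[i] = 1
    return mask

def compute_metrics(eval_pred):
    """Report accuracy plus the gating metric, High Risk recall."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    hr = label2id["High Risk"]
    hr_mask = labels == hr
    return {
        "accuracy": float((preds == labels).mean()),
        "high_risk_recall": float((preds[hr_mask] == hr).mean()) if hr_mask.any() else 0.0,
    }

def main():
    from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                              Trainer, TrainingArguments)

    tokenizer = AutoTokenizer.from_pretrained("allenai/longformer-base-4096")
    model = AutoModelForSequenceClassification.from_pretrained(
        "allenai/longformer-base-4096",
        num_labels=len(LABELS), id2label=id2label, label2id=label2id,
    )

    def encode(ex):
        enc = tokenizer(ex["text"], truncation=True,
                        padding="max_length", max_length=MAX_LEN)
        enc["global_attention_mask"] = global_attention_mask(
            MAX_LEN, ex["span_start"], ex["span_end"])
        enc["label"] = label2id[ex["label"]]
        return enc

    # load_clause_dataset is hypothetical; the eval split should be
    # temporally held out, per the success criteria.
    train_ds = [encode(ex) for ex in load_clause_dataset("train")]  # noqa: F821
    eval_ds = [encode(ex) for ex in load_clause_dataset("eval")]    # noqa: F821

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="out", learning_rate=3e-5,
                               per_device_train_batch_size=2,
                               gradient_accumulation_steps=8,
                               num_train_epochs=2, fp16=True),
        train_dataset=train_ds,
        eval_dataset=eval_ds,
        compute_metrics=compute_metrics,
    )
    trainer.train()
    print(trainer.evaluate())

if __name__ == "__main__":
    main()
```

Marking the target clause span with global attention is what lets the model tie an operative clause in Section 14 back to a definition in Section 1 without paying full quadratic attention. Batch size 2 with 8-step gradient accumulation is a conservative guess for a 24 GB A10G; profile before settling on it.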