Business Context
DocuShield is a legal-tech platform used by mid-market fintechs to review vendor and customer contracts before signature. The product processes ~120,000 contracts/month and powers an “AI redline assistant” that flags risky clauses (e.g., unlimited liability, non-standard indemnity, data residency conflicts). Today, reviewers complain that the model misses risks when key definitions appear early in the contract but the operative clause appears much later (e.g., a definition of “Confidential Information” in Section 1, but a carve-out in Section 14). These misses can lead to material financial exposure and regulatory non-compliance (GDPR/UK GDPR, SOC2 commitments).
The team is deciding whether to keep an existing BiLSTM-based classifier (trained on clause windows) or migrate to a Transformer-based approach that can better capture long-context dependencies.
Data Characteristics
- Volume: 2.4M labeled clause-level examples from 18 months of contracts.
- Task framing: Given a contract (or long excerpt) and a target clause span, predict a risk label for that clause.
- Labels (5-way): OK, Needs Review, High Risk, Missing Clause, Not Applicable.
- Length:
- Full contracts: 1,500–25,000 tokens (median ~6,200 tokens).
- Target clause spans: 40–400 tokens (median ~120 tokens).
- Long-range references: 20% of examples require resolving references >1,000 tokens away (definitions, exhibits, cross-references).
- Language: 97% English, 3% bilingual (English + short EU local-language annexes).
- Vocabulary: legal boilerplate, defined terms in ALL CAPS, section numbering, citations (e.g., “Section 12.3(b)”).
- Noise: OCR artifacts in ~8% of PDFs (hyphenation, broken headers/footers).
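The OCR noise and "Section 12.3(b)"-style numbering above invite a light normalization pass before tokenization. A heuristic sketch follows; the regexes (hyphenation repair, bare page-number footers, a `Section N.`-heading splitter) are illustrative assumptions and would need hardening on real contract PDFs:

```python
import re

def clean_ocr_text(text: str) -> str:
    """Heuristic cleanup for common OCR artifacts in contract PDFs.

    Assumes hyphenation appears as 'word-\\nrest' and that footers are
    short standalone lines such as page numbers.
    """
    # Re-join words hyphenated across line breaks ("Confiden-\ntial" -> "Confidential")
    text = re.sub(r"(\w+)-\n(\w+)", r"\1\2", text)
    # Drop bare page-number lines (a common footer artifact)
    text = re.sub(r"^\s*(?:Page\s+)?\d+\s*$", "", text, flags=re.MULTILINE)
    # Collapse the blank-line runs left behind
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text

def split_by_section(text: str) -> list[tuple[str, str]]:
    """Split a contract into (section_heading, body) pairs using the
    'Section 12.3(b)'-style numbering described above."""
    parts = re.split(r"(?m)^(Section\s+\d+(?:\.\d+)*(?:\([a-z]\))?\.?)", text)
    sections = []
    # re.split keeps the captured headings at odd indices
    for i in range(1, len(parts) - 1, 2):
        sections.append((parts[i].strip(), parts[i + 1].strip()))
    return sections
```

Keeping the section structure explicit at this stage is what later enables section-aware chunking and span alignment rather than blind fixed-size windows.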
Success Criteria
- Reduce “missed high-risk” findings by 30% in production audits.
- Offline: High Risk recall ≥ 0.92 and macro-F1 ≥ 0.80 on a temporally held-out test set.
- Maintain reviewer trust: provide evidence spans (what earlier text influenced the decision).
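The two offline gates above can be wired into a CI check. A dependency-free sketch, assuming the 5 label strings from the data section; the exact pass/fail logic is one plausible way to operationalize the criteria, not a prescribed one:

```python
LABELS = ["OK", "Needs Review", "High Risk", "Missing Clause", "Not Applicable"]

def evaluate(y_true, y_pred):
    """Per-class recall and macro-F1 over the 5 labels, plus the offline
    gates: High Risk recall >= 0.92 and macro-F1 >= 0.80."""
    stats = {}
    for label in LABELS:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        stats[label] = {"recall": recall, "f1": f1}
    macro_f1 = sum(s["f1"] for s in stats.values()) / len(LABELS)
    return {
        "macro_f1": macro_f1,
        "high_risk_recall": stats["High Risk"]["recall"],
        "pass": macro_f1 >= 0.80 and stats["High Risk"]["recall"] >= 0.92,
    }
```

Macro-F1 averages F1 over classes with equal weight, so rare labels like Missing Clause count as much as OK; this matters because the label distribution is almost certainly skewed toward OK.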
Constraints
- Latency: p95 < 250 ms per clause request (interactive redlining).
- Compute: single A10G GPU per service replica; batch size varies with traffic.
- Privacy: contracts cannot leave the VPC; no third-party APIs.
- Long context: must handle up to 8k tokens (stretch goal 16k) without catastrophic truncation.
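The 8k-token requirement interacts directly with the single-A10G budget: full self-attention materializes scores quadratic in sequence length, while sliding-window variants (Longformer-style) are linear. A back-of-envelope sketch; the head count, fp16, and per-layer framing are assumptions, and real memory use adds activations, projections, and optimizer state on top:

```python
def attention_memory_bytes(seq_len, num_heads=12, bytes_per_el=2, window=None):
    """Rough size of the attention-score tensor for one layer and one sequence.

    Full self-attention stores a (heads x seq x seq) score matrix; a
    sliding-window variant stores roughly (heads x seq x window).
    fp16 assumed (2 bytes/element).
    """
    cols = window if window is not None else seq_len
    return num_heads * seq_len * cols * bytes_per_el

# Full attention at 8k tokens vs a 512-token sliding window, per layer:
full_8k = attention_memory_bytes(8192)                  # ~1.6 GB
windowed_8k = attention_memory_bytes(8192, window=512)  # ~0.1 GB
```

Multiplied across 12+ layers and an interactive batch, the quadratic variant alone is untenable on a 24 GB A10G at 8k tokens, which is the concrete argument for sparse/windowed attention or hierarchical chunking in the proposal.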
Requirements (Deliverables)
- Explain, at an architectural level, how Transformers handle long-context dependencies vs LSTMs, including:
- path length / gradient flow,
- attention vs recurrence,
- compute/memory scaling,
- practical failure modes (truncation, attention dilution).
- Propose an end-to-end modeling approach for this task that can use long context (e.g., Longformer/BigBird, hierarchical Transformer, retrieval-augmented chunking).
- Define a preprocessing strategy for legal documents (tokenization, section-aware chunking, OCR cleanup, span alignment).
- Provide a training + evaluation plan that reflects temporal drift (new templates, new regulations).
- Include a minimal but realistic implementation sketch in Python using Hugging Face transformers (and optionally spaCy) that fine-tunes a long-context model and reports the key metrics.
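As a starting point for the final deliverable, a minimal fine-tuning sketch. Everything not fixed by the brief is an assumption: the allenai/longformer-base-4096 checkpoint (4,096-token context; the 8k/16k goal would need a different checkpoint or position-embedding extension), the hyperparameters, and the `load_clause_dataset` helper, which is hypothetical:

```python
import numpy as np

LABELS = ["OK", "Needs Review", "High Risk", "Missing Clause", "Not Applicable"]
label2id = {name: i for i, name in enumerate(LABELS)}
id2label = {i: name for name, i in label2id.items()}
MAX_LEN = 4096

def global_attention_mask(seq_len, span_start, span_end):
    """Longformer-style mask: global attention on position 0 ([CLS]) and on
    the target clause span; sliding-window attention everywhere else."""
    mask = [0] * seq_len
    mask[0] = 1
    for i in range(span_start, min(span_end, seq_len)):
        mask[i] = 1
    return mask

def compute_metrics(eval_pred):
    """Report accuracy plus the gating metric, High Risk recall."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    hr = label2id["High Risk"]
    hr_mask = labels == hr
    return {
        "accuracy": float((preds == labels).mean()),
        "high_risk_recall": float((preds[hr_mask] == hr).mean()) if hr_mask.any() else 0.0,
    }

def main():
    from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                              Trainer, TrainingArguments)

    tokenizer = AutoTokenizer.from_pretrained("allenai/longformer-base-4096")
    model = AutoModelForSequenceClassification.from_pretrained(
        "allenai/longformer-base-4096",
        num_labels=len(LABELS), id2label=id2label, label2id=label2id,
    )

    def encode(ex):
        enc = tokenizer(ex["text"], truncation=True,
                        padding="max_length", max_length=MAX_LEN)
        enc["global_attention_mask"] = global_attention_mask(
            MAX_LEN, ex["span_start"], ex["span_end"])
        enc["label"] = label2id[ex["label"]]
        return enc

    # load_clause_dataset is hypothetical; the eval split should be
    # temporally held out, per the success criteria.
    train_ds = [encode(ex) for ex in load_clause_dataset("train")]  # noqa: F821
    eval_ds = [encode(ex) for ex in load_clause_dataset("eval")]    # noqa: F821

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="out", learning_rate=3e-5,
                               per_device_train_batch_size=2,
                               gradient_accumulation_steps=8,
                               num_train_epochs=2, fp16=True),
        train_dataset=train_ds,
        eval_dataset=eval_ds,
        compute_metrics=compute_metrics,
    )
    trainer.train()
    print(trainer.evaluate())

if __name__ == "__main__":
    main()
```

Marking the target clause span with global attention is what lets the model tie an operative clause in Section 14 back to a definition in Section 1 without paying full quadratic attention. Batch size 2 with 8-step gradient accumulation is a conservative guess for a 24 GB A10G; profile before settling on it.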