Process Client Documents for Intake

Business Context

ClearDocs, a SaaS platform for insurance brokers, wants to automate document intake in a client-facing portal. Customers upload PDFs, scans, and email attachments, and the product must classify each document's type and extract its key fields before routing it to downstream workflows.

Data

The system receives approximately 80,000 documents per day across forms, invoices, contracts, IDs, and proof-of-address files. Documents range from 1 to 20 pages, with OCR text lengths from 30 to 4,000 tokens. Most documents are in English (88%), with smaller volumes in Spanish (9%) and French (3%). Labels are moderately imbalanced: proof-of-identity documents and invoices are common, while signed policy endorsements and handwritten exception forms are rare. OCR quality varies significantly because of mobile photos, skew, blur, stamps, and missing pages.
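To make the OCR noise concrete, here is a minimal cleanup sketch using only the Python standard library. The function name clean_ocr_text and the specific rules (Unicode normalization, control-character removal, rejoining words hyphenated across line breaks, whitespace collapsing) are illustrative assumptions, not part of the problem statement; a real pipeline would tune them against sample documents.

```python
import re
import unicodedata

# Hypothetical cleanup rules; each pattern is an assumption about typical OCR debris.
_CONTROL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")  # keeps tab, newline, CR
_HYPHEN_BREAK = re.compile(r"(\w)-\n(\w)")               # "num-\nber" -> "number"
_SPACES = re.compile(r"[ \t]+")
_BLANK_LINES = re.compile(r"\n{3,}")


def clean_ocr_text(raw: str) -> str:
    """Normalize OCR output before tokenization (illustrative sketch)."""
    # NFKC folds ligatures and full-width forms but keeps accented letters,
    # so Spanish and French text is preserved.
    text = unicodedata.normalize("NFKC", raw)
    text = _CONTROL.sub("", text)
    # Rejoin words split by a hyphen at a line break.
    text = _HYPHEN_BREAK.sub(r"\1\2", text)
    text = _SPACES.sub(" ", text)
    # Trim each line, then cap consecutive blank lines at one.
    text = "\n".join(line.strip() for line in text.split("\n"))
    text = _BLANK_LINES.sub("\n\n", text)
    return text.strip()


if __name__ == "__main__":
    sample = "Policy  num-\nber:\t\ufb01le 12\x0c\n\n\n\nTotal"
    print(repr(clean_ocr_text(sample)))
```

One caveat with this sketch: the hyphen rule also joins genuinely hyphenated words that happen to wrap at a line end, so it trades a few wrong joins for fewer broken tokens.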

Success Criteria

A production-ready solution should achieve at least 90% macro-F1 on document type classification and at least 92% field-level F1 on critical entities such as customer name, policy number, invoice total, effective date, and address. End-to-end p95 latency should be at most 2 seconds per document.
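Because macro-F1 weights every class equally, rare classes such as policy endorsements pull the score down as much as common ones. The sketch below shows one way to compute macro-F1 and a nearest-rank p95 latency with the standard library only; the function names and the toy labels are assumptions for illustration.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in either list."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1
            fn[truth] += 1
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


def p95(latencies_s):
    """Nearest-rank 95th percentile of per-document latencies in seconds."""
    if not latencies_s:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_s)
    rank = (95 * len(ordered) + 99) // 100  # integer ceiling of 0.95 * n
    return ordered[rank - 1]


if __name__ == "__main__":
    y_true = ["invoice", "invoice", "id", "endorsement"]
    y_pred = ["invoice", "id", "id", "invoice"]
    print(f"macro-F1: {macro_f1(y_true, y_pred):.3f}")
    print(f"p95 latency: {p95([0.5] * 19 + [3.0])} s")
```

In the toy example, accuracy is 50% but macro-F1 is lower because the single endorsement document is missed entirely, which is the behavior the 90% target is meant to penalize.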

Constraints

Must run in a secure environment; no external API calls for inference
Must support noisy OCR text and partial documents
Predictions must be explainable enough for client support teams to review
The first release should fit on a single GPU-backed service with CPU fallback
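The first and last constraints can be sketched at service startup as follows. This is a hedged illustration that assumes PyTorch and Hugging Face transformers are the serving stack, which the problem statement does not mandate; pick_device is a hypothetical helper. HF_HUB_OFFLINE and TRANSFORMERS_OFFLINE are environment variables those libraries read to avoid contacting the model hub, so models would need to be loaded from local files.

```python
import os

# Assumption: models are pre-downloaded into the secure environment, so the
# Hugging Face libraries are told not to reach the hub at load time.
os.environ.setdefault("HF_HUB_OFFLINE", "1")
os.environ.setdefault("TRANSFORMERS_OFFLINE", "1")


def pick_device(prefer_gpu: bool = True) -> str:
    """Return "cuda" when a GPU is usable, otherwise fall back to "cpu"."""
    if not prefer_gpu:
        return "cpu"
    try:
        import torch  # assumed dependency; absent installs fall back to CPU
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


if __name__ == "__main__":
    print(f"serving on: {pick_device()}")
```

Setting these variables only covers model loading; blocking all outbound traffic from the inference service would still be a network-level control.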

Requirements

Design an NLP pipeline for OCR cleanup, document classification, and key-field extraction.
Propose a modern Python implementation using transformers and realistic preprocessing.
Explain how you would handle multilingual text, OCR noise, and class imbalance.
Define offline and online evaluation metrics.
Describe how the output would be exposed in a client-facing product, including confidence scores and human review triggers.
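As one possible shape for the last requirement, the sketch below turns model outputs into a portal payload with confidence scores and a human-review flag. The thresholds, field names, Prediction class, and payload keys are all assumptions chosen for illustration; in practice thresholds would be calibrated on held-out data per document type and field.

```python
from dataclasses import dataclass

# Hypothetical thresholds; real values would come from calibration.
DOC_TYPE_THRESHOLD = 0.85
FIELD_THRESHOLD = 0.90


@dataclass
class Prediction:
    doc_type: str
    doc_type_confidence: float
    fields: dict  # field name -> (extracted value, confidence)


def review_reasons(pred: Prediction) -> list:
    """List human-readable reasons a support reviewer should look at this document."""
    reasons = []
    if pred.doc_type_confidence < DOC_TYPE_THRESHOLD:
        reasons.append(
            f"low document-type confidence ({pred.doc_type_confidence:.2f})"
        )
    for name, (_, confidence) in pred.fields.items():
        if confidence < FIELD_THRESHOLD:
            reasons.append(f"low confidence for {name} ({confidence:.2f})")
    return reasons


def to_portal_payload(pred: Prediction) -> dict:
    """Build the response shown in the client portal (illustrative schema)."""
    reasons = review_reasons(pred)
    return {
        "doc_type": pred.doc_type,
        "doc_type_confidence": round(pred.doc_type_confidence, 2),
        "fields": {
            name: {"value": value, "confidence": round(confidence, 2)}
            for name, (value, confidence) in pred.fields.items()
        },
        "needs_review": bool(reasons),
        "review_reasons": reasons,
    }


if __name__ == "__main__":
    pred = Prediction(
        doc_type="invoice",
        doc_type_confidence=0.97,
        fields={
            "invoice_total": ("1,250.00", 0.62),
            "customer_name": ("Ana Ruiz", 0.95),
        },
    )
    print(to_portal_payload(pred))
```

Returning the reasons as plain strings keeps the review trigger explainable to client support teams, at the cost of a looser contract than an enumerated reason code.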
