Business Context
Northstar Bank wants an internal assistant that answers employee questions using company policies, SOPs, and compliance manuals. Design a Retrieval-Augmented Generation (RAG) system that can be deployed in a regulated enterprise environment and return grounded answers with citations.
Data
- Corpus size: ~2.5 million documents across PDFs, Word files, HTML pages, wiki articles, and ticket attachments
- Text length: 1 paragraph to 200+ pages; many documents contain tables, headers, footers, and repeated boilerplate
- Language: 94% English, 6% in other languages
- Freshness: ~15,000 document updates per day
- Label availability: No supervised QA labels initially; only click logs, search logs, and SME review are available
Success Criteria
- Recall@5 high enough that SMEs judge at least 90% of answerable queries to have supporting evidence in the retrieved set
- End-to-end p95 latency under 2.5 seconds for interactive use
- Answers must include source citations and abstain when evidence is weak
- System must support access control, audit logging, and safe deployment in a private environment
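The citation and abstention criteria above can be sketched as a simple evidence gate. This is an illustrative sketch only: the `RetrievedChunk` type, the thresholds `MIN_TOP_SCORE` and `MIN_SUPPORT`, and the helper names are all assumptions, and real thresholds would be tuned against SME-labeled queries.

```python
# Hypothetical abstention gate: generate an answer only when retrieved
# evidence is strong enough, and always surface deduplicated citations.
# Thresholds below are illustrative placeholders, not tuned values.
from dataclasses import dataclass

@dataclass
class RetrievedChunk:
    doc_id: str
    text: str
    score: float  # retriever similarity score; higher is better

MIN_TOP_SCORE = 0.55   # assumed: a chunk counts as support above this score
MIN_SUPPORT = 2        # assumed: require at least two supporting chunks

def should_answer(chunks: list[RetrievedChunk]) -> bool:
    """True only if enough chunks clear the support threshold."""
    strong = [c for c in chunks if c.score >= MIN_TOP_SCORE]
    return len(strong) >= MIN_SUPPORT

def format_citations(chunks: list[RetrievedChunk]) -> list[str]:
    """Deduplicated source document IDs, rendered alongside the answer."""
    seen: set[str] = set()
    out: list[str] = []
    for c in chunks:
        if c.doc_id not in seen:
            seen.add(c.doc_id)
            out.append(c.doc_id)
    return out
```

When `should_answer` returns False, the system would return a safe fallback ("I could not find supporting policy text") instead of generating, which is also the behavior the evaluation plan should test.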
Constraints
- No sensitive documents may leave the enterprise VPC
- Must enforce document-level ACLs at retrieval time
- Should run on a modest GPU budget and degrade gracefully to CPU-heavy retrieval
- Hallucinations are unacceptable for compliance-related answers
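The document-level ACL constraint can be made concrete with a small sketch. Assumed here: each indexed chunk carries an `allowed groups` metadata set; in a production vector store this check would be pushed down as a metadata filter inside the search call rather than applied post hoc, so that unauthorized documents never enter the candidate set.

```python
# Minimal sketch of document-level ACL enforcement at retrieval time.
# The candidate dict shape ({"id": ..., "groups": ...}) is an assumption
# for illustration; real stores expose native metadata filtering.

def user_can_read(user_groups: set[str], doc_groups: set[str]) -> bool:
    """A user may read a document if they share at least one group with it."""
    return bool(user_groups & doc_groups)

def acl_filter(candidates: list[dict], user_groups: set[str]) -> list[dict]:
    """Drop candidates the user is not entitled to see before reranking."""
    return [c for c in candidates if user_can_read(user_groups, c["groups"])]
```

Filtering before reranking (rather than after generation) matters for both latency and safety: restricted text never reaches the reranker or the prompt.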
Requirements
- Design the ingestion, chunking, embedding, indexing, retrieval, reranking, and generation pipeline.
- Explain preprocessing for noisy enterprise documents, including OCR, deduplication, metadata extraction, and chunking strategy.
- Propose a modern Python implementation using transformers, sentence-transformers, and a vector index.
- Describe how you would handle ACL filtering, monitoring, prompt grounding, caching, and document refresh.
- Define an evaluation plan for retrieval quality, answer faithfulness, latency, and safe fallback behavior.
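The embed-index-retrieve portion of the requested pipeline can be sketched end to end. To stay dependency-free, the sketch below substitutes a hashed bag-of-words vector for a sentence-transformer embedding and a brute-force cosine scan for a FAISS index; the class and function names are illustrative, and a real implementation would keep the same interface while swapping in `sentence-transformers` for `embed` and a FAISS `IndexFlatIP` (or equivalent) for `BruteForceIndex`.

```python
# Toy retrieval pipeline: hashed bag-of-words "embeddings" plus brute-force
# cosine search. Stands in for sentence-transformers + FAISS in production.
import math
import re
from collections import Counter

DIM = 256  # hashed feature dimension (illustrative choice)

def embed(text: str) -> list[float]:
    """L2-normalized hashed bag-of-words vector (sentence-transformer stand-in)."""
    vec = [0.0] * DIM
    for tok, cnt in Counter(re.findall(r"\w+", text.lower())).items():
        vec[hash(tok) % DIM] += cnt
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

class BruteForceIndex:
    """Exact cosine-similarity search; a FAISS index would replace this."""
    def __init__(self) -> None:
        self.items: list[tuple[str, list[float]]] = []

    def add(self, doc_id: str, text: str) -> None:
        self.items.append((doc_id, embed(text)))

    def search(self, query: str, k: int = 5) -> list[tuple[str, float]]:
        q = embed(query)
        scored = [(sum(a * b for a, b in zip(q, v)), d) for d, v in self.items]
        scored.sort(reverse=True)
        return [(d, s) for s, d in scored[:k]]
```

Because vectors are normalized, the inner product is cosine similarity, which mirrors how normalized sentence-transformer embeddings are typically searched with an inner-product index; ACL filtering, reranking, and grounded generation would sit downstream of `search`.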