Business Context
AcmeCloud wants to connect an internal LLM assistant to company data sources so employees can ask questions about product docs, support runbooks, API references, and policy manuals. The goal is to reduce hallucinations and return grounded answers with citations.
Data
- Sources: Confluence pages, PDFs, HTML docs, support articles, and internal markdown repositories
- Volume: ~2 million documents, updated daily
- Text length: 50-5,000 words per document; many documents contain tables, code blocks, and headings
- Language: Primarily English; roughly 10% of documents are in other languages
- Labels available: Weak supervision only (click logs, thumbs-up/down, and a small set of curated Q&A pairs)
Success Criteria
A good solution should improve answer grounding and freshness while keeping median end-to-end latency (retrieval plus generation) under 2 seconds. Target at least 85% answer relevance on a human-reviewed benchmark, with clear source attribution for every response.
Constraints
- Sensitive data must stay in the company VPC
- Incremental indexing is required for daily document updates (a hash-diff sketch follows this list)
- The system must support metadata filtering by team, product, and access level
- Budget allows one medium embedding model and one instruction-tuned generator in production
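One way to meet the incremental-indexing constraint is to fingerprint each cleaned document and re-embed only what changed between daily runs. A minimal sketch, assuming doc IDs are stable across crawls and the previous run's hashes are persisted somewhere (both assumptions, not given in the brief):

```python
import hashlib

def content_hash(text: str) -> str:
    # Stable fingerprint of the cleaned document text.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def diff_corpus(previous: dict[str, str], current_docs: dict[str, str]):
    """Compare stored hashes against today's crawl.

    previous: doc_id -> hash from the last indexing run
    current_docs: doc_id -> full cleaned text from today's crawl
    Returns (doc_ids to re-embed and upsert, doc_ids to delete).
    """
    current_hashes = {d: content_hash(t) for d, t in current_docs.items()}
    to_upsert = [d for d, h in current_hashes.items() if previous.get(d) != h]
    to_delete = [d for d in previous if d not in current_hashes]
    return to_upsert, to_delete

# Example: one changed page, one unchanged page, one removed page.
prev = {"doc-1": content_hash("old text"),
        "doc-2": content_hash("unchanged"),
        "doc-3": content_hash("removed page")}
today = {"doc-1": "new text", "doc-2": "unchanged"}
print(diff_corpus(prev, today))  # (['doc-1'], ['doc-3'])
```

Only the docs in the first list are re-chunked and re-embedded, which keeps the daily indexing cost proportional to churn rather than to the full ~2M-document corpus.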
Requirements
- Design an end-to-end retrieval-augmented generation (RAG) pipeline connecting the LLM to external data sources.
- Define document ingestion, cleaning, chunking, embedding, indexing, and retrieval steps.
- Explain how you would handle structured artifacts such as tables, code snippets, and duplicated content (see the segmentation and dedup sketch after this list).
- Propose a modern Python implementation using transformers, sentence embeddings, and a vector database (a minimal pipeline sketch follows this list).
- Describe how you would evaluate retrieval quality, answer quality, citation accuracy, and latency (a recall@k / MRR sketch is included below).
- Include strategies for access control, prompt construction, and failure handling when retrieval is weak (wired together in the final sketch below).
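To make the pipeline and implementation asks concrete, here is a minimal indexing-and-retrieval sketch. The embedding model (`all-MiniLM-L6-v2`), the Qdrant store, the internal host URL, the collection name, and the chunk sizes are all assumptions standing in for whatever the budget and VPC setup actually allow:

```python
import uuid

from sentence_transformers import SentenceTransformer
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # assumed model
client = QdrantClient(url="http://qdrant.internal:6333")  # hypothetical in-VPC host

# Drops any existing collection: fine for a demo, not for the incremental
# production index (which would upsert/delete based on the hash diff above).
client.recreate_collection(
    collection_name="acme_docs",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)

def chunk(text: str, max_words: int = 200, overlap: int = 40) -> list[str]:
    # Naive fixed-size word windows; a real pipeline would split on headings first.
    words = text.split()
    step = max_words - overlap
    return [" ".join(words[i:i + max_words])
            for i in range(0, max(len(words) - overlap, 1), step)]

def index_document(doc_id: str, text: str, metadata: dict) -> None:
    chunks = chunk(text)
    vectors = model.encode(chunks, normalize_embeddings=True)
    points = [
        PointStruct(
            # Qdrant ids must be ints or UUIDs; derive a stable UUID per chunk.
            id=str(uuid.uuid5(uuid.NAMESPACE_URL, f"{doc_id}:{i}")),
            vector=vec.tolist(),
            payload={**metadata, "doc_id": doc_id, "chunk": i, "text": chunks[i]},
        )
        for i, vec in enumerate(vectors)
    ]
    client.upsert(collection_name="acme_docs", points=points)

def retrieve(query: str, k: int = 5):
    qvec = model.encode(query, normalize_embeddings=True).tolist()
    return client.search(collection_name="acme_docs", query_vector=qvec, limit=k)

index_document("kb-101", "How to rotate an API key. Go to Settings > Keys ...",
               {"team": "platform", "product": "api", "access_level": "internal"})
print(retrieve("rotate api key")[0].payload["doc_id"])  # -> kb-101
```

Storing team, product, and access_level in the point payload is what later enables the metadata filtering the constraints require.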
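For tables, code snippets, and duplicates, one workable tactic is to segment documents so fenced blocks are never cut in half, and to drop exact duplicates with a normalized content hash. A stdlib-only sketch; near-duplicate detection (e.g., MinHash) would be the natural extension, not shown here:

```python
import hashlib
import re

FENCE = re.compile(r"(```.*?```)", re.DOTALL)  # fenced code/table blocks, kept whole

def split_preserving_code(markdown: str) -> list[str]:
    """Split markdown into segments, keeping each fenced block as one segment."""
    segments = []
    for part in FENCE.split(markdown):
        if part.startswith("```"):
            segments.append(part)  # code or table block: never split mid-block
        else:
            # Plain prose: split on blank lines into paragraph-level pieces.
            segments.extend(p for p in re.split(r"\n\s*\n", part) if p.strip())
    return segments

def dedupe(segments: list[str]) -> list[str]:
    """Drop exact duplicates after whitespace/case normalization."""
    seen, unique = set(), []
    for seg in segments:
        key = hashlib.sha256(" ".join(seg.lower().split()).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(seg)
    return unique

doc = "Intro text.\n\n```python\nprint('hi')\n```\n\nIntro text.\n\nMore prose."
print(dedupe(split_preserving_code(doc)))
# ['Intro text.', "```python\nprint('hi')\n```", 'More prose.']
```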
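The curated Q&A pairs make retrieval quality cheap to measure with recall@k and MRR. A sketch, assuming each gold pair maps a question to the doc_id that answers it and that the retriever returns ranked doc IDs (a hypothetical call shape, not the ScoredPoint objects from the pipeline above):

```python
def recall_at_k(gold: list[tuple[str, str]], retrieve_ids, k: int = 5) -> float:
    """Fraction of questions whose gold doc appears in the top-k results."""
    hits = sum(1 for question, gold_doc in gold
               if gold_doc in retrieve_ids(question)[:k])
    return hits / len(gold)

def mrr(gold: list[tuple[str, str]], retrieve_ids, k: int = 10) -> float:
    """Mean reciprocal rank of the gold doc within the top-k results."""
    total = 0.0
    for question, gold_doc in gold:
        ranked = retrieve_ids(question)[:k]
        if gold_doc in ranked:
            total += 1.0 / (ranked.index(gold_doc) + 1)
    return total / len(gold)

# Toy gold set and a stub retriever, just to show the call shape.
gold_pairs = [("how do I rotate an API key?", "doc-42")]
stub = lambda q: ["doc-7", "doc-42", "doc-3"]
print(recall_at_k(gold_pairs, stub), mrr(gold_pairs, stub))  # 1.0 0.5
```

Answer quality and citation accuracy still need human review against the 85% relevance target; these automated metrics are the cheap regression signal run on every index rebuild.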
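The last bullet's three pieces can be wired together in one query path: push the caller's access metadata into the vector search as a payload filter, build a prompt that forces numbered citations, and fall back to an honest refusal when the best similarity score is below a floor. This continues the Qdrant/model assumptions from the pipeline sketch; `generate` and the 0.35 floor are placeholders:

```python
from qdrant_client.models import Filter, FieldCondition, MatchAny, MatchValue

MIN_SCORE = 0.35  # illustrative cosine-similarity floor; tune on the gold set

def answer(query: str, user_teams: list[str], k: int = 5) -> str:
    qvec = model.encode(query, normalize_embeddings=True).tolist()
    hits = client.search(
        collection_name="acme_docs",
        query_vector=qvec,
        limit=k,
        query_filter=Filter(must=[
            # Only return chunks the caller is allowed to see.
            FieldCondition(key="team", match=MatchAny(any=user_teams)),
            FieldCondition(key="access_level", match=MatchValue(value="internal")),
        ]),
    )
    hits = [h for h in hits if h.score >= MIN_SCORE]
    if not hits:
        # Weak retrieval: refuse to guess rather than hallucinate.
        return ("I couldn't find a relevant source for that. "
                "Try rephrasing or check the docs portal.")
    context = "\n\n".join(f"[{i + 1}] {h.payload['text']}"
                          for i, h in enumerate(hits))
    prompt = (
        "Answer using ONLY the sources below. Cite each claim as [n]. "
        "If the sources do not answer the question, say so.\n\n"
        f"Sources:\n{context}\n\nQuestion: {query}\nAnswer:"
    )
    return generate(prompt)  # hypothetical call to the instruction-tuned generator
```

Filtering at retrieval time, rather than after generation, keeps restricted text out of the prompt entirely, which is the safer default for the VPC and access-level constraints.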