Business Context
You’re interviewing for an NLP Engineer role at MercuryPay, a fintech app with 18M monthly active users that offers checking accounts, debit cards, and international transfers. MercuryPay’s customer support team handles ~220K tickets/day across chat and email. Many tickets require referencing fast-changing internal policy docs (fee schedules, dispute rules, compliance playbooks). The company wants to deploy a Retrieval-Augmented Generation (RAG) assistant that drafts accurate, policy-grounded responses and reduces average handle time by 30%, while meeting strict financial compliance requirements.
Unlike a pure LLM chatbot, this assistant must cite the exact policy passages used, avoid hallucinating fees/limits, and gracefully escalate when the knowledge base doesn’t contain an answer.
Data Characteristics
MercuryPay maintains:
- Knowledge base (KB): ~85,000 documents (HTML/PDF/Markdown), ~14 GB of text after extraction.
- Document length: 200–12,000 tokens (median ~1,100)
- Update frequency: 5–10% of docs change weekly (new promotions, regulatory updates)
- Domain vocabulary: NACHA returns, chargebacks, KYC, AML, MCC codes, interchange, SWIFT/IBAN
- Ticket stream: 220K/day
- User message length: 5–600 words (median ~55)
- Languages: English 88%, Spanish 7%, Portuguese 3%, other 2%
- PII: names, emails, phone numbers, last-4 of SSN, bank account fragments
- Ground truth: 1.5M historical tickets with final agent responses; only ~35% have clean links to the KB articles used.
Success Criteria
A “good” RAG system must:
- Achieve ≥80% “grounded answer rate” (answer supported by retrieved passages) on an internal evaluation set (a minimal metric sketch follows this list).
- Reduce policy-related escalations by 25% without increasing compliance incidents.
- Provide p95 latency ≤ 1.2s for retrieval + generation at 50 QPS.
- Produce responses that include citations (doc id + section heading) and a confidence / escalation decision.
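To make the grounded-answer-rate target measurable, here is a minimal sketch of the metric. The `EvalItem` shape and the `is_supported` judge are assumptions for illustration: in practice the judge would be an NLI model or LLM grader that checks each factual claim against the retrieved passages, not the substring placeholder used here.

```python
from dataclasses import dataclass

@dataclass
class EvalItem:
    answer: str          # generated response
    passages: list[str]  # passages retrieved for the same query

def is_supported(answer: str, passages: list[str]) -> bool:
    """Placeholder judge. In practice: an NLI model or LLM grader that
    verifies every factual claim in `answer` against `passages`."""
    return any(answer.lower() in p.lower() for p in passages)

def grounded_answer_rate(items: list[EvalItem]) -> float:
    """Fraction of answers fully supported by their retrieved passages."""
    if not items:
        return 0.0
    return sum(is_supported(it.answer, it.passages) for it in items) / len(items)
```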
Constraints
- No PII may be stored in the vector DB; queries must be redacted before indexing or logging (see the redaction sketch after this list).
- Must run in a regulated environment: all prompts, retrieved passages, and outputs are auditable for 7 years.
- Model budget: one A10/T4-class GPU per service replica; the embedding model must run on CPU or a small GPU.
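A minimal sketch of the redaction contract, assuming regex patterns for the PII classes listed under Data Characteristics. The patterns and the `redact` helper are illustrative only; a production system would layer a trained PII detector on top, but the invariant is the same: nothing is indexed or logged before this pass runs.

```python
import re

# Illustrative patterns only; a production system would pair these
# with a trained PII detector before indexing or logging anything.
PII_PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "PHONE": re.compile(r"\b(?:\+?1[-. ]?)?\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}\b"),
    "SSN_LAST4": re.compile(r"\b(?:ssn|social)\D{0,10}\d{4}\b", re.IGNORECASE),
    "ACCOUNT": re.compile(r"\b\d{8,17}\b"),  # bank account fragments
}

def redact(text: str) -> str:
    """Replace PII spans with typed placeholders before indexing/logging."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(redact("Customer jane@x.com called from 415-555-0142 about acct 12345678."))
# -> "Customer [EMAIL] called from [PHONE] about acct [ACCOUNT]."
```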
Requirements (Deliverables)
- Explain RAG end-to-end: ingestion → chunking → embeddings → vector index → retrieval → prompt construction → generation → citations.
- Design a chunking and indexing strategy for long policy docs, including how you handle tables and headings (a heading-aware chunking sketch follows this list).
- Propose a retrieval approach (dense, sparse, hybrid) and justify it for fintech policy text.
- Implement a minimal RAG pipeline in Python: ingest a small KB, build an index, retrieve top-k, and generate an answer with citations (a hybrid retrieval and prompt-construction sketch closes this section).
- Define an evaluation plan: offline metrics (retrieval + generation), human review rubrics, and production monitoring.
- Describe at least 3 failure modes (e.g., stale docs, semantic drift, prompt injection) and mitigations.
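As a starting point for the chunking deliverable, here is a minimal heading-aware splitter for Markdown policy docs. `chunk_markdown`, the 350-token budget, and the whitespace token proxy are all assumptions for illustration; the key properties are that chunks never cross heading boundaries and each chunk carries its heading path, so citations can show `doc_id + section heading`.

```python
import re

def chunk_markdown(doc_id: str, text: str, max_tokens: int = 350) -> list[dict]:
    """Split a Markdown doc into chunks that respect heading boundaries.

    Each chunk records its heading path for `doc_id + section` citations.
    Token counts are approximated by whitespace splitting; swap in a real
    tokenizer for production.
    """
    chunks, buf, heading_path = [], [], []

    def flush():
        if buf:
            chunks.append({
                "doc_id": doc_id,
                "section": " > ".join(heading_path) or "(root)",
                "text": "\n".join(buf),
            })
            buf.clear()

    for line in text.splitlines():
        m = re.match(r"^(#{1,6})\s+(.*)", line)
        if m:                       # new heading: close the current chunk
            flush()
            level = len(m.group(1))
            heading_path[:] = heading_path[:level - 1] + [m.group(2).strip()]
            continue
        buf.append(line)
        # Oversized section: flush at the budget, but never on a Markdown
        # table row, so tables stay inside a single chunk.
        size = sum(len(l.split()) for l in buf)
        if size >= max_tokens and not line.lstrip().startswith("|"):
            flush()

    flush()
    return chunks
```

Tables are the hard case: the sketch merely avoids flushing mid-table, but long fee tables can exceed any budget, and a common fallback is to serialize each row together with its header so fee values stay attached to their column names.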
Your answer should be practical: assume you will ship an MVP in 6–8 weeks, then iterate.
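Finally, to ground the hybrid-retrieval and pipeline deliverables, here is a compact end-to-end sketch over the chunks produced above. It fuses BM25 (via the third-party `rank_bm25` package) with dense cosine scores using reciprocal rank fusion; the `embed` function is a toy hashed bag-of-words placeholder for a real CPU-friendly encoder, `MiniRAG` and the prompt template are illustrative, and the actual LLM call is omitted.

```python
import math
import re

from rank_bm25 import BM25Okapi  # third-party: pip install rank-bm25

def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())

def embed(text: str, dim: int = 256) -> list[float]:
    """Toy hashed bag-of-words vector, L2-normalized. Placeholder for a
    real small embedding model; stable only within one process (Python
    salts str hashes per run)."""
    vec = [0.0] * dim
    for tok in tokenize(text):
        vec[hash(tok) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def rrf(rankings: list[list[int]], k: int = 60) -> list[int]:
    """Reciprocal rank fusion: robust to incomparable score scales."""
    scores: dict[int, float] = {}
    for ranking in rankings:
        for rank, idx in enumerate(ranking):
            scores[idx] = scores.get(idx, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

class MiniRAG:
    """Hybrid (sparse + dense) retrieval over chunks from chunk_markdown."""

    def __init__(self, chunks: list[dict]):
        self.chunks = chunks
        self.bm25 = BM25Okapi([tokenize(c["text"]) for c in chunks])
        self.vecs = [embed(c["text"]) for c in chunks]

    def retrieve(self, query: str, top_k: int = 4) -> list[dict]:
        sparse = self.bm25.get_scores(tokenize(query))
        qvec = embed(query)
        dense = [sum(x * y for x, y in zip(qvec, v)) for v in self.vecs]
        by_sparse = sorted(range(len(self.chunks)), key=lambda i: -sparse[i])
        by_dense = sorted(range(len(self.chunks)), key=lambda i: -dense[i])
        fused = rrf([by_sparse, by_dense])
        return [self.chunks[i] for i in fused[:top_k]]

    def build_prompt(self, query: str, passages: list[dict]) -> str:
        """Citation-bearing prompt; the LLM call itself is omitted here."""
        ctx = "\n\n".join(
            f"[{p['doc_id']} / {p['section']}]\n{p['text']}" for p in passages
        )
        return (
            "Answer using ONLY the passages below, citing each fact as "
            "[doc_id / section]. If the passages do not contain the answer, "
            "say so and recommend escalation.\n\n"
            f"{ctx}\n\nQuestion: {query}\nAnswer:"
        )
```

RRF sidesteps score calibration between the two retrievers, which matters for fintech policy text where exact identifiers (e.g., NACHA return codes, MCC numbers) must surface even when semantic similarity is weak; the prompt template also bakes in the escalation behavior from the success criteria.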