Business Context
MercuryPay is a fintech app with 18M monthly active users, offering debit cards, P2P transfers, and small-business checking; it processes ~120,000 customer-support chat conversations per day. A new policy requires the system to reliably detect and route high-risk issues (e.g., account takeover, chargeback fraud, AML-related questions) to a specialized queue within 30 seconds, while routine issues (password reset, card shipping, fee questions) should be handled by a lower-cost automation flow.
You’re asked to design an LLM-based routing system and explicitly decide when to use prompt engineering vs fine-tuning (and potentially a hybrid). The current baseline is a TF-IDF + linear classifier that struggles with emerging fraud phrasing and new product features, causing misroutes that increase fraud losses and violate internal SLAs.
Data Characteristics
- Volume: 9.5M historical chat transcripts (last 12 months); 2.1M have high-quality agent-applied labels.
- Text structure: Each transcript contains 3–30 turns (median 9). You may classify at the conversation level using the most recent N turns.
- Length: 40–1,200 tokens (median ~220 tokens) after removing boilerplate.
- Language: 88% English, 8% Spanish, 4% mixed/other.
- Labels (multi-class, mutually exclusive):
| Route | Approx. Share | Notes |
|---|---|---|
| Account Takeover (ATO) | 2.5% | Highest severity; false negatives are costly |
| Card / Chargeback Fraud | 4.0% | Often ambiguous language, evolving patterns |
| Payments / Transfer Issues | 18% | Includes failed transfers, pending transactions |
| KYC / AML / Compliance | 3.5% | Regulated; requires careful handling |
| Account Access / Password | 22% | High volume, repetitive |
| Fees / Pricing / Disputes | 20% | Often sentiment-heavy |
| General / Other | 30% | Catch-all; noisy |
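The shares above are heavily imbalanced (the two costliest routes, ATO and KYC/AML, together cover only 6% of traffic), so any fine-tuned baseline will likely need class weighting. A minimal sketch of one option, inverse-frequency weights; the route keys and severity tiers are illustrative assumptions, not a given schema:

```python
# Hypothetical route schema mirroring the table above; keys, severity
# tiers, and the weighting scheme are assumptions for illustration.
ROUTES = {
    "ato":        {"share": 2.5,  "severity": "critical"},
    "card_fraud": {"share": 4.0,  "severity": "high"},
    "payments":   {"share": 18.0, "severity": "medium"},
    "kyc_aml":    {"share": 3.5,  "severity": "high"},
    "access":     {"share": 22.0, "severity": "low"},
    "fees":       {"share": 20.0, "severity": "low"},
    "other":      {"share": 30.0, "severity": "low"},
}

def inverse_frequency_weights(routes: dict) -> dict:
    """Class weights proportional to 1/share, normalized to mean 1.0.
    One common way to counter imbalance in a cross-entropy loss."""
    raw = {k: 1.0 / v["share"] for k, v in routes.items()}
    mean = sum(raw.values()) / len(raw)
    return {k: w / mean for k, w in raw.items()}
```

Inverse-frequency weighting is only a starting point; severity-aware weights (e.g., upweighting ATO beyond its frequency) may better reflect the asymmetric cost of false negatives.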
Success Criteria
- ATO recall ≥ 97% on a temporally held-out test set (fraud loss prevention).
- Macro-F1 ≥ 0.84 across all routes.
- P95 latency ≤ 250 ms per conversation on a single A10 GPU (batching allowed).
- Auditability: Provide a rationale trace (e.g., top contributing spans or retrieved policy snippets) suitable for internal compliance review.
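The first two gates can be checked with plain confusion counts. A dependency-free sketch of the metric computation (equivalent to sklearn's macro-F1 and per-class recall); function names are ours, not part of the spec:

```python
from collections import Counter

def macro_f1_and_recall(y_true, y_pred, positive="ato"):
    """Macro-F1 over all labels plus recall for one high-severity
    label, from plain counters. Illustrative sketch only."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1
            fn[t] += 1
    f1s = []
    for lbl in labels:
        prec = tp[lbl] / (tp[lbl] + fp[lbl]) if tp[lbl] + fp[lbl] else 0.0
        rec = tp[lbl] / (tp[lbl] + fn[lbl]) if tp[lbl] + fn[lbl] else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    denom = tp[positive] + fn[positive]
    pos_recall = tp[positive] / denom if denom else 0.0
    return sum(f1s) / len(f1s), pos_recall
```

Evaluating against the ATO recall ≥ 0.97 gate then reduces to `macro_f1_and_recall(y_true, y_pred)[1] >= 0.97` on the temporally held-out split.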
Constraints
- Data residency: Training and inference must run in MercuryPay’s VPC; no third-party hosted APIs.
- PII: Chats contain emails, phone numbers, last-4 SSN, and transaction IDs; must be redacted before storage in training artifacts.
- Concept drift: Fraud language changes weekly; product launches change support topics.
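The PII constraint implies a redaction pass before any transcript lands in a training artifact. A minimal sketch with hand-rolled regexes; in production this would be a vetted PII detector, and these patterns (and placeholder tokens) are illustrative assumptions only:

```python
import re

# Illustrative redaction pass; patterns are rough assumptions, not a
# production-grade PII detector. Order matters: email first so phone
# and ID patterns never fire inside an already-redacted span.
PII_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "<EMAIL>"),
    (re.compile(r"\+?\d[\d\s().-]{8,}\d"), "<PHONE>"),
    (re.compile(r"\b(?:SSN|ssn)\D{0,10}(\d{4})\b"), "<SSN_LAST4>"),
    (re.compile(r"\btxn[_-]?[A-Za-z0-9]{6,}\b"), "<TXN_ID>"),
]

def redact(text: str) -> str:
    """Replace each detected PII span with a typed placeholder token."""
    for pattern, token in PII_PATTERNS:
        text = pattern.sub(token, text)
    return text
```

Typed placeholders (rather than a single `<REDACTED>`) preserve signal the classifier may need, e.g., the presence of a transaction ID is itself predictive of the payments route.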
Requirements (Deliverables)
- Explain the key differences between prompt engineering and fine-tuning in this setting: what changes (weights vs instructions), data needs, iteration speed, failure modes, and governance.
- Propose a decision framework: for each label group, justify whether you’d start with prompting, fine-tuning, or a hybrid (e.g., prompt + small LoRA adapter).
- Provide a baseline implementation that supports both:
- (a) a prompt-based classifier (few-shot) using a local instruction-tuned model, and
- (b) a fine-tuned classifier (LoRA/PEFT) on labeled chats.
- Define an evaluation plan including temporal validation, class-imbalance handling, and error analysis focused on ATO false negatives.
- Describe how you would monitor drift and decide when to re-prompt vs re-train.
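For deliverable (a), the prompt-based path reduces to building a few-shot prompt and constraining the model's free-text output back onto the label set. A sketch under assumed names; the exemplars, route strings, and helper functions are illustrative, and the actual generation call to the local instruction-tuned model is out of scope here:

```python
# Hypothetical few-shot prompt builder for the prompt-based classifier.
# Routes and exemplars are illustrative; a real system would retrieve
# per-route exemplars and run a local instruction-tuned model in-VPC.
ROUTES = ["ato", "card_fraud", "payments", "kyc_aml", "access", "fees", "other"]

FEW_SHOT = [
    ("Someone changed my login email and I can't get in.", "ato"),
    ("My card was charged twice for the same purchase.", "card_fraud"),
    ("When does my replacement card ship?", "other"),
]

def build_prompt(conversation: str, n_turns_note: str = "last 6 turns") -> str:
    lines = [
        "You route fintech support chats. Answer with exactly one label "
        f"from: {', '.join(ROUTES)}.",
        "",
    ]
    for text, label in FEW_SHOT:
        lines.append(f"Chat: {text}\nLabel: {label}\n")
    lines.append(f"Chat ({n_turns_note}): {conversation}\nLabel:")
    return "\n".join(lines)

def parse_label(completion: str) -> str:
    """Map free-text output onto the label set; fall back to 'other'
    so a malformed generation never drops a conversation."""
    stripped = completion.strip()
    first = stripped.split()[0].strip(".,").lower() if stripped else ""
    return first if first in ROUTES else "other"
```

Routing the fallback to `other` rather than discarding keeps every conversation in a queue; a stricter variant could route unparseable outputs to human review instead, which matters given the ATO recall gate.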
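For the final deliverable, one common drift signal is the Population Stability Index over predicted-route shares between a reference window and the live window. A sketch; the thresholds below are conventional rules of thumb (0.1 / 0.25), not MercuryPay policy, and the action names are our assumptions:

```python
import math

def population_stability_index(expected: dict, observed: dict) -> float:
    """PSI between two route-share distributions (dicts of route ->
    probability). Larger means more distribution shift."""
    psi = 0.0
    for route in expected:
        e = max(expected[route], 1e-6)
        o = max(observed.get(route, 0.0), 1e-6)
        psi += (o - e) * math.log(o / e)
    return psi

def drift_action(psi: float) -> str:
    """Map a PSI value to an operational response. Thresholds are
    conventional rules of thumb, tuned in practice per route."""
    if psi < 0.10:
        return "no_action"
    if psi < 0.25:
        return "re_prompt"   # refresh few-shot exemplars / instructions
    return "re_train"        # schedule a LoRA adapter refresh
```

The asymmetry this encodes: re-prompting (swapping exemplars, tightening instructions) is cheap and fast, so it is the first response to moderate drift; re-training is reserved for shifts large enough that the label boundary itself has likely moved.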