Business Context
StripeShield (a fictional fintech) provides fraud prevention and account support for 18M monthly active users across the US and EU. The in-app support chat handles ~220K conversations/day, and a subset (chargebacks, account takeover, card testing) requires real-time intervention: if the system can’t route the conversation to the right queue within seconds, fraud losses increase and legitimate users get locked out.
The team wants to use an LLM-driven triage service that (1) classifies each incoming user message into an operational category and urgency, and (2) extracts key entities (transaction IDs, amounts, merchants) to pre-fill internal tools. The service must run in production with predictable latency and strong safety/compliance guarantees.
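The classification-plus-extraction contract described above can be pinned down as a small data structure. This is an illustrative sketch only; the field names (`TriageResult`, `urgency`, etc.) are assumptions, not a mandated schema:

```python
# Hypothetical output contract for the triage service (illustrative names).
from dataclasses import dataclass
from typing import Optional

@dataclass
class TriageResult:
    category: str                        # one of: ATO, FRAUD, DISPUTE, PAYOUTS, KYC, BILLING, GENERAL
    urgency: str                         # e.g. "critical" | "high" | "normal"
    confidence: float                    # model score in [0, 1]
    transaction_id: Optional[str] = None # extracted entities pre-fill internal tools
    amount: Optional[str] = None
    merchant: Optional[str] = None

# Example: what the service would emit for an unauthorized-transaction report.
result = TriageResult(category="FRAUD", urgency="critical", confidence=0.93,
                      transaction_id="txn_8f2a", amount="49.99")
```

Keeping entities optional matters because many messages (median 42 tokens) simply don't contain them.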
Data Characteristics
- Volume: 9.5M historical chat messages with human-applied routing labels; 60 days of logs available for near-real-time evaluation.
- Text length: 3–900 tokens; median 42 tokens. Many messages are short, fragmented, and multi-turn.
- Language: English (78%), Spanish (11%), Portuguese (6%), French (3%), other (2%). Code-switching appears in ~4% of non-English chats.
- Domain vocabulary: chargeback reason codes, KYC/AML terms, “3DS”, “ACH return”, “merchant descriptor”, “BIN”, “dispute window”.
- Label distribution (routing):
| Label | Meaning | Proportion |
|---|---|---|
| ATO | Account takeover / login compromised | 2.5% |
| FRAUD | Unauthorized transaction / card testing | 6.0% |
| DISPUTE | Chargeback/dispute workflow | 14.5% |
| PAYOUTS | Payout holds / reserves | 7.0% |
| KYC | Identity verification | 10.0% |
| BILLING | Subscription/invoice/payment method | 35.0% |
| GENERAL | Everything else | 25.0% |
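The distribution is heavily skewed toward BILLING/GENERAL while the high-risk classes (ATO at 2.5%) are rare. One common mitigation, offered here as an assumption rather than a requirement of the brief, is inverse-frequency class weighting during training:

```python
# Inverse-frequency class weights for the skewed routing labels.
# Proportions come from the table above; weight_c = 1 / (n_classes * p_c),
# so the expected weight over the dataset averages to 1.
proportions = {
    "ATO": 0.025, "FRAUD": 0.060, "DISPUTE": 0.145,
    "PAYOUTS": 0.070, "KYC": 0.100, "BILLING": 0.350, "GENERAL": 0.250,
}
n_classes = len(proportions)
weights = {label: 1.0 / (n_classes * p) for label, p in proportions.items()}
```

Rare but high-risk classes (ATO) end up with the largest weights, which aligns the loss with the Recall ≥ 0.95 target below.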
Success Criteria
- Business: Reduce median time-to-correct-queue from 6 minutes to <30 seconds for ATO/FRAUD, and cut fraud loss by 0.5–1.0 bps.
- Model: Macro-F1 ≥ 0.86 overall; Recall ≥ 0.95 on ATO (false negatives are high-risk).
- Entity extraction: ≥ 0.90 F1 for transaction_id and amount on a labeled evaluation set.
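The model gates above (macro-F1 ≥ 0.86, ATO recall ≥ 0.95) can be checked with a small pure-Python evaluator; this sketch mirrors what scikit-learn's `classification_report` computes, with a release-gate helper (`gates_pass` is a hypothetical name):

```python
def per_class_stats(y_true, y_pred, labels):
    """Per-class precision/recall/F1 from parallel label lists."""
    stats = {}
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        stats[c] = {"precision": prec, "recall": rec, "f1": f1}
    return stats

def gates_pass(stats):
    """Release gate: macro-F1 >= 0.86 overall AND recall >= 0.95 on ATO."""
    macro_f1 = sum(s["f1"] for s in stats.values()) / len(stats)
    return macro_f1 >= 0.86 and stats["ATO"]["recall"] >= 0.95
```

Treating ATO recall as a separate hard gate (rather than folding it into the macro average) is deliberate: a model can hit 0.86 macro-F1 while still missing the rare, high-risk class.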
Constraints (Real-Time LLM Challenges)
- Latency: P50 < 250ms, P95 < 900ms end-to-end per message (including preprocessing, retrieval, model inference, and postprocessing).
- Cost: ≤ $0.002 per message at peak (3,500 msg/min). Must support burst traffic.
- Reliability: 99.9% availability; graceful degradation when the LLM is slow/unavailable.
- Compliance: GDPR + PCI-like constraints; no raw PAN storage; redact sensitive data before any logging.
- Safety: Prevent prompt injection and tool misuse (e.g., a user attempting to extract internal policy text or trigger privileged actions).
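The "redact before any logging" constraint is typically enforced with a pre-logging scrubber. A minimal sketch of one approach: match digit runs of PAN length and redact only those that pass the Luhn checksum, which avoids clobbering harmless numbers like order IDs (the function names here are illustrative):

```python
import re

def luhn_ok(digits: str) -> bool:
    """Luhn checksum; filters out random digit runs that are not card numbers."""
    total, parity = 0, len(digits) % 2
    for i, ch in enumerate(digits):
        d = int(ch)
        if i % 2 == parity:   # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

# 13-19 digits, optionally separated by spaces or hyphens, ending on a digit.
PAN_RE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")

def redact(text: str) -> str:
    """Replace Luhn-valid card-number candidates before the text is logged."""
    def repl(m):
        digits = re.sub(r"[ -]", "", m.group())
        return "[PAN_REDACTED]" if luhn_ok(digits) else m.group()
    return PAN_RE.sub(repl, text)
```

Running redaction before language detection and classification (not just before logging) also keeps PANs out of prompts sent to any third-party model endpoint.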
Requirements (Deliverables)
- Propose an architecture for real-time triage using an LLM (or hybrid approach) that meets the latency/cost/SLA constraints.
- Define a preprocessing pipeline (language detection, redaction, normalization) and justify each step.
- Choose a modeling approach for (a) routing classification and (b) entity extraction; explain trade-offs vs alternatives.
- Provide an implementation sketch in Python using transformers + spaCy (and optional lightweight retrieval).
- Describe an evaluation plan covering offline metrics, latency/load testing, and production monitoring (including drift and safety).
- Explain at least 5 concrete challenges of using LLMs in real-time applications (e.g., tail latency, context window, caching, streaming, determinism, prompt injection, observability, multilingual, hallucinations) and how your design mitigates them.
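Two of the listed challenges, tail latency and graceful degradation, compose naturally: enforce a hard latency budget on the LLM call and fall back to a cheap deterministic router when it is exceeded. A minimal sketch, with hypothetical keyword lists and `llm_call` standing in for any real model client:

```python
import concurrent.futures

# Hypothetical fallback rules; a production version would be tuned and audited.
FALLBACK_KEYWORDS = {
    "ATO": ("hacked", "locked out", "someone logged"),
    "FRAUD": ("unauthorized", "didn't make", "card testing"),
    "DISPUTE": ("chargeback", "dispute"),
}

def keyword_route(text: str) -> str:
    """Deterministic, microsecond-scale fallback router."""
    low = text.lower()
    for label, kws in FALLBACK_KEYWORDS.items():
        if any(k in low for k in kws):
            return label
    return "GENERAL"

# Shared executor so a timed-out call does not block the request thread.
EXECUTOR = concurrent.futures.ThreadPoolExecutor(max_workers=8)

def triage(text: str, llm_call, budget_s: float = 0.9) -> str:
    """Try the LLM within the P95 budget; degrade on timeout or error."""
    future = EXECUTOR.submit(llm_call, text)
    try:
        return future.result(timeout=budget_s)
    except Exception:           # TimeoutError, transport errors, etc.
        return keyword_route(text)
```

The budget here matches the P95 < 900ms SLA; a real service would also record which path answered, so the fallback rate becomes a monitored degradation signal.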