Business Context
StripeShield (a fictional fintech) provides fraud prevention and account support for 18M monthly active users across the US and EU. The in-app support chat handles ~220K conversations/day, and a subset (chargebacks, account takeover, card testing) requires real-time intervention: if the system can’t route the conversation to the right queue within seconds, fraud losses increase and legitimate users get locked out.
The team wants to use an LLM-driven triage service that (1) classifies each incoming user message into an operational category and urgency, and (2) extracts key entities (transaction IDs, amounts, merchants) to pre-fill internal tools. The service must run in production with predictable latency and strong safety/compliance guarantees.
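The classification-plus-extraction contract described above can be pinned down as a small data structure. This is an illustrative sketch only; the field names (`TriageResult`, `urgency`, etc.) are assumptions, not a mandated schema:

```python
# Hypothetical output contract for the triage service (illustrative names).
from dataclasses import dataclass
from typing import Optional

@dataclass
class TriageResult:
    category: str                        # one of: ATO, FRAUD, DISPUTE, PAYOUTS, KYC, BILLING, GENERAL
    urgency: str                         # e.g. "critical" | "high" | "normal"
    confidence: float                    # model score in [0, 1]
    transaction_id: Optional[str] = None # extracted entities pre-fill internal tools
    amount: Optional[str] = None
    merchant: Optional[str] = None

# Example: what the service would emit for an unauthorized-transaction report.
result = TriageResult(category="FRAUD", urgency="critical", confidence=0.93,
                      transaction_id="txn_8f2a", amount="49.99")
```

Keeping entities optional matters because many messages (median 42 tokens) simply don't contain them.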
Data Characteristics
- Volume: 9.5M historical chat messages with human-applied routing labels; 60 days of logs available for near-real-time evaluation.
- Text length: 3–900 tokens; median 42 tokens. Many messages are short, fragmented, and multi-turn.
- Language: English (78%), Spanish (11%), Portuguese (6%), French (3%), other (2%). Code-switching appears in ~4% of non-English chats.
- Domain vocabulary: chargeback reason codes, KYC/AML terms, “3DS”, “ACH return”, “merchant descriptor”, “BIN”, “dispute window”.
- Label distribution (routing):
| Label | Meaning | Proportion |
|---|---|---|
| ATO | Account takeover / login compromised | 2.5% |
| FRAUD | Unauthorized transaction / card testing | 6.0% |
| DISPUTE | Chargeback/dispute workflow | 14.5% |
| PAYOUTS | Payout holds / reserves | 7.0% |
| KYC | Identity verification | 10.0% |
| BILLING | Subscription/invoice/payment method | 35.0% |
| GENERAL | Everything else | 25.0% |
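The distribution is heavily skewed toward BILLING/GENERAL while the high-risk classes (ATO at 2.5%) are rare. One common mitigation, offered here as an assumption rather than a requirement of the brief, is inverse-frequency class weighting during training:

```python
# Inverse-frequency class weights for the skewed routing labels.
# Proportions come from the table above; weight_c = 1 / (n_classes * p_c),
# so the expected weight over the dataset averages to 1.
proportions = {
    "ATO": 0.025, "FRAUD": 0.060, "DISPUTE": 0.145,
    "PAYOUTS": 0.070, "KYC": 0.100, "BILLING": 0.350, "GENERAL": 0.250,
}
n_classes = len(proportions)
weights = {label: 1.0 / (n_classes * p) for label, p in proportions.items()}
```

Rare but high-risk classes (ATO) end up with the largest weights, which aligns the loss with the Recall ≥ 0.95 target below.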
Success Criteria
- Business: Reduce median time-to-correct-queue from 6 minutes to <30 seconds for ATO/FRAUD, and cut fraud loss by 0.5–1.0 bps.
- Model: Macro-F1 ≥ 0.86 overall; Recall ≥ 0.95 on ATO (false negatives are high-risk).
- Entity extraction: ≥ 0.90 F1 for transaction_id and amount on a labeled evaluation set.
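The model gates above (macro-F1 ≥ 0.86, ATO recall ≥ 0.95) can be checked with a small pure-Python evaluator; this sketch mirrors what scikit-learn's `classification_report` computes, with a release-gate helper (`gates_pass` is a hypothetical name):

```python
def per_class_stats(y_true, y_pred, labels):
    """Per-class precision/recall/F1 from parallel label lists."""
    stats = {}
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        stats[c] = {"precision": prec, "recall": rec, "f1": f1}
    return stats

def gates_pass(stats):
    """Release gate: macro-F1 >= 0.86 overall AND recall >= 0.95 on ATO."""
    macro_f1 = sum(s["f1"] for s in stats.values()) / len(stats)
    return macro_f1 >= 0.86 and stats["ATO"]["recall"] >= 0.95
```

Treating ATO recall as a separate hard gate (rather than folding it into the macro average) is deliberate: a model can hit 0.86 macro-F1 while still missing the rare, high-risk class.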
Constraints (Real-Time LLM Challenges)
- Latency: P50 < 250ms, P95 < 900ms end-to-end per message (including preprocessing, retrieval, model inference, and postprocessing).
- Cost: ≤ $0.002 per message at peak (3,500 msg/min). Must support burst traffic.
- Reliability: 99.9% availability; graceful degradation when the LLM is slow/unavailable.
- Compliance: GDPR + PCI-like constraints; no raw PAN storage; redact sensitive data before any logging.
- Safety: Prevent prompt injection and tool misuse (e.g., a user attempting to extract internal policy text or trigger privileged actions).
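The "redact before any logging" constraint is typically enforced with a pre-logging scrubber. A minimal sketch of one approach: match digit runs of PAN length and redact only those that pass the Luhn checksum, which avoids clobbering harmless numbers like order IDs (the function names here are illustrative):

```python
import re

def luhn_ok(digits: str) -> bool:
    """Luhn checksum; filters out random digit runs that are not card numbers."""
    total, parity = 0, len(digits) % 2
    for i, ch in enumerate(digits):
        d = int(ch)
        if i % 2 == parity:   # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

# 13-19 digits, optionally separated by spaces or hyphens, ending on a digit.
PAN_RE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")

def redact(text: str) -> str:
    """Replace Luhn-valid card-number candidates before the text is logged."""
    def repl(m):
        digits = re.sub(r"[ -]", "", m.group())
        return "[PAN_REDACTED]" if luhn_ok(digits) else m.group()
    return PAN_RE.sub(repl, text)
```

Running redaction before language detection and classification (not just before logging) also keeps PANs out of prompts sent to any third-party model endpoint.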
Requirements (Deliverables)
- Propose an architecture for real-time triage using an LLM (or hybrid approach) that meets the latency/cost/SLA constraints.
- Define a preprocessing pipeline (language detection, redaction, normalization) and justify each step.
- Choose a modeling approach for (a) routing classification and (b) entity extraction; explain trade-offs vs alternatives.
- Provide an implementation sketch in Python using transformers + spaCy (and optional lightweight retrieval).
- Describe an evaluation plan covering offline metrics, latency/load testing, and production monitoring (including drift and safety).
- Explain at least 5 concrete challenges of using LLMs in real-time applications (e.g., tail latency, context window, caching, streaming, determinism, prompt injection, observability, multilingual, hallucinations) and how your design mitigates them.
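Two of the listed challenges, tail latency and graceful degradation, compose naturally: enforce a hard latency budget on the LLM call and fall back to a cheap deterministic router when it is exceeded. A minimal sketch, with hypothetical keyword lists and `llm_call` standing in for any real model client:

```python
import concurrent.futures

# Hypothetical fallback rules; a production version would be tuned and audited.
FALLBACK_KEYWORDS = {
    "ATO": ("hacked", "locked out", "someone logged"),
    "FRAUD": ("unauthorized", "didn't make", "card testing"),
    "DISPUTE": ("chargeback", "dispute"),
}

def keyword_route(text: str) -> str:
    """Deterministic, microsecond-scale fallback router."""
    low = text.lower()
    for label, kws in FALLBACK_KEYWORDS.items():
        if any(k in low for k in kws):
            return label
    return "GENERAL"

# Shared executor so a timed-out call does not block the request thread.
EXECUTOR = concurrent.futures.ThreadPoolExecutor(max_workers=8)

def triage(text: str, llm_call, budget_s: float = 0.9) -> str:
    """Try the LLM within the P95 budget; degrade on timeout or error."""
    future = EXECUTOR.submit(llm_call, text)
    try:
        return future.result(timeout=budget_s)
    except Exception:           # TimeoutError, transport errors, etc.
        return keyword_route(text)
```

The budget here matches the P95 < 900ms SLA; a real service would also record which path answered, so the fallback rate becomes a monitored degradation signal.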