Business Context
MercuryPay is a fintech app with 18M monthly active users, offering debit cards, P2P transfers, and small-business checking; it processes ~120,000 customer-support chat conversations per day. A new policy requires the system to reliably detect and route high-risk issues (e.g., account takeover, chargeback fraud, AML-related questions) to a specialized queue within 30 seconds, while routine issues (password reset, card shipping, fee questions) should be handled by a lower-cost automation flow.
You’re asked to design an LLM-based routing system and explicitly decide when to use prompt engineering vs fine-tuning (and potentially a hybrid). The current baseline is a TF-IDF + linear classifier that struggles with emerging fraud phrasing and new product features, causing misroutes that increase fraud losses and violate internal SLAs.
Data Characteristics
- Volume: 9.5M historical chat transcripts (last 12 months); 2.1M have high-quality agent-applied labels.
- Text structure: Each transcript contains 3–30 turns (median 9). You may classify at the conversation level using the most recent N turns.
- Length: 40–1,200 tokens (median ~220 tokens) after removing boilerplate.
- Language: 88% English, 8% Spanish, 4% mixed/other.
- Labels (multi-class, mutually exclusive):
| Route | Approx. Share | Notes |
|---|---|---|
| Account Takeover (ATO) | 2.5% | Highest severity; false negatives are costly |
| Card / Chargeback Fraud | 4.0% | Often ambiguous language, evolving patterns |
| Payments / Transfer Issues | 18% | Includes failed transfers, pending transactions |
| KYC / AML / Compliance | 3.5% | Regulated; requires careful handling |
| Account Access / Password | 22% | High volume, repetitive |
| Fees / Pricing / Disputes | 20% | Often sentiment-heavy |
| General / Other | 30% | Catch-all; noisy |
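The shares above are heavily imbalanced (the two costliest routes, ATO and KYC/AML, together cover only 6% of traffic), so any fine-tuned baseline will likely need class weighting. A minimal sketch of one option, inverse-frequency weights; the route keys and severity tiers are illustrative assumptions, not a given schema:

```python
# Hypothetical route schema mirroring the table above; keys, severity
# tiers, and the weighting scheme are assumptions for illustration.
ROUTES = {
    "ato":        {"share": 2.5,  "severity": "critical"},
    "card_fraud": {"share": 4.0,  "severity": "high"},
    "payments":   {"share": 18.0, "severity": "medium"},
    "kyc_aml":    {"share": 3.5,  "severity": "high"},
    "access":     {"share": 22.0, "severity": "low"},
    "fees":       {"share": 20.0, "severity": "low"},
    "other":      {"share": 30.0, "severity": "low"},
}

def inverse_frequency_weights(routes: dict) -> dict:
    """Class weights proportional to 1/share, normalized to mean 1.0.
    One common way to counter imbalance in a cross-entropy loss."""
    raw = {k: 1.0 / v["share"] for k, v in routes.items()}
    mean = sum(raw.values()) / len(raw)
    return {k: w / mean for k, w in raw.items()}
```

Inverse-frequency weighting is only a starting point; severity-aware weights (e.g., upweighting ATO beyond its frequency) may better reflect the asymmetric cost of false negatives.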
Success Criteria
- ATO recall ≥ 97% on a temporally held-out test set (fraud loss prevention).
- Macro-F1 ≥ 0.84 across all routes.
- P95 latency ≤ 250 ms per conversation on a single A10 GPU (batching allowed).
- Auditability: Provide a rationale trace (e.g., top contributing spans or retrieved policy snippets) suitable for internal compliance review.
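The first two gates can be checked with plain confusion counts. A dependency-free sketch of the metric computation (equivalent to sklearn's macro-F1 and per-class recall); function names are ours, not part of the spec:

```python
from collections import Counter

def macro_f1_and_recall(y_true, y_pred, positive="ato"):
    """Macro-F1 over all labels plus recall for one high-severity
    label, from plain counters. Illustrative sketch only."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1
            fn[t] += 1
    f1s = []
    for lbl in labels:
        prec = tp[lbl] / (tp[lbl] + fp[lbl]) if tp[lbl] + fp[lbl] else 0.0
        rec = tp[lbl] / (tp[lbl] + fn[lbl]) if tp[lbl] + fn[lbl] else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    denom = tp[positive] + fn[positive]
    pos_recall = tp[positive] / denom if denom else 0.0
    return sum(f1s) / len(f1s), pos_recall
```

Evaluating against the ATO recall ≥ 0.97 gate then reduces to `macro_f1_and_recall(y_true, y_pred)[1] >= 0.97` on the temporally held-out split.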
Constraints
- Data residency: Training and inference must run in MercuryPay’s VPC; no third-party hosted APIs.
- PII: Chats contain emails, phone numbers, last-4 SSN, and transaction IDs; must be redacted before storage in training artifacts.
- Concept drift: Fraud language changes weekly; product launches change support topics.
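The PII constraint implies a redaction pass before any transcript lands in a training artifact. A minimal sketch with hand-rolled regexes; in production this would be a vetted PII detector, and these patterns (and placeholder tokens) are illustrative assumptions only:

```python
import re

# Illustrative redaction pass; patterns are rough assumptions, not a
# production-grade PII detector. Order matters: email first so phone
# and ID patterns never fire inside an already-redacted span.
PII_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "<EMAIL>"),
    (re.compile(r"\+?\d[\d\s().-]{8,}\d"), "<PHONE>"),
    (re.compile(r"\b(?:SSN|ssn)\D{0,10}(\d{4})\b"), "<SSN_LAST4>"),
    (re.compile(r"\btxn[_-]?[A-Za-z0-9]{6,}\b"), "<TXN_ID>"),
]

def redact(text: str) -> str:
    """Replace each detected PII span with a typed placeholder token."""
    for pattern, token in PII_PATTERNS:
        text = pattern.sub(token, text)
    return text
```

Typed placeholders (rather than a single `<REDACTED>`) preserve signal the classifier may need, e.g., the presence of a transaction ID is itself predictive of the payments route.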
Requirements (Deliverables)
- Explain the key differences between prompt engineering and fine-tuning in this setting: what changes (weights vs instructions), data needs, iteration speed, failure modes, and governance.
- Propose a decision framework: for each label group, justify whether you’d start with prompting, fine-tuning, or a hybrid (e.g., prompt + small LoRA adapter).
- Provide a baseline implementation that supports both:
- (a) a prompt-based classifier (few-shot) using a local instruction-tuned model, and
- (b) a fine-tuned classifier (LoRA/PEFT) on labeled chats.
- Define an evaluation plan including temporal validation, class-imbalance handling, and error analysis focused on ATO false negatives.
- Describe how you would monitor drift and decide when to re-prompt vs re-train.
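For deliverable (a), the prompt-based path reduces to building a few-shot prompt and constraining the model's free-text output back onto the label set. A sketch under assumed names; the exemplars, route strings, and helper functions are illustrative, and the actual generation call to the local instruction-tuned model is out of scope here:

```python
# Hypothetical few-shot prompt builder for the prompt-based classifier.
# Routes and exemplars are illustrative; a real system would retrieve
# per-route exemplars and run a local instruction-tuned model in-VPC.
ROUTES = ["ato", "card_fraud", "payments", "kyc_aml", "access", "fees", "other"]

FEW_SHOT = [
    ("Someone changed my login email and I can't get in.", "ato"),
    ("My card was charged twice for the same purchase.", "card_fraud"),
    ("When does my replacement card ship?", "other"),
]

def build_prompt(conversation: str, n_turns_note: str = "last 6 turns") -> str:
    lines = [
        "You route fintech support chats. Answer with exactly one label "
        f"from: {', '.join(ROUTES)}.",
        "",
    ]
    for text, label in FEW_SHOT:
        lines.append(f"Chat: {text}\nLabel: {label}\n")
    lines.append(f"Chat ({n_turns_note}): {conversation}\nLabel:")
    return "\n".join(lines)

def parse_label(completion: str) -> str:
    """Map free-text output onto the label set; fall back to 'other'
    so a malformed generation never drops a conversation."""
    stripped = completion.strip()
    first = stripped.split()[0].strip(".,").lower() if stripped else ""
    return first if first in ROUTES else "other"
```

Routing the fallback to `other` rather than discarding keeps every conversation in a queue; a stricter variant could route unparseable outputs to human review instead, which matters given the ATO recall gate.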
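For the final deliverable, one common drift signal is the Population Stability Index over predicted-route shares between a reference window and the live window. A sketch; the thresholds below are conventional rules of thumb (0.1 / 0.25), not MercuryPay policy, and the action names are our assumptions:

```python
import math

def population_stability_index(expected: dict, observed: dict) -> float:
    """PSI between two route-share distributions (dicts of route ->
    probability). Larger means more distribution shift."""
    psi = 0.0
    for route in expected:
        e = max(expected[route], 1e-6)
        o = max(observed.get(route, 0.0), 1e-6)
        psi += (o - e) * math.log(o / e)
    return psi

def drift_action(psi: float) -> str:
    """Map a PSI value to an operational response. Thresholds are
    conventional rules of thumb, tuned in practice per route."""
    if psi < 0.10:
        return "no_action"
    if psi < 0.25:
        return "re_prompt"   # refresh few-shot exemplars / instructions
    return "re_train"        # schedule a LoRA adapter refresh
```

The asymmetry this encodes: re-prompting (swapping exemplars, tightening instructions) is cheap and fast, so it is the first response to moderate drift; re-training is reserved for shifts large enough that the label boundary itself has likely moved.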