ShopFlow uses a large language model to draft customer-support replies for order, refund, and shipping questions. The team wants a prompt-engineering solution that improves answer quality and consistency without fine-tuning a new model.
You have 120,000 historical support conversations, 18,000 human-written gold responses, and a prompt test set of 5,000 recent tickets. Messages are primarily in English; 8% are in Spanish and are routed through translation. Ticket length ranges from 20 to 900 tokens, with a median of 140 tokens. Common intents include order status (35%), returns/refunds (25%), shipping delays (20%), account issues (10%), and policy questions (10%). Evaluation labels include intent, factual correctness, tone compliance, and resolution status.
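To keep an evaluation sample representative of the intent mix above, one option is to allocate tickets per intent in proportion to the stated percentages. The following is a minimal sketch, not a prescribed method: the function name `allocate_by_intent`, the constant `INTENT_MIX_PCT`, and the intent keys are hypothetical labels chosen for illustration, and largest-remainder rounding is an assumed tie-breaking choice.

```python
# Intent mix in whole percent, taken from the brief. The key names are
# hypothetical labels, not identifiers from any ShopFlow system.
INTENT_MIX_PCT = {
    "order_status": 35,
    "returns_refunds": 25,
    "shipping_delays": 20,
    "account_issues": 10,
    "policy_questions": 10,
}


def allocate_by_intent(total, mix_pct=INTENT_MIX_PCT):
    """Split `total` tickets across intents in proportion to `mix_pct`.

    Integer arithmetic avoids float rounding; largest-remainder rounding
    (an assumed choice) makes the counts sum exactly to `total`.
    """
    if sum(mix_pct.values()) != 100:
        raise ValueError("intent percentages must sum to 100")
    # Floor of each intent's proportional share.
    counts = {intent: total * pct // 100 for intent, pct in mix_pct.items()}
    leftover = total - sum(counts.values())
    # Give the remaining slots to the intents with the largest remainders.
    by_remainder = sorted(
        mix_pct, key=lambda intent: (total * mix_pct[intent]) % 100, reverse=True
    )
    for intent in by_remainder[:leftover]:
        counts[intent] += 1
    return counts


if __name__ == "__main__":
    # Allocation for the 5,000-ticket prompt test set.
    print(allocate_by_intent(5000))
```

For the 5,000-ticket test set the percentages divide evenly, so no rounding is needed; the remainder step only matters for smaller samples, such as a few hundred tickets drawn for human review.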
A good solution should increase factual correctness and policy compliance while reducing hallucinated refund promises. The target is at least a 15% improvement in human preference score over the current baseline prompt, with p95 latency under 2 seconds and stable behavior across the major ticket types.
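The two numeric targets can be checked mechanically once preference scores and latency samples are available. The sketch below rests on two assumptions the brief does not settle: the 15% is read as a relative gain over the baseline score, and p95 is computed with the nearest-rank method. The function names `relative_gain`, `p95`, and `meets_targets` are hypothetical.

```python
def relative_gain(baseline, candidate):
    """Relative improvement of `candidate` over `baseline` (0.15 means 15%)."""
    if baseline <= 0:
        raise ValueError("baseline score must be positive")
    return (candidate - baseline) / baseline


def p95(latencies_s):
    """95th percentile by nearest rank (an assumed convention)."""
    if not latencies_s:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_s)
    # 1-based rank ceil(0.95 * n), computed in integers to avoid float error.
    rank = -(-95 * len(ordered) // 100)
    return ordered[rank - 1]


def meets_targets(baseline_pref, candidate_pref, latencies_s,
                  min_gain=0.15, max_p95_s=2.0):
    """Report whether each numeric target from the brief is met."""
    return {
        "preference_gain_ok": relative_gain(baseline_pref, candidate_pref) >= min_gain,
        "latency_ok": p95(latencies_s) < max_p95_s,
    }


if __name__ == "__main__":
    # Made-up numbers, for illustration only.
    print(meets_targets(0.50, 0.60, [0.8] * 19 + [3.0]))
```

To probe stability across ticket types, the same check could be run separately on each intent's tickets rather than only on the pooled test set, since a pooled gain can hide a regression on a smaller intent.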