Business Context
HelpFlow, a SaaS customer support platform, uses a large language model to draft agent-assist replies for inbound support tickets. The team wants to understand and control the temperature parameter so responses remain accurate for billing and policy questions while still sounding natural for conversational requests.
Data
You are given 180,000 historical support prompts and reference replies from email and chat channels.
- Volume: 180K prompt-response pairs, plus 5K manually reviewed prompts for generation experiments
- Text length: 20-600 tokens per prompt; median 110 tokens
- Language: English only
- Prompt types: billing (28%), account access (24%), technical troubleshooting (31%), general product questions (17%)
- Evaluation-set labels: reviewers annotated each generation as deterministic-safe, mildly creative, or unacceptable
The task is not to train a new base LLM, but to design an evaluation workflow that explains the role of temperature in token sampling and identifies an appropriate setting for this domain.
Success Criteria
A good solution clearly explains how temperature changes the next-token probability distribution, demonstrates the trade-off between determinism and diversity, and recommends temperature ranges by use case. The chosen setting should keep factual-support accuracy high while avoiding repetitive or robotic outputs.
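For reference, temperature divides the model's logits before the softmax, so low values sharpen the next-token distribution toward the top token and high values flatten it toward uniform. A minimal sketch of that mechanism, using made-up logits for a toy four-token vocabulary:

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    """Convert raw logits into next-token probabilities at a given temperature."""
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()          # subtract max for numerical stability
    probs = np.exp(scaled)
    return probs / probs.sum()

# Toy logits, illustrative only.
logits = [4.0, 2.5, 1.0, 0.2]
for t in (0.2, 0.7, 1.0, 1.5):
    print(t, np.round(softmax_with_temperature(logits, t), 3))
# Low temperature concentrates probability mass on the top token (near-deterministic);
# high temperature spreads it out (more diverse, but riskier for factual replies).
```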
Constraints
- Inference latency must stay under 800 ms per request
- The solution must run on a single GPU-backed inference service
- Billing and compliance responses should minimize hallucinations
- The evaluation should be reproducible across runs
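For the reproducibility constraint, the usual approach is to fix every source of randomness before sampling. A minimal sketch, assuming a PyTorch-based inference stack (an assumption about the serving setup, not a confirmed detail):

```python
import random

import numpy as np
import torch

def set_seed(seed: int = 42) -> None:
    """Fix common sources of randomness so sampled outputs can be replayed across runs."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)
```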
Requirements
- Explain the role of temperature in LLM text generation.
- Build a Python experiment comparing outputs at multiple temperature values (see the generation sketch after this list).
- Preprocess support prompts before generation (see the preprocessing sketch after this list).
- Evaluate diversity, consistency, and task quality across settings (see the metrics sketch after this list).
- Recommend production temperature defaults for different support scenarios.
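The sketches below are illustrations under stated assumptions, not a prescribed implementation. First, one possible preprocessing step; the redaction patterns and token cap are hypothetical choices, not HelpFlow policy:

```python
import re

MAX_PROMPT_TOKENS = 600  # matches the upper bound observed in the data

def preprocess_prompt(text: str) -> str:
    """Normalize a raw support prompt before it is sent to the model."""
    text = text.strip()
    text = re.sub(r"\s+", " ", text)                            # collapse whitespace
    text = re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", "<EMAIL>", text)  # redact email addresses
    text = re.sub(r"\b\d{12,19}\b", "<CARD>", text)             # redact long card-like numbers
    # Crude cap using whitespace tokens; a real pipeline would count with the model tokenizer.
    tokens = text.split()
    return " ".join(tokens[:MAX_PROMPT_TOKENS])
```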
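Next, a minimal generation loop comparing temperatures, assuming a Hugging Face transformers causal LM on the single-GPU service; the model name is a placeholder and `reviewed_prompts` stands for the 5K manually reviewed subset:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "your-org/helpflow-assist-model"  # placeholder; substitute the actual model
TEMPERATURES = [0.2, 0.5, 0.7, 1.0, 1.3]

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.float16).to("cuda")

def generate(prompt: str, temperature: float, n_samples: int = 5) -> list[str]:
    """Sample several replies for one prompt at a fixed temperature."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        do_sample=True,
        temperature=temperature,
        max_new_tokens=200,
        num_return_sequences=n_samples,
    )
    # Strip the prompt tokens so only the generated reply remains.
    prompt_len = inputs["input_ids"].shape[1]
    return [tokenizer.decode(o[prompt_len:], skip_special_tokens=True) for o in outputs]

# Usage: collect samples per (prompt, temperature) pair for the evaluation step.
# results = {t: {p: generate(p, t) for p in reviewed_prompts} for t in TEMPERATURES}
```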
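Finally, simple proxies for diversity and consistency over the repeated samples; distinct-n and pairwise token overlap are standard heuristics, while task quality would still come from the reviewer labels:

```python
from itertools import combinations

def distinct_n(texts: list[str], n: int = 2) -> float:
    """Fraction of unique n-grams across a set of samples (higher = more diverse)."""
    ngrams = []
    for t in texts:
        tokens = t.split()
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

def pairwise_consistency(texts: list[str]) -> float:
    """Mean Jaccard overlap of token sets between sample pairs (higher = more consistent)."""
    pairs = list(combinations(texts, 2))
    if not pairs:
        return 1.0
    scores = []
    for a, b in pairs:
        sa, sb = set(a.split()), set(b.split())
        scores.append(len(sa & sb) / len(sa | sb) if sa | sb else 1.0)
    return sum(scores) / len(scores)
```

Plotting these two scores against temperature, alongside reviewer quality labels, makes the determinism-versus-diversity trade-off concrete and supports per-scenario default recommendations.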