Northstar Systems uses an internal LLM assistant for HR, IT, and finance workflows such as ticket routing, policy Q&A, and email drafting. Prompt quality is inconsistent across teams, so you need to design and evaluate a prompt optimization pipeline for task-specific enterprise use cases.
You have 180,000 historical prompt-response pairs collected from internal usage logs across 12 tasks. Inputs range from 20 to 1,200 words, with a median of 180 words. Text is primarily English, but about 9% of examples include mixed-language content, copied email threads, bullet lists, tables, or internal acronyms. Human preference labels are available for 35,000 examples, with pairwise rankings and task-specific quality annotations such as factuality, format compliance, and actionability. Label quality is uneven across teams.
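Because label quality varies by team, it helps to pin down a record schema for the pairwise preference data and to quantify per-team annotator agreement before training on it. The sketch below is one way to do this, assuming duplicate annotations exist for a subset of pairs; the field names, the 1-5 rubric scale, and the `per_team_agreement` helper are all hypothetical, not part of the brief.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class PreferencePair:
    task: str          # one of the 12 internal tasks
    team: str          # annotating team, used for per-team quality checks
    prompt: str
    response_a: str
    response_b: str
    winner: str        # "a" or "b", the pairwise ranking
    factuality: int    # hypothetical 1-5 rubric score
    format_ok: bool    # format-compliance annotation

def per_team_agreement(duplicates):
    """Cohen's kappa per team over doubly-annotated pairs.

    `duplicates` maps team -> list of (label_1, label_2) tuples where
    the same pair was independently ranked by two annotators.
    """
    kappas = {}
    for team, pairs in duplicates.items():
        n = len(pairs)
        if n == 0:
            continue
        observed = sum(1 for a, b in pairs if a == b) / n
        # chance agreement from each annotator's marginal label rates
        c1 = Counter(a for a, _ in pairs)
        c2 = Counter(b for _, b in pairs)
        expected = sum((c1[k] / n) * (c2[k] / n) for k in set(c1) | set(c2))
        kappas[team] = 1.0 if expected == 1.0 else (observed - expected) / (1 - expected)
    return kappas
```

Teams whose kappa falls below a chosen floor (say 0.4) could have their labels down-weighted or re-annotated rather than discarded outright.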
A good solution improves the task success rate by at least 12% over the current baseline prompts, reduces the rate of invalid or off-format outputs to below 3%, and keeps median inference latency under 2 seconds. The system should generalize across tasks without requiring full model fine-tuning for every workflow.
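The three acceptance criteria above can be encoded as a single evaluation gate so that candidate prompt variants are compared on the same terms. This is a minimal sketch; the `meets_targets` name is hypothetical, and reading "12%" as a relative improvement over the baseline success rate (rather than 12 percentage points absolute) is an assumption that should be confirmed with stakeholders.

```python
from statistics import median

def meets_targets(baseline_success: float,
                  new_success: float,
                  off_format_rate: float,
                  latencies_ms: list[float]) -> bool:
    """Check the three acceptance criteria from the brief.

    Assumes "12%" means relative improvement over the baseline
    success rate; thresholds otherwise follow the brief directly.
    """
    improvement = (new_success - baseline_success) / baseline_success
    return (improvement >= 0.12                  # >= 12% relative lift
            and off_format_rate < 0.03           # < 3% invalid/off-format
            and median(latencies_ms) < 2000)     # median latency < 2 s
```

Keeping the gate as one function makes it easy to run per task, which matters here because the system must clear the bar across all 12 workflows, not just in aggregate.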