You are building an AI assistant for an enterprise support platform that handles thousands of daily requests across ticket summarization, intent classification, entity extraction, and response drafting. Different customer segments, product lines, and compliance settings require different prompts. Prompt changes are currently managed manually, which produces inconsistent outputs and makes debugging difficult. You have historical prompts, model outputs, human QA annotations, and ticket metadata, but labels are incomplete, and prompt performance shifts as policies and product terminology change. The system must support rapid iteration, prompt versioning, offline evaluation, and controlled rollout without breaking downstream NLP workflows.
How would you design a prompt engineering system that can operate at scale across these NLP tasks, including how prompts are authored, selected, evaluated, monitored, and improved over time? Explain the architecture, preprocessing, experimentation approach, and how you would handle failure modes such as drift, hallucinations, and prompt regressions.
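As a starting point for the versioning and controlled-rollout requirements, here is a minimal sketch of a prompt registry. It is an illustration under stated assumptions, not a prescribed design: the names `PromptVersion`, `PromptRegistry`, `register`, `set_rollout`, and `select` are hypothetical, prompts are assumed to be keyed by task and customer segment with a `"default"` segment as fallback, and canary traffic is assumed to be split by hashing a ticket ID into 100 buckets so that the same ticket always receives the same prompt version.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass(frozen=True)
class PromptVersion:
    """One immutable prompt version (hypothetical schema)."""
    task: str       # e.g. "intent_classification"
    segment: str    # e.g. "enterprise", or "default" as the fallback
    version: int    # 1-based, increases with each registration
    template: str   # str.format-style template text


@dataclass
class PromptRegistry:
    """In-memory registry; a real system would back this with a database."""
    # (task, segment) -> list of PromptVersion, oldest first
    _versions: dict = field(default_factory=dict)
    # (task, segment) -> (stable_version, canary_version or None, canary_percent)
    _rollouts: dict = field(default_factory=dict)

    def register(self, task, segment, template):
        """Append a new version; earlier versions are never overwritten."""
        key = (task, segment)
        history = self._versions.setdefault(key, [])
        pv = PromptVersion(task, segment, len(history) + 1, template)
        history.append(pv)
        # The first registered version becomes the stable one by default.
        self._rollouts.setdefault(key, (pv.version, None, 0))
        return pv

    def set_rollout(self, task, segment, stable, canary=None, canary_percent=0):
        """Choose the stable version and, optionally, a canary traffic share."""
        if not 0 <= canary_percent <= 100:
            raise ValueError("canary_percent must be between 0 and 100")
        known = {v.version for v in self._versions[(task, segment)]}
        for v in (stable, canary):
            if v is not None and v not in known:
                raise ValueError(f"unknown version {v}")
        self._rollouts[(task, segment)] = (stable, canary, canary_percent)

    def select(self, task, segment, ticket_id):
        """Pick a version deterministically for a given ticket."""
        key = (task, segment)
        if key not in self._versions:
            key = (task, "default")  # fall back to the default segment
        stable, canary, percent = self._rollouts[key]
        digest = hashlib.sha256(f"{task}:{ticket_id}".encode()).hexdigest()
        bucket = int(digest, 16) % 100  # stable bucket in the range 0..99
        chosen = canary if canary is not None and bucket < percent else stable
        return self._versions[key][chosen - 1]


# Usage: register two versions, then send 10% of tickets to the candidate.
registry = PromptRegistry()
registry.register("intent_classification", "default", "Classify the intent: {ticket}")
registry.register("intent_classification", "default", "Label the customer intent: {ticket}")
registry.set_rollout("intent_classification", "default", stable=1, canary=2, canary_percent=10)
prompt = registry.select("intent_classification", "enterprise", ticket_id="T-1001")
rendered = prompt.template.format(ticket="My invoice is wrong.")
```

Keeping versions append-only and logging the selected `version` with every model output is what makes regressions traceable and rollback a one-line change to the rollout entry; offline evaluation can then compare versions on the same annotated tickets before the canary share is raised.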