You are building an NLP workflow for an enterprise support platform that uses an LLM to classify incoming tickets, extract structured fields, and draft agent responses from noisy customer text. You have about 300,000 historical tickets with partial labels, frequent taxonomy changes, long-tail issue types, and a mix of short chat messages and multi-paragraph email threads. The product team wants fast iteration on new workflows, while operations needs stable outputs for high-volume queues where formatting and label consistency matter. You can use prompt engineering, retrieval over internal policy documents, and supervised fine-tuning, but label quality varies across tasks.
How would you decide whether to rely on prompt engineering alone or invest in fine-tuning for each part of this workflow, and how would you design the NLP system, preprocessing pipeline, experimentation plan, and evaluation process to support that decision over time?