
Write a small evaluation runner in Python that executes an agent over a dataset of prompts, captures outputs and latency, computes simple task metrics (for example exact-match or rubric-based pass/fail), and logs per-example plus aggregate results to MLflow. The runner should be modular so that the same interface could later support additional metrics such as faithfulness, groundedness, or LLM-as-Judge. After the coding portion, briefly explain in comments or docstring why Databricks is a strong platform for this workflow, referencing Mosaic AI, unified governance via Unity Catalog, model serving, and agentic system evaluation. Expected solution outline: define an evaluator loop, metric function registry, MLflow logging structure, robust exception handling, and concise reasoning about the Databricks platform advantage for GenAI productionization.
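The outline above (evaluator loop, metric registry, MLflow logging, exception handling) could be sketched roughly as follows. This is a minimal illustration, not a reference solution. The names `METRICS`, `register_metric`, `ExampleResult`, and `run_eval`, the dataset shape (a list of dicts with `prompt` and `expected` keys), and the artifact filename are all assumptions made for this example. MLflow is imported optionally, so the loop can still run where it is not installed.

```python
import time
from dataclasses import asdict, dataclass, field
from typing import Callable, Dict, List, Optional, Sequence, Tuple

try:
    import mlflow  # optional: logging is skipped when MLflow is unavailable
except ImportError:
    mlflow = None

# Hypothetical metric registry: name -> fn(output, expected) -> score in [0, 1].
# Later metrics (faithfulness, LLM-as-Judge, ...) would register the same way.
METRICS: Dict[str, Callable[[str, str], float]] = {}


def register_metric(name: str):
    """Decorator that adds a metric function to the registry under `name`."""
    def decorator(fn):
        METRICS[name] = fn
        return fn
    return decorator


@register_metric("exact_match")
def exact_match(output: str, expected: str) -> float:
    # Case- and whitespace-insensitive comparison (an assumed normalization).
    return float(output.strip().lower() == expected.strip().lower())


@dataclass
class ExampleResult:
    prompt: str
    expected: str
    output: Optional[str]
    latency_s: float
    error: Optional[str] = None
    scores: Dict[str, float] = field(default_factory=dict)


def run_eval(
    agent: Callable[[str], str],
    dataset: Sequence[Dict[str, str]],
    metric_names: Sequence[str] = ("exact_match",),
    log_to_mlflow: bool = False,
) -> Tuple[List[ExampleResult], Dict[str, float]]:
    """Run `agent` over `dataset`, score each output, and aggregate."""
    missing = [m for m in metric_names if m not in METRICS]
    if missing:
        raise KeyError(f"unregistered metrics: {missing}")

    results: List[ExampleResult] = []
    for row in dataset:
        output, error = None, None
        start = time.perf_counter()
        try:
            output = str(agent(row["prompt"]))
        except Exception as exc:  # one failing example must not stop the run
            error = f"{type(exc).__name__}: {exc}"
        latency = time.perf_counter() - start

        agent_failed = error is not None
        scores: Dict[str, float] = {}
        for name in metric_names:
            try:
                # A failed agent call scores 0.0 on every metric.
                scores[name] = 0.0 if agent_failed else float(
                    METRICS[name](output, row["expected"]))
            except Exception as exc:
                scores[name] = 0.0
                error = error or f"metric {name}: {type(exc).__name__}: {exc}"
        results.append(ExampleResult(row["prompt"], row["expected"],
                                     output, latency, error, scores))

    aggregate: Dict[str, float] = {}
    n = len(results)
    if n:
        for name in metric_names:
            aggregate[f"mean_{name}"] = sum(r.scores[name] for r in results) / n
        aggregate["error_rate"] = sum(r.error is not None for r in results) / n
        aggregate["mean_latency_s"] = sum(r.latency_s for r in results) / n

    if log_to_mlflow and mlflow is not None:
        with mlflow.start_run():
            mlflow.log_metrics(aggregate)  # aggregate results as run metrics
            # Per-example results stored as a JSON artifact (assumed filename).
            mlflow.log_dict({"examples": [asdict(r) for r in results]},
                            "per_example_results.json")
    return results, aggregate
```

Calling `run_eval(my_agent, rows, log_to_mlflow=True)` with any callable `my_agent` would return the per-example records and the aggregate dictionary, and would write both to a single MLflow run. Keeping metrics behind a name-to-function registry means a rubric-based or judge-based scorer could be added without touching the loop. The Databricks rationale requested in the task would belong in the module docstring and is not drafted here.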