You are building an evaluation harness for two chatbot variants served through Databricks Model Serving. Implement a Python evaluator that takes (prompt, candidate_a, candidate_b, rubric) and returns a structured judgment with a winner, a confidence score, and a rationale, produced by an LLM-as-a-judge call. The evaluator must support deterministic prompt templating, retry logic, result caching keyed by normalized inputs, and aggregation of pairwise results into per-model win rates over a dataset.

In your explanation, discuss prompt leakage risks, judge bias (e.g., position, verbosity, and self-preference effects), and how you would wire the harness into MLflow Agent Evaluation on Databricks.

Expected solution outline: define a judge prompt, wrap a Foundation Model API / DBRX call, parse the structured output robustly, cache responses in a dictionary or a persistent map, and compute summary metrics from the pairwise comparisons. Illustrative sketches of each step follow.
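A minimal sketch of the deterministic templating and the normalized cache key might look like the following; the JUDGE_TEMPLATE wording and the whitespace-based normalization are illustrative assumptions, not a fixed spec. Rendering is a pure function of its inputs, so identical inputs always produce identical judge prompts and cache keys.

```python
# Sketch: deterministic judge prompt and cache key. The template text and
# normalization rules below are assumptions chosen for illustration.
import hashlib
import json

JUDGE_TEMPLATE = """You are an impartial judge. Compare two answers to the same prompt.

Rubric:
{rubric}

Prompt:
{prompt}

Answer A:
{candidate_a}

Answer B:
{candidate_b}

Respond with ONLY a JSON object:
{{"winner": "A" | "B" | "tie", "confidence": <0.0-1.0>, "rationale": "<one sentence>"}}"""


def render_judge_prompt(prompt: str, candidate_a: str, candidate_b: str, rubric: str) -> str:
    """Pure function of its inputs: same inputs, same prompt, every time."""
    return JUDGE_TEMPLATE.format(
        rubric=rubric.strip(),
        prompt=prompt.strip(),
        candidate_a=candidate_a.strip(),
        candidate_b=candidate_b.strip(),
    )


def cache_key(prompt: str, candidate_a: str, candidate_b: str, rubric: str) -> str:
    """Key on whitespace-normalized inputs so trivial formatting changes still hit the cache."""
    normalized = json.dumps(
        [" ".join(s.split()) for s in (prompt, candidate_a, candidate_b, rubric)],
        ensure_ascii=False,
    )
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()
```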
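One way to wrap the model call is via the OpenAI-compatible interface that Databricks Foundation Model APIs expose; in this sketch the workspace URL, token handling, and the "databricks-dbrx-instruct" endpoint name are placeholders to adapt. The parser tolerates prose or code fences around the JSON verdict rather than trusting the model to emit bare JSON.

```python
# Sketch: retry-wrapped judge call against an OpenAI-compatible Databricks
# serving endpoint, plus robust verdict parsing. URL, token, and model name
# are placeholders, not working values.
import json
import re
import time
from openai import OpenAI

client = OpenAI(
    api_key="<DATABRICKS_TOKEN>",                       # placeholder: real PAT or SDK auth
    base_url="https://<workspace>/serving-endpoints",   # placeholder workspace URL
)


def call_judge(judge_prompt: str, max_retries: int = 3) -> dict:
    """Call the judge model with exponential backoff, then parse its JSON verdict."""
    last_err = None
    for attempt in range(max_retries):
        try:
            resp = client.chat.completions.create(
                model="databricks-dbrx-instruct",
                messages=[{"role": "user", "content": judge_prompt}],
                temperature=0.0,   # as deterministic as the endpoint allows
            )
            return parse_verdict(resp.choices[0].message.content)
        except Exception as err:   # transient HTTP failures and malformed outputs alike
            last_err = err
            time.sleep(2 ** attempt)
    raise RuntimeError(f"judge failed after {max_retries} attempts") from last_err


def parse_verdict(text: str) -> dict:
    """Extract the first JSON object even if the model wraps it in prose or fences."""
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if match is None:
        raise ValueError(f"no JSON object in judge output: {text!r}")
    verdict = json.loads(match.group(0))
    if verdict.get("winner") not in {"A", "B", "tie"}:
        raise ValueError(f"unexpected winner field: {verdict!r}")
    verdict["confidence"] = float(verdict.get("confidence", 0.0))
    return verdict
```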
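Caching and aggregation can then reuse the helpers sketched above. The in-memory dict cache and the even split of ties are assumptions; a persistent map (shelve, or a Delta table on Databricks) and tie-dropping are common alternatives.

```python
# Sketch: cached pairwise evaluation and win-rate aggregation, reusing
# render_judge_prompt, cache_key, and call_judge from the sketches above.
# The dict cache and tie handling are illustrative choices.
from collections import Counter

_cache: dict[str, dict] = {}


def judge_pair(prompt: str, candidate_a: str, candidate_b: str, rubric: str) -> dict:
    """Return the cached verdict if we have judged these normalized inputs before."""
    key = cache_key(prompt, candidate_a, candidate_b, rubric)
    if key not in _cache:
        _cache[key] = call_judge(render_judge_prompt(prompt, candidate_a, candidate_b, rubric))
    return _cache[key]


def win_rates(dataset: list[dict]) -> dict:
    """Rows are {"prompt": ..., "candidate_a": ..., "candidate_b": ..., "rubric": ...}."""
    tally = Counter()
    for row in dataset:
        verdict = judge_pair(row["prompt"], row["candidate_a"],
                             row["candidate_b"], row["rubric"])
        tally[verdict["winner"]] += 1
    n = sum(tally.values()) or 1
    # Ties are split evenly here; dropping them is a common alternative.
    return {
        "model_a_win_rate": (tally["A"] + 0.5 * tally["tie"]) / n,
        "model_b_win_rate": (tally["B"] + 0.5 * tally["tie"]) / n,
    }
```

To counter position bias, each pair can be judged twice with the candidates swapped and the two verdicts reconciled; disagreement between the two orderings is itself a useful signal of low judge confidence.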
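Finally, a hedged sketch of surfacing the results in MLflow: log_metric and log_table are standard MLflow APIs, while wiring into Mosaic AI Agent Evaluation proper would go through mlflow.evaluate with model_type="databricks-agent", per the Databricks documentation. The run name and artifact file name below are illustrative.

```python
# Sketch: log win rates as run metrics and per-example verdicts as a table
# artifact, reusing judge_pair from the sketch above.
import mlflow
import pandas as pd


def log_results(dataset: list[dict], rates: dict) -> None:
    """Record summary metrics plus a per-example verdict table for inspection."""
    with mlflow.start_run(run_name="pairwise-judge-eval"):
        for name, value in rates.items():
            mlflow.log_metric(name, value)
        rows = []
        for row in dataset:
            verdict = judge_pair(row["prompt"], row["candidate_a"],
                                 row["candidate_b"], row["rubric"])
            rows.append({"prompt": row["prompt"], **verdict})
        mlflow.log_table(pd.DataFrame(rows), artifact_file="pairwise_verdicts.json")
```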