You are given a list of evaluation examples for an LLM application. Each example contains a prompt, a reference answer, a model answer, and a binary human label indicating whether the answer is acceptable.

Implement code to compute:
- exact match,
- token-level F1,
- acceptance rate, and
- the confusion matrix for a rule-based evaluator that predicts accept/reject using a configurable lexical-overlap threshold.

Then write a function that sweeps thresholds and returns the threshold with the best F1 against the human labels. Finally, discuss one limitation of these metrics for evaluating open-ended model behavior.
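One possible solution is sketched below. It assumes whitespace tokenization with lowercase normalization, represents examples with a hypothetical `Example` dataclass (field names are illustrative), and uses token-level F1 between the model answer and the reference as the lexical-overlap score for the rule-based evaluator:

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Example:
    prompt: str
    reference: str
    answer: str
    label: int  # 1 = human judged acceptable, 0 = not

def _tokens(text: str) -> List[str]:
    # Naive normalization: lowercase + whitespace split.
    return text.lower().split()

def exact_match(answer: str, reference: str) -> bool:
    return answer.strip().lower() == reference.strip().lower()

def token_f1(answer: str, reference: str) -> float:
    a, r = Counter(_tokens(answer)), Counter(_tokens(reference))
    overlap = sum((a & r).values())  # multiset intersection
    if overlap == 0:
        return 0.0
    precision = overlap / sum(a.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)

def acceptance_rate(examples: List[Example]) -> float:
    return sum(e.label for e in examples) / len(examples)

def rule_predict(e: Example, threshold: float) -> int:
    # Accept iff lexical overlap (token F1) meets the threshold.
    return int(token_f1(e.answer, e.reference) >= threshold)

def confusion_matrix(examples: List[Example],
                     threshold: float) -> Tuple[int, int, int, int]:
    # Returns (tp, fp, fn, tn) with "accept" as the positive class.
    tp = fp = fn = tn = 0
    for e in examples:
        pred = rule_predict(e, threshold)
        if pred and e.label:
            tp += 1
        elif pred and not e.label:
            fp += 1
        elif not pred and e.label:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def evaluator_f1(examples: List[Example], threshold: float) -> float:
    # F1 of the rule-based evaluator's accept/reject predictions
    # against the binary human labels.
    tp, fp, fn, _ = confusion_matrix(examples, threshold)
    if tp == 0:
        return 0.0
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return 2 * p * r / (p + r)

def best_threshold(examples: List[Example],
                   thresholds: Optional[List[float]] = None) -> float:
    # Sweep candidate thresholds; default grid is 0.0, 0.05, ..., 1.0.
    if thresholds is None:
        thresholds = [i / 20 for i in range(21)]
    return max(thresholds, key=lambda t: evaluator_f1(examples, t))
```

On the discussion point: all of these metrics reward lexical overlap with a single reference, so for open-ended prompts a paraphrased or differently-worded answer that humans would accept can score near zero, while a fluent answer that copies reference wording but is wrong in substance can score high.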