Microsoft Teams uses a binary classifier to detect abusive chat messages and route high-risk ones to automated enforcement or human review. The model, a fine-tuned gradient-boosting classifier, outputs a probability score; the team is debating which decision threshold to use before rolling it out to all enterprise tenants.
Validation set size: 200,000 messages, of which 4,000 are abusive (2.0% prevalence) and 196,000 are non-abusive.
| Threshold | Precision | Recall | F1 | False Positive Rate | Messages Flagged | True Positives | False Positives | False Negatives |
|---|---|---|---|---|---|---|---|---|
| 0.30 | 0.32 | 0.90 | 0.47 | 3.90% | 11,250 | 3,600 | 7,650 | 400 |
| 0.50 | 0.51 | 0.78 | 0.62 | 1.53% | 6,118 | 3,120 | 2,998 | 880 |
| 0.70 | 0.71 | 0.55 | 0.62 | 0.46% | 3,099 | 2,200 | 899 | 1,800 |
| 0.85 | 0.84 | 0.31 | 0.45 | 0.12% | 1,479 | 1,240 | 239 | 2,760 |

(False Positive Rate is computed as FP / 196,000 non-abusive messages.)
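Each row of the table is fully determined by the TP and FP counts plus the fixed class totals (4,000 positives, 196,000 negatives). A minimal sketch for recomputing an operating point from those counts, shown here for the 0.50 row:

```python
# Derive precision, recall, F1, and FPR from raw confusion counts.
# Class totals come from the validation set described above.
TOTAL_POS = 4_000    # abusive messages
TOTAL_NEG = 196_000  # non-abusive messages

def operating_point(tp: int, fp: int) -> dict:
    """Compute threshold metrics from true-positive and false-positive counts."""
    fn = TOTAL_POS - tp
    precision = tp / (tp + fp)
    recall = tp / TOTAL_POS
    f1 = 2 * precision * recall / (precision + recall)
    fpr = fp / TOTAL_NEG
    return {"precision": precision, "recall": recall, "f1": f1,
            "fpr": fpr, "fn": fn, "flagged": tp + fp}

# Threshold 0.50 row: TP = 3,120, FP = 2,998
m = operating_point(3_120, 2_998)
print({k: round(v, 4) if isinstance(v, float) else v for k, v in m.items()})
```

Running the same function over the other rows is a quick consistency check before basing a rollout decision on the table.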
Additional model metrics on the same validation set:
| Metric | Value |
|---|---|
| AUC-ROC | 0.93 |
| PR-AUC | 0.68 |
| Log Loss | 0.118 |
| Brier Score | 0.072 |
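Log loss and Brier score are both computed from the raw probability outputs rather than any single threshold, so they capture calibration quality. A self-contained sketch of the two definitions (the toy labels and probabilities are illustrative, not from the validation set):

```python
import math

def log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood of the true labels under the predictions."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)

def brier_score(y_true, p_pred):
    """Mean squared error between predicted probability and binary outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, p_pred)) / len(y_true)

# Illustrative example: two messages, one abusive (1), one benign (0).
y = [1, 0]
p = [0.8, 0.2]
print(log_loss(y, p), brier_score(y, p))
```

Both metrics reward probabilities that are close to the true labels; log loss penalizes confident mistakes far more heavily than Brier score does.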
The Trust & Safety team wants high recall to reduce harmful content exposure, while enterprise customers are sensitive to false positives because incorrect enforcement can block legitimate workplace communication. You need to recommend an operating threshold and explain the tradeoff.