CivitasAI is preparing to deploy a text classification model that flags harmful user prompts before they reach a generative assistant. The model is a fine-tuned transformer, trained as a binary classifier to detect harassment, self-harm encouragement, violent threats, and requests for instructions to commit illegal acts.
An offline evaluation on a 50,000-example holdout set shows strong aggregate accuracy, but the Trust & Safety team is concerned that the rate of missed harmful prompts remains too high for a production launch.
| Metric | Holdout Set (n = 50,000) | Launch Target |
|---|---|---|
| Accuracy | 0.947 | >= 0.940 |
| Precision | 0.781 | >= 0.800 |
| Recall | 0.684 | >= 0.850 |
| F1 Score | 0.729 | >= 0.820 |
| AUC-ROC | 0.912 | >= 0.900 |
| False Positive Rate | 0.041 | <= 0.030 |
| Harmful prevalence | 12.0% | n/a |
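For reference, the aggregate metrics above can be recomputed directly from the holdout predictions. The sketch below is illustrative rather than CivitasAI's actual evaluation code: the array names and file paths (`y_true`, `y_score`, `holdout_labels.npy`) are assumptions, and it applies scikit-learn at the default 0.5 decision threshold.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    roc_auc_score,
    confusion_matrix,
)

# Hypothetical holdout arrays: y_true holds binary labels (1 = harmful),
# y_score holds the classifier's predicted probability of "harmful".
y_true = np.load("holdout_labels.npy")   # shape (50_000,)
y_score = np.load("holdout_scores.npy")  # shape (50_000,)

# Binarize at the default 0.5 decision threshold.
y_pred = (y_score >= 0.5).astype(int)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

print(f"Accuracy  : {accuracy_score(y_true, y_pred):.3f}")
print(f"Precision : {precision_score(y_true, y_pred):.3f}")
print(f"Recall    : {recall_score(y_true, y_pred):.3f}")
print(f"F1 Score  : {f1_score(y_true, y_pred):.3f}")
print(f"AUC-ROC   : {roc_auc_score(y_true, y_score):.3f}")
print(f"FPR       : {fp / (fp + tn):.3f}")
print(f"Prevalence: {y_true.mean():.3f}")
```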
The model clears its accuracy and AUC-ROC targets, but recall (0.684) falls well short of the 0.850 launch threshold, and precision, F1, and false positive rate also miss their marks. Missing a harmful prompt is considered more costly than over-blocking a benign one, especially in high-risk categories such as self-harm and explicit violence.
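Because AUC-ROC is reported, the classifier must emit a score rather than only a hard label, so the most direct lever for raising recall is the decision threshold: lowering it trades more benign over-blocking for fewer harmful misses. The sweep below is a sketch of that trade-off, not the team's chosen remediation; it reuses the assumed `y_true` / `y_score` arrays from above and finds the highest threshold that still meets the 0.850 recall target.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, confusion_matrix

y_true = np.load("holdout_labels.npy")   # assumed label array, as above
y_score = np.load("holdout_scores.npy")  # assumed score array, as above

# precision_recall_curve returns one threshold per (precision, recall)
# pair; recall falls monotonically as the threshold rises.
precision, recall, thresholds = precision_recall_curve(y_true, y_score)

# recall[:-1] aligns index-for-index with thresholds; select the highest
# threshold that still achieves the launch recall target.
target_recall = 0.850
meets = recall[:-1] >= target_recall
if not meets.any():
    raise ValueError("No threshold reaches the recall target on this holdout.")
best = thresholds[meets].max()

# Re-score the holdout at the adjusted threshold and report the cost.
y_pred = (y_score >= best).astype(int)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"Threshold : {best:.3f}")
print(f"Recall    : {tp / (tp + fn):.3f}")
print(f"Precision : {tp / (tp + fp):.3f}")
print(f"FPR       : {fp / (fp + tn):.3f}")
```

Note that moving the threshold only slides the operating point along the existing ROC curve: any recall gain comes at the cost of a higher false positive rate, which already exceeds its 0.030 ceiling. If no single threshold satisfies both constraints, the model itself (training data, class balance, per-category coverage) needs improvement rather than recalibration alone.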