CodeShield runs a static application security testing (SAST) model that classifies code findings as either actionable vulnerabilities or benign. Security engineers report that too many alerts are false positives, which leads developers to ignore the tool and delays releases.
The model was evaluated on a labeled validation set of 12,000 findings from Java, Python, and JavaScript repositories.
| Metric | Current Model | Previous Model | Change |
|---|---|---|---|
| Precision | 0.41 | 0.58 | -0.17 |
| Recall | 0.86 | 0.74 | +0.12 |
| F1 Score | 0.56 | 0.65 | -0.09 |
| Accuracy | 0.78 | 0.84 | -0.06 |
| False Positive Rate | 0.19 | 0.09 | +0.10 |
| Alerts per 1,000 PRs | 320 | 190 | +130 |
| Developer dismissal rate | 61% | 38% | +23 pts |
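Under the standard confusion-matrix definitions, the headline metrics in the table all derive from four counts: true positives, false positives, false negatives, and true negatives. A minimal sketch of those derivations (the counts below are hypothetical, chosen only to roughly reproduce the current model's precision and recall for illustration; they are not reverse-engineered from the full table):

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Derive the report's metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)           # of alerts raised, fraction truly actionable
    recall = tp / (tp + fn)              # of true vulnerabilities, fraction caught
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    fpr = fp / (fp + tn)                 # benign findings incorrectly flagged
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy, "fpr": fpr}

# Hypothetical counts for illustration only: 1,000 true vulnerabilities,
# 11,000 benign findings, tuned so precision ~ 0.41 and recall = 0.86.
m = classification_metrics(tp=860, fp=1238, fn=140, tn=9762)
```

Note how precision and false positive rate move together: every additional false positive both dilutes precision and raises the FPR, which is why the alert volume and dismissal rate in the table track the precision drop.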
The security team values high recall because missed vulnerabilities are costly. However, the current false-positive volume is overwhelming triage capacity and eroding trust in the tool. You need to evaluate whether the model is acceptable, diagnose where false positives are concentrated, and recommend how to reduce them without sharply increasing false negatives.
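One standard lever for this tradeoff is raising the alerting threshold on the model's confidence score: sweep candidate thresholds on the labeled validation set and pick the one that maximizes precision subject to a recall floor. A sketch of that selection, assuming the model exposes a per-finding score (the scores, labels, and `min_recall` target below are hypothetical, not values from the report):

```python
def best_threshold(scores, labels, min_recall=0.80):
    """Return (threshold, precision, recall) for the candidate threshold
    with the highest precision whose recall stays at or above min_recall.
    labels: 1 = actionable vulnerability, 0 = benign finding."""
    total_pos = sum(labels)
    best = None
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        if tp == 0:
            continue
        recall = tp / total_pos
        precision = tp / (tp + fp)
        if recall >= min_recall and (best is None or precision > best[1]):
            best = (t, precision, recall)
    return best

# Synthetic example: 10 findings with model scores and ground-truth labels.
scores = [0.95, 0.9, 0.85, 0.8, 0.6, 0.55, 0.4, 0.3, 0.2, 0.1]
labels = [1,    1,   0,    1,   1,   0,    0,   1,   0,   0]
t, precision, recall = best_threshold(scores, labels, min_recall=0.80)
```

Reporting the full precision-recall curve alongside the chosen operating point lets the security team see exactly how much recall each point of precision costs, rather than accepting a single opaque cutoff.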