You’re the analytics lead embedded in the Security Engineering org at Stripe, whose cloud infrastructure spans tens of thousands of hosts and containers across AWS/GCP and supports millions of merchant transactions per day. The company is preparing for a PCI DSS audit and has a board-level mandate to reduce “material security risk” after a recent industry-wide wave of ransomware exploiting known vulnerabilities.
Security leadership believes the patching program is too SLA-driven (e.g., “patch all Critical CVEs in 7 days”) and not risk-driven. Engineering teams complain that the current approach generates noisy queues and causes downtime for low-value systems, while genuinely risky exposures (internet-facing, exploitable, high-value assets) sometimes wait.
You are asked to design a metrics framework that prioritizes patches based on asset criticality and threat context, and to propose how to use the metric to drive weekly patching decisions.
Over the last six weeks, two asks have crystallized. The CISO wants a single KPI that answers: “Are we reducing real risk with the patch capacity we have?” The Head of Infrastructure wants a prioritization score that can be turned into an ordered backlog and measured week-over-week.
You have the following data sources to work with:
| Data Source | Description | Granularity |
|---|---|---|
| vuln_findings | Scanner findings with CVE, CVSS, detection time, affected package/version | Per asset-CVE |
| asset_inventory | Service ownership, environment, tier, data sensitivity, exposure | Per asset |
| threat_intel | Exploit maturity, EPSS, KEV flag, in-the-wild signals | Per CVE-day |
| patch_events | Patch deployed timestamps, method (reboot/live), change window | Per asset-CVE |
| security_events | WAF/IDS alerts, exploit attempt signatures, incident tickets | Per asset-day |
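To make these grains concrete, here is a minimal join sketch in Python/pandas. Table paths, column names, and the snapshot date are illustrative assumptions, not the real warehouse schema: it materializes the daily per-asset-CVE view of open findings, enriched with asset context and same-day threat intel.

```python
import pandas as pd

# Illustrative loads -- paths and column names are assumptions.
vuln_findings   = pd.read_parquet("vuln_findings.parquet")    # asset_id, cve_id, cvss, detected_at
asset_inventory = pd.read_parquet("asset_inventory.parquet")  # asset_id, tier, data_sensitivity, exposure
threat_intel    = pd.read_parquet("threat_intel.parquet")     # cve_id, as_of_date, epss, kev
patch_events    = pd.read_parquet("patch_events.parquet")     # asset_id, cve_id, patched_at

as_of = pd.Timestamp("2024-01-15")  # hypothetical daily snapshot date

# A finding is "open" at the snapshot if it was detected on or before
# the snapshot and not yet patched (or patched only after it).
findings = vuln_findings.merge(
    patch_events[["asset_id", "cve_id", "patched_at"]],
    on=["asset_id", "cve_id"],
    how="left",
)
open_mask = (findings["detected_at"] <= as_of) & (
    findings["patched_at"].isna() | (findings["patched_at"] > as_of)
)
open_findings = findings[open_mask]

# Enrich to the scoring grain: asset context (per asset) plus
# threat context as of the same day (per CVE-day).
intel_today = threat_intel[threat_intel["as_of_date"] == as_of]
exposure_view = (
    open_findings
    .merge(asset_inventory, on="asset_id", how="left")
    .merge(intel_today.drop(columns=["as_of_date"]), on="cve_id", how="left")
)
exposure_view["age_days"] = (as_of - exposure_view["detected_at"]).dt.days
```

Because every downstream metric is computed from this one daily snapshot view, the KPI, the ordered backlog, and the driver breakdowns all stay mutually consistent.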
Deliverables:
1. Define a North-Star KPI for risk-based patching that combines vulnerability severity, exploit likelihood and threat context (e.g., EPSS, KEV), asset criticality and exposure, and how long findings stay open.
2. Provide a calculation approach that is implementable in a data warehouse and can be computed daily (a sketch follows this list).
3. Decompose the KPI into actionable drivers so engineering teams can see why they are being asked to patch something (the same sketch shows a driver breakdown).
4. Propose benchmarks/targets for the KPI (what “good” looks like) and how you would set them using baseline data.
5. Recommend an operating cadence: how the metric drives weekly patch prioritization, including guardrails (e.g., downtime, change failure rate).
6. Diagnose a counterintuitive outcome: how could the org improve “% Critical CVEs patched in 7 days” while the risk-based KPI worsens? List three plausible causes and the analyses you’d run.
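One plausible shape for deliverables 1–3, sketched under assumed weights, tier labels, and column names (none of these are prescribed by the scenario): score each open asset-CVE pair as the product of normalized severity, exploit likelihood (EPSS, floored for KEV-listed CVEs), asset criticality, network exposure, and an age multiplier. The North-Star KPI is the sum of those scores, “open risk-weighted exposure,” and because the score is multiplicative, each factor doubles as a driver a team can read straight off its backlog.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the per-asset-CVE view built in the earlier sketch;
# all values, tier labels, and weights below are illustrative assumptions.
view = pd.DataFrame({
    "service":  ["payments-api", "payments-api", "batch-etl", "dev-sandbox"],
    "asset_id": ["a1", "a1", "b2", "c3"],
    "cve_id":   ["CVE-A", "CVE-B", "CVE-A", "CVE-C"],
    "cvss":     [7.5, 9.8, 7.5, 9.1],
    "epss":     [0.92, 0.02, 0.92, 0.01],
    "kev":      [True, False, True, False],
    "tier":     ["tier0", "tier0", "tier2", "tier2"],
    "exposure": ["internet", "internal", "internal", "internal"],
    "age_days": [12, 40, 75, 5],
})

# Multiplicative risk score: each factor is itself a reportable driver.
severity   = view["cvss"] / 10.0                                              # CVSS normalized to 0-1
likelihood = view["epss"].where(~view["kev"], np.maximum(view["epss"], 0.9))  # KEV floor (assumed 0.9)
crit_w     = view["tier"].map({"tier0": 1.0, "tier1": 0.7, "tier2": 0.4})     # assumed tier weights
net_w      = np.where(view["exposure"] == "internet", 1.0, 0.5)               # internet-facing counts double
age_w      = np.minimum(1.0 + view["age_days"] / 90.0, 2.0)                   # risk grows with age, capped at 2x

view["risk"] = severity * likelihood * crit_w * net_w * age_w

# North-Star KPI: total open risk-weighted exposure. Trend it daily;
# "are we reducing real risk with the capacity we have" = is this falling?
kpi = view["risk"].sum()

# Driver view: an ordered backlog where every row carries the factors
# that produced its rank, so a team can see why it should patch now.
backlog = view.sort_values("risk", ascending=False)
print(f"open risk-weighted exposure: {kpi:.3f}")
print(backlog[["service", "cve_id", "risk", "epss", "kev", "tier", "exposure", "age_days"]]
      .to_string(index=False))
```

The multiplicative form is a deliberate design choice: in the toy data, a KEV-listed CVSS 7.5 on an internet-facing tier-0 service outranks a CVSS 9.8 with negligible EPSS on an internal host, which is exactly the reordering the scenario says the SLA-driven program misses. Weekly prioritization then means working this backlog top-down until patch capacity is exhausted, with the KPI trend answering the CISO’s question.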
Constraints: