NimbusOps runs a B2B workflow automation platform used by 2,400 enterprise customers. Over the last 30 days, several customers reported intermittent outages even though the engineering dashboard still showed 99.95% uptime, so leadership wants a clearer framework for measuring service availability and overall system health.
In the past month, NimbusOps processed 48M API requests, 1.8M scheduled jobs, and 620k user login sessions. There were 3 major incidents: one 22-minute API outage in us-east-1, one 47-minute database degradation that increased p95 latency from 280 ms to 2,400 ms, and one authentication issue that caused login failures to spike from 0.6% to 8.9% for 90 minutes. Customer success says 14 strategic accounts were impacted, while the current executive report only includes a single uptime number.
| Data Source | Description | Granularity |
|---|---|---|
| request_logs | API request status, latency, region, endpoint, tenant_id | Per request |
| incident_log | Incident start/end, severity, impacted services, root cause | Per incident |
| auth_events | Login attempts, success/failure reason, region, device | Per event |
| job_runs | Scheduled job start/end, success/failure, queue delay | Per job |
| synthetic_checks | External uptime probes by region and service | Per check |
| customer_accounts | Plan tier, SLA tier, region, ARR band | Per account |