Track Service Availability for Cloud Platform

Business Context

NimbusOps runs a B2B workflow automation platform used by 2,400 enterprise customers. Over the last 30 days, several customers reported intermittent outages even though the engineering dashboard still showed 99.95% uptime, so leadership wants a clearer framework for measuring service availability and overall system health.

Metric Scenario

In the past month, NimbusOps processed 48M API requests, 1.8M scheduled jobs, and 620k user login sessions. There were 3 major incidents: one 22-minute API outage in us-east-1, one 47-minute database degradation that increased p95 latency from 280 ms to 2,400 ms, and one authentication issue that caused login failures to spike from 0.6% to 8.9% for 90 minutes. Customer success says 14 strategic accounts were impacted, while the current executive report only includes a single uptime number.
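The gap between the reported uptime and the customer experience can be checked with simple arithmetic. The sketch below assumes a 30-day window and treats each incident minute as a full minute of downtime for the whole platform. That is a deliberate simplification, since the API outage was regional and the other two incidents were degradations rather than hard outages. The names `WINDOW_MINUTES`, `incidents`, and `availability` are illustrative, and nothing here claims to reproduce how the engineering dashboard actually computes its figure.

```python
# Assumption: the reporting window is exactly 30 days.
WINDOW_MINUTES = 30 * 24 * 60  # 43,200 minutes

# Incident durations in minutes, taken from the scenario text.
incidents = {
    "api_outage_us_east_1": 22,
    "database_degradation": 47,
    "auth_login_failures": 90,
}


def availability(downtime_minutes: float, window_minutes: float = WINDOW_MINUTES) -> float:
    """Time-based availability as a percentage of the window."""
    return 100.0 * (1.0 - downtime_minutes / window_minutes)


# Downtime a 99.95% target allows over the window.
budget_minutes = WINDOW_MINUTES * (1 - 0.9995)

# Counting only the hard API outage as downtime.
hard_outage_only = availability(incidents["api_outage_us_east_1"])

# Counting every incident minute as downtime (simplifying assumption).
all_incidents = availability(sum(incidents.values()))

print(f"budget at 99.95%:  {budget_minutes:.1f} min")
print(f"hard outage only:  {hard_outage_only:.2f}%")
print(f"all incidents:     {all_incidents:.2f}%")
```

Under these assumptions, counting only the 22-minute outage rounds to 99.95%, while counting all 159 incident minutes gives roughly 99.63%. The single uptime number therefore depends heavily on what is allowed to count as downtime.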

Requirements

Define the core metrics you would track for service availability and service health.
Distinguish between customer-facing availability metrics and internal health/early-warning metrics.
Explain how you would calculate each metric and what should count as downtime or degradation.
Propose a metric hierarchy for executives, engineering managers, and on-call teams.
Identify likely decomposition cuts to diagnose whether issues are isolated or systemic.

Data Available

Data Source	Description	Granularity
request_logs	API request status, latency, region, endpoint, tenant_id	Per request
incident_log	Incident start/end, severity, impacted services, root cause	Per incident
auth_events	Login attempts, success/failure reason, region, device	Per event
job_runs	Scheduled job start/end, success/failure, queue delay	Per job
synthetic_checks	External uptime probes by region and service	Per check
customer_accounts	Plan tier, SLA tier, region, ARR band	Per account
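
As one possible starting point for the calculation and decomposition requirements, a request-based success ratio can be computed from `request_logs` and cut by region or tenant. This is a minimal sketch, not a prescribed answer: the field names `status` and `latency_ms`, the 1,000 ms latency threshold, and the sample rows are all hypothetical and chosen only to illustrate the technique.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, Hashable, Iterable


@dataclass(frozen=True)
class Request:
    """One row of request_logs (field names are assumed)."""
    tenant_id: str
    region: str
    status: int
    latency_ms: float


# Hypothetical threshold: slower requests count as degraded.
LATENCY_SLO_MS = 1000.0


def is_good(r: Request) -> bool:
    """A request is good if it did not fail server-side and was fast enough."""
    return r.status < 500 and r.latency_ms <= LATENCY_SLO_MS


def good_ratio_by(
    requests: Iterable[Request], key: Callable[[Request], Hashable]
) -> Dict[Hashable, float]:
    """Share of good requests per group, where `key` picks the cut."""
    totals: Dict[Hashable, int] = defaultdict(int)
    good: Dict[Hashable, int] = defaultdict(int)
    for r in requests:
        k = key(r)
        totals[k] += 1
        good[k] += int(is_good(r))
    return {k: good[k] / totals[k] for k in totals}


# Made-up sample rows, for illustration only.
sample = [
    Request("t1", "us-east-1", 200, 280),
    Request("t1", "us-east-1", 503, 30),
    Request("t2", "us-east-1", 200, 2400),
    Request("t2", "eu-west-1", 200, 300),
    Request("t3", "eu-west-1", 200, 250),
]

by_region = good_ratio_by(sample, lambda r: r.region)
by_tenant = good_ratio_by(sample, lambda r: r.tenant_id)
print(by_region)
print(by_tenant)
```

Passing a different `key` gives other cuts, such as endpoint or, after joining to `customer_accounts`, SLA tier. A ratio that is low in one region but healthy elsewhere points to an isolated issue, while a drop across every cut suggests a systemic one.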
