Context
You’re interviewing for a Senior Data Engineer role in Salesforce’s Global Service Reliability (GSR) org. GSR owns a real-time dashboard used by on-call SREs, incident commanders, and customer support leadership to monitor Salesforce service health across all regions (NA, EMEA, APAC) and products (Service Cloud, Sales Cloud, Platform).
Today, most operational reporting is built from a mix of Prometheus/Grafana (metrics), Splunk (logs), and a 15-minute-delayed batch ETL into Snowflake for executive reporting. During major incidents (e.g., a regional auth outage), teams struggle to answer basic questions quickly and consistently: “Is this global or regional?”, “Which tenants are impacted?”, “Is the error rate actually improving?”, and “Are we violating SLAs for premium customers?” The current setup also suffers from inconsistent definitions of health metrics across teams.
Your task is to design a production-grade, real-time data pipeline that powers a unified Service Health Dashboard with consistent metrics, low latency, and strong correctness guarantees.
Scale Requirements
- Event throughput:
- Logs: 1.5M events/sec peak, 400K/sec average (JSON logs from edge, API gateways, app servers); see the sizing sketch after this list
- Metrics: 250K samples/sec (Prometheus remote write)
- Traces (optional): 50K spans/sec (OpenTelemetry)
- Latency SLO:
- Dashboard freshness: P95 < 60 seconds from event time to queryable aggregates
- Incident drill-down (per tenant / per endpoint): P95 query latency < 2 seconds over the last 15 minutes of data
- Retention:
- Raw logs: 14 days hot, 90 days cold
- Aggregates: 13 months
- Geography: 3 major regions, multi-AZ per region; must tolerate a full AZ failure.
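To ground these numbers, here is a back-of-envelope sizing sketch in Python. The ~1 KB average event size is an assumption for illustration, not part of the spec; measure real payload sizes, and note that columnar compression typically shrinks the stored footprint considerably.

```python
# Rough capacity arithmetic from the stated throughput and retention.
# BYTES_PER_EVENT is an assumed average serialized size, not a given.
AVG_EPS = 400_000         # average log events/sec (from the spec)
PEAK_EPS = 1_500_000      # peak log events/sec (from the spec)
BYTES_PER_EVENT = 1_000   # assumption: ~1 KB per JSON log line

daily_bytes = AVG_EPS * 86_400 * BYTES_PER_EVENT
hot_days, cold_days = 14, 90

print(f"avg ingest:      {AVG_EPS * BYTES_PER_EVENT / 1e6:,.0f} MB/s")
print(f"peak ingest:     {PEAK_EPS * BYTES_PER_EVENT / 1e9:,.1f} GB/s")
print(f"raw volume:      {daily_bytes / 1e12:,.1f} TB/day")
print(f"hot tier (14d):  {daily_bytes * hot_days / 1e12:,.0f} TB uncompressed")
print(f"cold tier (90d): {daily_bytes * cold_days / 1e15:,.1f} PB uncompressed")
```

Even under these rough assumptions (~35 TB/day of raw logs, roughly half a petabyte of hot data uncompressed), it is clear the dashboard must be served from the aggregate layer, not from raw logs.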
Data Characteristics
Sources
- API Gateway logs (JSON): request_id, tenant_id, region, endpoint, http_status, latency_ms, bytes_out, user_agent, timestamp (a sample event is sketched after this list)
- Auth service logs: login failures, token minting latency, dependency errors
- Prometheus metrics: request_rate, error_rate, saturation, queue_depth
- Status signals: deploy events (Spinnaker/Argo), feature flag changes, incident tickets (ServiceNow)
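For concreteness, here is one representative API Gateway event matching the field list above. Every value is invented for illustration:

```python
# Illustrative API Gateway log event; all values are made up.
sample_event = {
    "request_id": "8f2c1a34-7d1b-4e0a-9c55-1b2f3d4e5a6b",  # dedup key on retries
    "tenant_id": "00Dxx0000001gPF",   # sensitive: least-privilege access only
    "region": "EMEA",
    "endpoint": "/api/v2/query",
    "http_status": 503,
    "latency_ms": 1842,
    "bytes_out": 512,
    "user_agent": "example-client/1.0",
    "timestamp": "2024-05-14T09:21:07.431Z",  # event time, not ingest time
}
```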
Common Issues
- Late-arriving data: up to 10 minutes late due to network partitions or regional buffering
- Duplicates: retries can produce duplicate log lines (same request_id); a dedup sketch follows this list
- Schema drift: teams add/remove fields; some services emit malformed JSON during failure modes
- Skew: a few large tenants generate disproportionate traffic; some endpoints are extremely hot
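A minimal dedup sketch keyed on request_id, assuming per-key state with a TTL longer than the 10-minute lateness horizon. This is plain in-process Python for illustration; a real pipeline would hold this state in the stream processor (e.g., Flink keyed state with state TTL) so it survives restarts and scales with partitions.

```python
from collections import OrderedDict
import time

class Deduplicator:
    """Drop events whose request_id was already seen within the TTL.

    In-memory sketch only: production dedup state should be keyed,
    checkpointed, and partitioned by request_id in the stream processor.
    """

    def __init__(self, ttl_seconds: float = 900.0):
        # 15-minute TTL comfortably exceeds the 10-minute late-arrival horizon.
        self.ttl = ttl_seconds
        self._seen = OrderedDict()  # request_id -> first-seen time, oldest first

    def is_duplicate(self, request_id: str, now: float | None = None) -> bool:
        now = time.monotonic() if now is None else now
        # Entries are in first-seen order, so expired ones sit at the front.
        while self._seen:
            _, seen_at = next(iter(self._seen.items()))
            if now - seen_at <= self.ttl:
                break
            self._seen.popitem(last=False)
        if request_id in self._seen:
            return True  # retry/duplicate: the first-seen copy was already counted
        self._seen[request_id] = now
        return False
```

First-seen wins: later copies inside the TTL are dropped, which pairs naturally with the deterministic aggregate keys discussed under Non-Functional Requirements.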
Requirements
Functional Requirements
- Compute real-time health KPIs by region, product, service, endpoint, and tenant tier:
- Availability proxy (% successful requests)
- Error rate (5xx, auth failures)
- Latency (P50/P95/P99)
- Saturation signals (queue depth, CPU throttling)
- Support incident drill-down: the last 15 minutes by tenant_id and endpoint, with near-real-time updates.
- Provide consistent metric definitions across teams (single source of truth).
- Handle late events (≤10 minutes) with correct windowed aggregations and updates (see the windowing sketch after this list).
- Provide a backfill mechanism for reprocessing the last 24 hours when a bug is found (an Airflow sketch also follows).
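To make the late-data requirement concrete, here is a minimal windowing sketch: 1-minute tumbling windows keyed by (window, region, endpoint), where events up to 10 minutes behind the observed high-water mark still update their window. Field names (epoch_s, etc.) and the single-stream watermark are simplifying assumptions; a real job (Flink, Spark Structured Streaming) would use per-partition watermarks and emit upserts downstream.

```python
from collections import defaultdict
from dataclasses import dataclass, field

WINDOW_S = 60             # 1-minute tumbling windows
ALLOWED_LATENESS_S = 600  # spec: accept events up to 10 minutes late

@dataclass
class WindowAgg:
    total: int = 0
    errors: int = 0  # 5xx responses
    latencies_ms: list = field(default_factory=list)  # sketch; use t-digest/HDR in practice

    def availability(self) -> float:
        # Availability proxy: share of successful requests in the window.
        return 1.0 - self.errors / self.total if self.total else 1.0

class KpiAggregator:
    def __init__(self):
        # (window_start, region, endpoint) -> running aggregate
        self.windows = defaultdict(WindowAgg)
        self.high_water = 0.0  # max event time seen (simplified watermark)

    def on_event(self, event: dict):
        ts = event["epoch_s"]  # hypothetical field: event time in epoch seconds
        self.high_water = max(self.high_water, ts)
        if self.high_water - ts > ALLOWED_LATENESS_S:
            # Beyond the lateness horizon: route to a late-data topic for
            # audit/backfill rather than silently dropping.
            return None
        key = (int(ts // WINDOW_S) * WINDOW_S, event["region"], event["endpoint"])
        agg = self.windows[key]
        agg.total += 1
        if event["http_status"] >= 500:
            agg.errors += 1
        agg.latencies_ms.append(event["latency_ms"])
        return key  # caller re-emits (upserts) the refreshed aggregate for this key
```

Returning the touched key is what lets the serving layer receive an updated aggregate whenever a late event lands, which is the “correct windowed aggregations and updates” requirement in practice.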
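For the 24-hour backfill, since Airflow is already the standard orchestrator (see Constraints), one option is a manually triggered, parameterized DAG that re-reads the raw archive for a given range and idempotently overwrites the affected aggregate partitions. Everything here (DAG id, conf keys, the helper body) is hypothetical, sketched against the Airflow 2.x API.

```python
# Hypothetical Airflow 2.x DAG: reprocess a caller-supplied time range.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def reprocess_window(dag_run=None, **_):
    # Hypothetical helper: re-read archived raw events for the requested
    # range and overwrite the matching aggregate partitions. Overwriting
    # whole partitions keeps the backfill idempotent and safely re-runnable.
    start_ts = dag_run.conf["start_ts"]  # e.g. "2024-05-14T00:00:00Z"
    end_ts = dag_run.conf["end_ts"]
    print(f"reprocessing {start_ts} .. {end_ts}")

with DAG(
    dag_id="kpi_backfill",
    start_date=datetime(2024, 1, 1),
    schedule=None,   # no schedule: triggered on demand with a date range
    catchup=False,
) as dag:
    PythonOperator(task_id="reprocess", python_callable=reprocess_window)
```

This would be triggered on demand, e.g. `airflow dags trigger kpi_backfill --conf '{"start_ts": "...", "end_ts": "..."}'`.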
Non-Functional Requirements
- Exactly-once or effectively-once aggregates (no double counting in KPIs); see the idempotent-upsert sketch after this list.
- High availability: pipeline continues during single-node/AZ failures.
- Observability: end-to-end lineage, lag, and data quality metrics.
- Security & compliance: tenant_id is sensitive; enforce least privilege and audit access.
- Cost constraint: incremental monthly spend target <$120K across streaming compute + warehouse.
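On the effectively-once point: a common pattern is to give every aggregate row a deterministic key derived from its window and dimensions, then upsert at the sink, so replays and late-window re-emits overwrite rather than double count. Table and column names below are hypothetical.

```python
import hashlib

def agg_row_key(window_start: int, region: str, endpoint: str) -> str:
    # Deterministic key: replaying the same input yields the same key,
    # so the sink upserts instead of appending a second copy.
    return hashlib.sha256(f"{window_start}|{region}|{endpoint}".encode()).hexdigest()

# Hypothetical Snowflake MERGE the sink would run per micro-batch:
MERGE_SQL = """
MERGE INTO kpi_minutely t          -- hypothetical aggregate table
USING staged_updates s             -- hypothetical staging table
  ON t.row_key = s.row_key
WHEN MATCHED THEN UPDATE SET
  t.total = s.total, t.errors = s.errors, t.updated_at = s.updated_at
WHEN NOT MATCHED THEN INSERT
  (row_key, window_start, region, endpoint, total, errors, updated_at)
  VALUES (s.row_key, s.window_start, s.region, s.endpoint,
          s.total, s.errors, s.updated_at)
"""
```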
Constraints
- Existing investments:
- Kafka is already used internally; teams are comfortable with it.
- Snowflake is the enterprise warehouse for analytics and executive reporting.
- Airflow is the standard orchestrator.
- You must support both:
- Real-time serving for dashboards
- Warehouse-grade history for weekly/monthly SLA reporting
- You cannot require every service team to change its logging format immediately; schema evolution must be supported (a drift-tolerant parsing sketch follows this list).
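A minimal sketch of drift-tolerant parsing at the ingest edge: validate only a small required core, carry unknown fields along in an extra map so new fields survive, and dead-letter anything malformed. The field sets here are illustrative assumptions.

```python
import json

REQUIRED = {"request_id", "timestamp"}
KNOWN = REQUIRED | {"tenant_id", "region", "endpoint",
                    "http_status", "latency_ms", "bytes_out", "user_agent"}

def parse_event(raw: bytes) -> tuple[str, dict]:
    """Route each record: ('ok', event) for usable records,
    ('dlq', wrapper) for malformed or incomplete ones."""
    try:
        obj = json.loads(raw)
    except ValueError:  # covers JSONDecodeError and bad encodings
        return "dlq", {"reason": "malformed_json",
                       "raw": raw.decode("utf-8", "replace")}
    if not isinstance(obj, dict) or not REQUIRED <= obj.keys():
        return "dlq", {"reason": "missing_required", "raw": obj}
    event = {k: obj.get(k) for k in KNOWN}
    # Preserve unrecognized fields so schema drift doesn't lose data.
    event["extra"] = {k: v for k, v in obj.items() if k not in KNOWN}
    return "ok", event
```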
What You Should Produce (as the candidate)
- A complete architecture (components + data flow) for ingestion, stream processing, storage, and serving.
- A data model for raw + curated + aggregate layers.
- Clear strategies for late data, deduplication, idempotency, and backfills.
- Monitoring/alerting plan with concrete thresholds.
- Failure modes and recovery strategies.
- Performance optimizations for both streaming compute and dashboard query latency.
You may assume AWS as the underlying cloud, but your design should be portable in principle.