Context
You’re interviewing for a Senior Data Engineer role in Salesforce’s Global Service Reliability (GSR) org. GSR owns a real-time dashboard used by on-call SREs, incident commanders, and customer support leadership to monitor Salesforce service health across all regions (NA, EMEA, APAC) and products (Service Cloud, Sales Cloud, Platform).
Today, most operational reporting is built from a mix of Prometheus/Grafana (metrics), Splunk (logs), and a 15-minute-delayed batch ETL into Snowflake for executive reporting. During major incidents (e.g., a regional auth outage), teams struggle to answer basic questions quickly and consistently: “Is this global or regional?”, “Which tenants are impacted?”, “Is the error rate actually improving?”, and “Are we violating SLAs for premium customers?” The current setup also suffers from inconsistent definitions of health metrics across teams.
Your task is to design a production-grade, real-time data pipeline that powers a unified Service Health Dashboard with consistent metrics, low latency, and strong correctness guarantees.
Scale Requirements
- Event throughput:
- Logs: 1.5M events/sec peak, 400K/sec average (JSON logs from edge, API gateways, app servers); see the sizing sketch after this list
- Metrics: 250K samples/sec (Prometheus remote write)
- Traces (optional): 50K spans/sec (OpenTelemetry)
- Latency SLO:
- Dashboard freshness: P95 < 60 seconds from event time to queryable aggregates
- Incident drill-down (per tenant / per endpoint): P95 query latency < 2 seconds over the last 15 minutes of data
- Retention:
- Raw logs: 14 days hot, 90 days cold
- Aggregates: 13 months
- Geography: 3 major regions, multi-AZ per region; must tolerate a full AZ failure.
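To ground these numbers, here is a back-of-envelope sizing sketch in Python. The ~1 KB average event size is an assumption for illustration, not part of the spec; measure real payload sizes, and note that columnar compression typically shrinks the stored footprint considerably.

```python
# Rough capacity arithmetic from the stated throughput and retention.
# BYTES_PER_EVENT is an assumed average serialized size, not a given.
AVG_EPS = 400_000         # average log events/sec (from the spec)
PEAK_EPS = 1_500_000      # peak log events/sec (from the spec)
BYTES_PER_EVENT = 1_000   # assumption: ~1 KB per JSON log line

daily_bytes = AVG_EPS * 86_400 * BYTES_PER_EVENT
hot_days, cold_days = 14, 90

print(f"avg ingest:      {AVG_EPS * BYTES_PER_EVENT / 1e6:,.0f} MB/s")
print(f"peak ingest:     {PEAK_EPS * BYTES_PER_EVENT / 1e9:,.1f} GB/s")
print(f"raw volume:      {daily_bytes / 1e12:,.1f} TB/day")
print(f"hot tier (14d):  {daily_bytes * hot_days / 1e12:,.0f} TB uncompressed")
print(f"cold tier (90d): {daily_bytes * cold_days / 1e15:,.1f} PB uncompressed")
```

Even under these rough assumptions (~35 TB/day of raw logs, roughly half a petabyte of hot data uncompressed), it is clear the dashboard must be served from the aggregate layer, not from raw logs.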
Data Characteristics
Sources
- API Gateway logs (JSON): request_id, tenant_id, region, endpoint, http_status, latency_ms, bytes_out, user_agent, timestamp (a sample event is sketched after this list)
- Auth service logs: login failures, token minting latency, dependency errors
- Prometheus metrics: request_rate, error_rate, saturation, queue_depth
- Status signals: deploy events (Spinnaker/Argo), feature flag changes, incident tickets (ServiceNow)
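For concreteness, here is one representative API Gateway event matching the field list above. Every value is invented for illustration:

```python
# Illustrative API Gateway log event; all values are made up.
sample_event = {
    "request_id": "8f2c1a34-7d1b-4e0a-9c55-1b2f3d4e5a6b",  # dedup key on retries
    "tenant_id": "00Dxx0000001gPF",   # sensitive: least-privilege access only
    "region": "EMEA",
    "endpoint": "/api/v2/query",
    "http_status": 503,
    "latency_ms": 1842,
    "bytes_out": 512,
    "user_agent": "example-client/1.0",
    "timestamp": "2024-05-14T09:21:07.431Z",  # event time, not ingest time
}
```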
Common Issues
- Late-arriving data: up to 10 minutes late due to network partitions or regional buffering
- Duplicates: retries can produce duplicate log lines (same request_id); a dedup sketch follows this list
- Schema drift: teams add/remove fields; some services emit malformed JSON during failure modes
- Skew: a few large tenants generate disproportionate traffic; some endpoints are extremely hot
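A minimal dedup sketch keyed on request_id, assuming per-key state with a TTL longer than the 10-minute lateness horizon. This is plain in-process Python for illustration; a real pipeline would hold this state in the stream processor (e.g., Flink keyed state with state TTL) so it survives restarts and scales with partitions.

```python
from collections import OrderedDict
import time

class Deduplicator:
    """Drop events whose request_id was already seen within the TTL.

    In-memory sketch only: production dedup state should be keyed,
    checkpointed, and partitioned by request_id in the stream processor.
    """

    def __init__(self, ttl_seconds: float = 900.0):
        # 15-minute TTL comfortably exceeds the 10-minute late-arrival horizon.
        self.ttl = ttl_seconds
        self._seen = OrderedDict()  # request_id -> first-seen time, oldest first

    def is_duplicate(self, request_id: str, now: float | None = None) -> bool:
        now = time.monotonic() if now is None else now
        # Entries are in first-seen order, so expired ones sit at the front.
        while self._seen:
            _, seen_at = next(iter(self._seen.items()))
            if now - seen_at <= self.ttl:
                break
            self._seen.popitem(last=False)
        if request_id in self._seen:
            return True  # retry/duplicate: the first-seen copy was already counted
        self._seen[request_id] = now
        return False
```

First-seen wins: later copies inside the TTL are dropped, which pairs naturally with the deterministic aggregate keys discussed under Non-Functional Requirements.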
Requirements
Functional Requirements
- Compute real-time health KPIs by region, product, service, endpoint, and tenant tier:
- Availability proxy (% successful requests)
- Error rate (5xx, auth failures)
- Latency (P50/P95/P99)
- Saturation signals (queue depth, CPU throttling)
- Support incident drill-down: the last 15 minutes by tenant_id and endpoint, with near-real-time updates.
- Provide consistent metric definitions across teams (single source of truth).
- Handle late events (≤10 minutes) with correct windowed aggregations and updates (see the windowing sketch after this list).
- Provide a backfill mechanism for reprocessing the last 24 hours when a bug is found (an Airflow sketch also follows).
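To make the late-data requirement concrete, here is a minimal windowing sketch: 1-minute tumbling windows keyed by (window, region, endpoint), where events up to 10 minutes behind the observed high-water mark still update their window. Field names (epoch_s, etc.) and the single-stream watermark are simplifying assumptions; a real job (Flink, Spark Structured Streaming) would use per-partition watermarks and emit upserts downstream.

```python
from collections import defaultdict
from dataclasses import dataclass, field

WINDOW_S = 60             # 1-minute tumbling windows
ALLOWED_LATENESS_S = 600  # spec: accept events up to 10 minutes late

@dataclass
class WindowAgg:
    total: int = 0
    errors: int = 0  # 5xx responses
    latencies_ms: list = field(default_factory=list)  # sketch; use t-digest/HDR in practice

    def availability(self) -> float:
        # Availability proxy: share of successful requests in the window.
        return 1.0 - self.errors / self.total if self.total else 1.0

class KpiAggregator:
    def __init__(self):
        # (window_start, region, endpoint) -> running aggregate
        self.windows = defaultdict(WindowAgg)
        self.high_water = 0.0  # max event time seen (simplified watermark)

    def on_event(self, event: dict):
        ts = event["epoch_s"]  # hypothetical field: event time in epoch seconds
        self.high_water = max(self.high_water, ts)
        if self.high_water - ts > ALLOWED_LATENESS_S:
            # Beyond the lateness horizon: route to a late-data topic for
            # audit/backfill rather than silently dropping.
            return None
        key = (int(ts // WINDOW_S) * WINDOW_S, event["region"], event["endpoint"])
        agg = self.windows[key]
        agg.total += 1
        if event["http_status"] >= 500:
            agg.errors += 1
        agg.latencies_ms.append(event["latency_ms"])
        return key  # caller re-emits (upserts) the refreshed aggregate for this key
```

Returning the touched key is what lets the serving layer receive an updated aggregate whenever a late event lands, which is the “correct windowed aggregations and updates” requirement in practice.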
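For the 24-hour backfill, since Airflow is already the standard orchestrator (see Constraints), one option is a manually triggered, parameterized DAG that re-reads the raw archive for a given range and idempotently overwrites the affected aggregate partitions. Everything here (DAG id, conf keys, the helper body) is hypothetical, sketched against the Airflow 2.x API.

```python
# Hypothetical Airflow 2.x DAG: reprocess a caller-supplied time range.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def reprocess_window(dag_run=None, **_):
    # Hypothetical helper: re-read archived raw events for the requested
    # range and overwrite the matching aggregate partitions. Overwriting
    # whole partitions keeps the backfill idempotent and safely re-runnable.
    start_ts = dag_run.conf["start_ts"]  # e.g. "2024-05-14T00:00:00Z"
    end_ts = dag_run.conf["end_ts"]
    print(f"reprocessing {start_ts} .. {end_ts}")

with DAG(
    dag_id="kpi_backfill",
    start_date=datetime(2024, 1, 1),
    schedule=None,   # no schedule: triggered on demand with a date range
    catchup=False,
) as dag:
    PythonOperator(task_id="reprocess", python_callable=reprocess_window)
```

This would be triggered on demand, e.g. `airflow dags trigger kpi_backfill --conf '{"start_ts": "...", "end_ts": "..."}'`.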
Non-Functional Requirements
- Exactly-once or effectively-once aggregates (no double counting in KPIs); see the idempotent-upsert sketch after this list.
- High availability: pipeline continues during single-node/AZ failures.
- Observability: end-to-end lineage, lag, and data quality metrics.
- Security & compliance: tenant_id is sensitive; enforce least privilege and audit access.
- Cost constraint: incremental monthly spend target <$120K across streaming compute + warehouse.
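On the effectively-once point: a common pattern is to give every aggregate row a deterministic key derived from its window and dimensions, then upsert at the sink, so replays and late-window re-emits overwrite rather than double count. Table and column names below are hypothetical.

```python
import hashlib

def agg_row_key(window_start: int, region: str, endpoint: str) -> str:
    # Deterministic key: replaying the same input yields the same key,
    # so the sink upserts instead of appending a second copy.
    return hashlib.sha256(f"{window_start}|{region}|{endpoint}".encode()).hexdigest()

# Hypothetical Snowflake MERGE the sink would run per micro-batch:
MERGE_SQL = """
MERGE INTO kpi_minutely t          -- hypothetical aggregate table
USING staged_updates s             -- hypothetical staging table
  ON t.row_key = s.row_key
WHEN MATCHED THEN UPDATE SET
  t.total = s.total, t.errors = s.errors, t.updated_at = s.updated_at
WHEN NOT MATCHED THEN INSERT
  (row_key, window_start, region, endpoint, total, errors, updated_at)
  VALUES (s.row_key, s.window_start, s.region, s.endpoint,
          s.total, s.errors, s.updated_at)
"""
```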
Constraints
- Existing investments:
- Kafka is already used internally; teams are comfortable with it.
- Snowflake is the enterprise warehouse for analytics and executive reporting.
- Airflow is the standard orchestrator.
- You must support both:
- Real-time serving for dashboards
- Warehouse-grade history for weekly/monthly SLA reporting
- You cannot require every service team to change its logging format immediately; schema evolution must be supported (a drift-tolerant parsing sketch follows this list).
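A minimal sketch of drift-tolerant parsing at the ingest edge: validate only a small required core, carry unknown fields along in an extra map so new fields survive, and dead-letter anything malformed. The field sets here are illustrative assumptions.

```python
import json

REQUIRED = {"request_id", "timestamp"}
KNOWN = REQUIRED | {"tenant_id", "region", "endpoint",
                    "http_status", "latency_ms", "bytes_out", "user_agent"}

def parse_event(raw: bytes) -> tuple[str, dict]:
    """Route each record: ('ok', event) for usable records,
    ('dlq', wrapper) for malformed or incomplete ones."""
    try:
        obj = json.loads(raw)
    except ValueError:  # covers JSONDecodeError and bad encodings
        return "dlq", {"reason": "malformed_json",
                       "raw": raw.decode("utf-8", "replace")}
    if not isinstance(obj, dict) or not REQUIRED <= obj.keys():
        return "dlq", {"reason": "missing_required", "raw": obj}
    event = {k: obj.get(k) for k in KNOWN}
    # Preserve unrecognized fields so schema drift doesn't lose data.
    event["extra"] = {k: v for k, v in obj.items() if k not in KNOWN}
    return "ok", event
```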
What You Should Produce (as the candidate)
- A complete architecture (components + data flow) for ingestion, stream processing, storage, and serving.
- A data model for raw + curated + aggregate layers.
- Clear strategies for late data, deduplication, idempotency, and backfills.
- Monitoring/alerting plan with concrete thresholds.
- Failure modes and recovery strategies.
- Performance optimizations for both streaming compute and dashboard query latency.
You may assume AWS as the underlying cloud, but your design should be portable in principle.