
You’re interviewing for a Senior Data Engineer role on Salesforce’s Global Service Reliability (GSR) org. GSR owns a real-time dashboard used by on-call SREs, incident commanders, and customer support leadership to monitor Salesforce service health across all regions (NA, EMEA, APAC) and products (Service Cloud, Sales Cloud, Platform).
Today, most operational reporting is built from a mix of Prometheus/Grafana (metrics), Splunk (logs), and a batch ETL into Snowflake with a 15-minute delay for executive reporting. During major incidents (e.g., a regional auth outage), teams struggle to answer basic questions quickly and consistently: “Is this global or regional?”, “Which tenants are impacted?”, “Is the error rate actually improving?”, and “Are we violating SLAs for premium customers?” Definitions of health metrics also differ from team to team.
Your task is to design a production-grade, real-time data pipeline that powers a unified Service Health Dashboard with consistent metrics, low latency, and strong correctness guarantees.
You may assume AWS as the underlying cloud, but your design should be portable in principle.