Context
Databricks runs containerized internal services on Kubernetes and wants a pipeline that reconstructs the end-to-end network path of requests hitting a service inside the cluster. Today, packet captures and pod logs are inspected manually, which is too slow for incident response and makes it hard to explain failures across ingress, service routing, and pod-to-pod hops.
Design a telemetry pipeline that ingests Kubernetes, Databricks Lakehouse Monitoring, and network observability signals to build a near-real-time request-path dataset for debugging and analytics.
Scale Requirements
- Cluster size: 250 Kubernetes nodes, 6,000 pods, 900 services
- Traffic: 120K HTTP/gRPC requests/sec peak, 25K avg
- Telemetry volume: ~8 TB/day of raw logs, traces, and flow records
- Latency target: request-path records queryable in < 2 minutes from request completion
- Retention: 30 days raw Bronze, 180 days curated Silver/Gold
Requirements
- Ingest request telemetry from Databricks-managed Kubernetes ingress, service mesh or CNI flow logs, Kubernetes API events, and application traces (an ingestion sketch follows this list).
- Reconstruct each request's path: external load balancer/ingress → Kubernetes Service → kube-proxy/eBPF routing → target Pod/container.
- Correlate records using request IDs, source/destination IP:port tuples, pod metadata, and time windows (see the correlation sketch after this list).
- Produce Delta tables for:
  - raw telemetry
  - normalized network hops
  - request-level path summaries (a possible schema is sketched after this list)
  - failed or ambiguous correlations
- Support backfills for historical incident windows and idempotent reprocessing (see the backfill sketch below).
- Add data quality checks for missing hop segments, clock skew, duplicate trace IDs, and schema drift (see the expectations sketch below).
- Orchestrate streaming and batch recovery using Databricks Workflows (a job-definition sketch follows).
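
For ingestion, a minimal sketch of one Bronze stream using Auto Loader is shown below. The paths, table names, and column names are illustrative assumptions, not a prescribed layout; only metadata fields are projected, which also satisfies the no-payload constraint.

```python
# Sketch: Bronze ingestion of CNI flow logs via Auto Loader.
# All paths, table names, and columns here are illustrative assumptions.
from pyspark.sql import functions as F

flow_bronze = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/Volumes/netobs/bronze/_schemas/flow_logs")
    .load("/Volumes/netobs/landing/flow_logs/")
    # Project metadata only; payload/body fields are never selected,
    # per the constraint that sensitive bodies must not be stored.
    .select(
        "request_id", "trace_id", "src_ip", "src_port",
        "dst_ip", "dst_port", "protocol", "status_code", "event_time",
    )
    .withColumn("ingested_at", F.current_timestamp())
)

(
    flow_bronze.writeStream
    .option("checkpointLocation", "/Volumes/netobs/bronze/_checkpoints/flow_logs")
    .trigger(processingTime="30 seconds")  # keeps end-to-end latency well under 2 min
    .toTable("netobs.bronze.flow_logs")
)
```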
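Path reconstruction can be expressed as watermarked stream-stream joins between adjacent hop sources. The sketch below joins ingress spans to flow records on request ID plus IP:port continuity; table and column names are assumptions, and the 16-minute watermark is sized to absorb the 15-minute late-arrival constraint.

```python
# Sketch: correlate ingress spans with flow records into normalized hops.
# Table and column names are illustrative assumptions.
from pyspark.sql import functions as F

ingress = (
    spark.readStream.table("netobs.bronze.ingress_spans")
    # 16-minute watermark absorbs telemetry arriving up to 15 minutes late.
    .withWatermark("event_time", "16 minutes")
)
flows = (
    spark.readStream.table("netobs.bronze.flow_logs")
    .withWatermark("event_time", "16 minutes")
)

hops = ingress.alias("i").join(
    flows.alias("f"),
    F.expr("""
        i.request_id = f.request_id
        AND i.dst_ip = f.src_ip AND i.dst_port = f.src_port
        AND f.event_time BETWEEN i.event_time
                             AND i.event_time + INTERVAL 2 MINUTES
    """),
    "leftOuter",  # unmatched rows feed the failed/ambiguous-correlation table
)
```

The left-outer join is deliberate: ingress records with no matching flow within the time bound surface as null-padded rows, which is exactly the population the failed/ambiguous-correlation table needs.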
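One possible shape for the request-level path summary table; the catalog, table, and field names are assumptions rather than a prescribed schema.

```python
# Sketch: a possible Gold-layer schema for request-level path summaries.
spark.sql("""
    CREATE TABLE IF NOT EXISTS netobs.gold.request_paths (
        request_id   STRING,
        started_at   TIMESTAMP,
        completed_at TIMESTAMP,
        ingress_host STRING,
        service_name STRING,
        pod_name     STRING,
        node_name    STRING,
        hop_count    INT,
        hops         ARRAY<STRUCT<seq: INT, layer: STRING,
                                  src: STRING, dst: STRING,
                                  latency_ms: DOUBLE>>,
        path_status  STRING,  -- complete | partial | ambiguous
        event_date   DATE
    )
    USING DELTA
    PARTITIONED BY (event_date)
""")
```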
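Backfills become idempotent if the job recomputes a bounded incident window and overwrites only that slice of the target table; Delta's `replaceWhere` makes reruns safe. The window bounds and names below are illustrative.

```python
# Sketch: idempotent backfill of a historical incident window.
# Reruns overwrite exactly the same slice, so reprocessing is safe.
window_start, window_end = "2024-05-01T10:00:00", "2024-05-01T14:00:00"
predicate = f"event_time >= '{window_start}' AND event_time < '{window_end}'"

backfill = (
    spark.read.table("netobs.bronze.flow_logs")
    .where(predicate)
    # ...apply the same normalization logic as the streaming path...
)

(
    backfill.write.format("delta")
    .mode("overwrite")
    .option("replaceWhere", predicate)  # replace only the incident window
    .saveAsTable("netobs.silver.network_hops")
)
```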
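The quality checks map naturally onto Delta Live Tables expectations. The sketch below (illustrative names and thresholds) flags null request IDs, uses event-vs-ingest skew as a crude clock-skew proxy, and drops duplicate spans within the watermark; `dropDuplicatesWithinWatermark` assumes a recent runtime (Spark 3.5+). Missing hop segments fall out of the left-outer correlation join above, and schema drift is tracked upstream by Auto Loader's schema location rather than shown here.

```python
# Sketch: quality checks as Delta Live Tables expectations.
# Table names, columns, and thresholds are illustrative assumptions.
import dlt

@dlt.table(name="silver_network_hops")
@dlt.expect("has_request_id", "request_id IS NOT NULL")
# Crude clock-skew proxy: event time vs. ingest time within ~16 minutes.
@dlt.expect("bounded_clock_skew",
            "abs(unix_timestamp(ingested_at) - unix_timestamp(event_time)) <= 960")
def silver_network_hops():
    return (
        dlt.read_stream("bronze_flow_logs")
        .withWatermark("event_time", "16 minutes")
        # Duplicate trace IDs: keep one row per (trace_id, span_id).
        .dropDuplicatesWithinWatermark(["trace_id", "span_id"])
    )
```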
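Orchestration can stay entirely in managed Workflows: the streaming pipeline runs continuously, while a parameterized recovery job reruns the backfill notebook for a given window. A sketch using the Databricks SDK for Python follows; the job name, notebook path, and parameters are assumptions, and serverless job compute is assumed so no cluster spec is given.

```python
# Sketch: define a batch-recovery job with the Databricks SDK for Python.
# Job name, notebook path, and parameters are illustrative assumptions;
# serverless job compute is assumed, so no cluster spec is attached.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()
w.jobs.create(
    name="netobs-batch-recovery",
    tasks=[
        jobs.Task(
            task_key="reprocess_incident_window",
            notebook_task=jobs.NotebookTask(
                notebook_path="/Repos/netobs/pipelines/backfill",
                base_parameters={
                    "window_start": "2024-05-01T10:00:00",
                    "window_end": "2024-05-01T14:00:00",
                },
            ),
        ),
    ],
)
```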
Constraints
- Use the Databricks Lakehouse as the primary storage and serving layer.
- Minimize operational complexity; assume a small platform team of 3 engineers.
- Some telemetry arrives late by up to 15 minutes.
- Must avoid storing sensitive payload bodies; only metadata is allowed.
- Budget favors managed Databricks services over self-managed Kafka/Spark infrastructure.