Context
Databricks runs containerized internal services on Kubernetes and wants a pipeline that reconstructs the end-to-end network path of requests hitting a service inside the cluster. Today, packet captures and pod logs are inspected manually, which is too slow for incident response and makes it hard to explain failures across ingress, service routing, and pod-to-pod hops.
Design a telemetry pipeline that ingests Kubernetes, Databricks Lakehouse Monitoring, and network observability signals to build a near-real-time request-path dataset for debugging and analytics.
Scale Requirements
- Cluster size: 250 Kubernetes nodes, 6,000 pods, 900 services
- Traffic: 120K HTTP/gRPC requests/sec peak, 25K avg
- Telemetry volume: ~8 TB/day of raw logs, traces, and flow records
- Latency target: request-path records queryable in < 2 minutes from request completion
- Retention: 30 days raw Bronze, 180 days curated Silver/Gold
Requirements
- Ingest request telemetry from Databricks-managed Kubernetes ingress, service mesh or CNI flow logs, Kubernetes API events, and application traces (an ingestion sketch follows this list).
- Reconstruct each request's path: external load balancer/ingress → Kubernetes Service → kube-proxy/eBPF routing → target Pod/container.
- Correlate records using request IDs, source/destination IP:port tuples, pod metadata, and time windows (see the correlation sketch after this list).
- Produce Delta tables for:
  - raw telemetry
  - normalized network hops
  - request-level path summaries (a possible schema is sketched after this list)
  - failed or ambiguous correlations
- Support backfills for historical incident windows and idempotent reprocessing (see the backfill sketch below).
- Add data quality checks for missing hop segments, clock skew, duplicate trace IDs, and schema drift (see the expectations sketch below).
- Orchestrate streaming and batch recovery using Databricks Workflows (a job-definition sketch follows).
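
For ingestion, a minimal sketch of one Bronze stream using Auto Loader is shown below. The paths, table names, and column names are illustrative assumptions, not a prescribed layout; only metadata fields are projected, which also satisfies the no-payload constraint.

```python
# Sketch: Bronze ingestion of CNI flow logs via Auto Loader.
# All paths, table names, and columns here are illustrative assumptions.
from pyspark.sql import functions as F

flow_bronze = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/Volumes/netobs/bronze/_schemas/flow_logs")
    .load("/Volumes/netobs/landing/flow_logs/")
    # Project metadata only; payload/body fields are never selected,
    # per the constraint that sensitive bodies must not be stored.
    .select(
        "request_id", "trace_id", "src_ip", "src_port",
        "dst_ip", "dst_port", "protocol", "status_code", "event_time",
    )
    .withColumn("ingested_at", F.current_timestamp())
)

(
    flow_bronze.writeStream
    .option("checkpointLocation", "/Volumes/netobs/bronze/_checkpoints/flow_logs")
    .trigger(processingTime="30 seconds")  # keeps end-to-end latency well under 2 min
    .toTable("netobs.bronze.flow_logs")
)
```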
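Path reconstruction can be expressed as watermarked stream-stream joins between adjacent hop sources. The sketch below joins ingress spans to flow records on request ID plus IP:port continuity; table and column names are assumptions, and the 16-minute watermark is sized to absorb the 15-minute late-arrival constraint.

```python
# Sketch: correlate ingress spans with flow records into normalized hops.
# Table and column names are illustrative assumptions.
from pyspark.sql import functions as F

ingress = (
    spark.readStream.table("netobs.bronze.ingress_spans")
    # 16-minute watermark absorbs telemetry arriving up to 15 minutes late.
    .withWatermark("event_time", "16 minutes")
)
flows = (
    spark.readStream.table("netobs.bronze.flow_logs")
    .withWatermark("event_time", "16 minutes")
)

hops = ingress.alias("i").join(
    flows.alias("f"),
    F.expr("""
        i.request_id = f.request_id
        AND i.dst_ip = f.src_ip AND i.dst_port = f.src_port
        AND f.event_time BETWEEN i.event_time
                             AND i.event_time + INTERVAL 2 MINUTES
    """),
    "leftOuter",  # unmatched rows feed the failed/ambiguous-correlation table
)
```

The left-outer join is deliberate: ingress records with no matching flow within the time bound surface as null-padded rows, which is exactly the population the failed/ambiguous-correlation table needs.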
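One possible shape for the request-level path summary table; the catalog, table, and field names are assumptions rather than a prescribed schema.

```python
# Sketch: a possible Gold-layer schema for request-level path summaries.
spark.sql("""
    CREATE TABLE IF NOT EXISTS netobs.gold.request_paths (
        request_id   STRING,
        started_at   TIMESTAMP,
        completed_at TIMESTAMP,
        ingress_host STRING,
        service_name STRING,
        pod_name     STRING,
        node_name    STRING,
        hop_count    INT,
        hops         ARRAY<STRUCT<seq: INT, layer: STRING,
                                  src: STRING, dst: STRING,
                                  latency_ms: DOUBLE>>,
        path_status  STRING,  -- complete | partial | ambiguous
        event_date   DATE
    )
    USING DELTA
    PARTITIONED BY (event_date)
""")
```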
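Backfills become idempotent if the job recomputes a bounded incident window and overwrites only that slice of the target table; Delta's `replaceWhere` makes reruns safe. The window bounds and names below are illustrative.

```python
# Sketch: idempotent backfill of a historical incident window.
# Reruns overwrite exactly the same slice, so reprocessing is safe.
window_start, window_end = "2024-05-01T10:00:00", "2024-05-01T14:00:00"
predicate = f"event_time >= '{window_start}' AND event_time < '{window_end}'"

backfill = (
    spark.read.table("netobs.bronze.flow_logs")
    .where(predicate)
    # ...apply the same normalization logic as the streaming path...
)

(
    backfill.write.format("delta")
    .mode("overwrite")
    .option("replaceWhere", predicate)  # replace only the incident window
    .saveAsTable("netobs.silver.network_hops")
)
```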
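The quality checks map naturally onto Delta Live Tables expectations. The sketch below (illustrative names and thresholds) flags null request IDs, uses event-vs-ingest skew as a crude clock-skew proxy, and drops duplicate spans within the watermark; `dropDuplicatesWithinWatermark` assumes a recent runtime (Spark 3.5+). Missing hop segments fall out of the left-outer correlation join above, and schema drift is tracked upstream by Auto Loader's schema location rather than shown here.

```python
# Sketch: quality checks as Delta Live Tables expectations.
# Table names, columns, and thresholds are illustrative assumptions.
import dlt

@dlt.table(name="silver_network_hops")
@dlt.expect("has_request_id", "request_id IS NOT NULL")
# Crude clock-skew proxy: event time vs. ingest time within ~16 minutes.
@dlt.expect("bounded_clock_skew",
            "abs(unix_timestamp(ingested_at) - unix_timestamp(event_time)) <= 960")
def silver_network_hops():
    return (
        dlt.read_stream("bronze_flow_logs")
        .withWatermark("event_time", "16 minutes")
        # Duplicate trace IDs: keep one row per (trace_id, span_id).
        .dropDuplicatesWithinWatermark(["trace_id", "span_id"])
    )
```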
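Orchestration can stay entirely in managed Workflows: the streaming pipeline runs continuously, while a parameterized recovery job reruns the backfill notebook for a given window. A sketch using the Databricks SDK for Python follows; the job name, notebook path, and parameters are assumptions, and serverless job compute is assumed so no cluster spec is given.

```python
# Sketch: define a batch-recovery job with the Databricks SDK for Python.
# Job name, notebook path, and parameters are illustrative assumptions;
# serverless job compute is assumed, so no cluster spec is attached.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()
w.jobs.create(
    name="netobs-batch-recovery",
    tasks=[
        jobs.Task(
            task_key="reprocess_incident_window",
            notebook_task=jobs.NotebookTask(
                notebook_path="/Repos/netobs/pipelines/backfill",
                base_parameters={
                    "window_start": "2024-05-01T10:00:00",
                    "window_end": "2024-05-01T14:00:00",
                },
            ),
        ),
    ],
)
```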
Constraints
- Use the Databricks Lakehouse as the primary storage and serving layer.
- Minimize operational complexity; assume a small platform team of 3 engineers.
- Some telemetry arrives late by up to 15 minutes.
- Must avoid storing sensitive payload bodies; only metadata is allowed.
- Budget favors managed Databricks services over self-managed Kafka/Spark infrastructure.