Build Splunk Observability Log Pipeline

Difficulty: Easy
Category: Pipelines
Tags: Infrastructure, Quality, Tools
Asked at: NVIDIA

Problem

Context

PulseCart, a mid-sized e-commerce company, runs customer-facing services on Kubernetes and currently ships application logs, infrastructure metrics, and deployment events into separate tools. The operations team wants a unified observability pipeline centered on Splunk so they can search logs, correlate incidents, detect failures faster, and support postmortems without relying on manual log collection.

You are asked to design a production-grade pipeline that ingests operational telemetry from microservices, databases, and Kubernetes clusters into Splunk for monitoring and troubleshooting, while also preserving raw data in low-cost storage for replay and audit.

Scale Requirements

  • Services: 120 microservices across 3 Kubernetes clusters
  • Log volume: 2 TB/day raw JSON logs
  • Metrics/events: 1.5M metric datapoints/minute, 50K deployment/audit events/hour
  • Peak throughput: 80K log events/second during incidents
  • Latency target: searchable in Splunk within 60 seconds
  • Retention: 30 days hot in Splunk, 180 days archived in S3
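The scale figures above translate into sustained rates and storage footprints that drive the sizing discussion. The following is a rough back-of-envelope sketch, not a capacity plan. It assumes 1 TB = 10^12 bytes, no compression, and a hypothetical average event size of 1,000 bytes for the peak estimate, because the problem statement does not give an event size. The function name `size_pipeline` is illustrative only.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def size_pipeline(avg_event_bytes: int = 1_000) -> dict:
    """Convert the stated scale requirements into per-second rates and totals.

    avg_event_bytes is an assumption (not given in the problem statement);
    it is only used for the peak bandwidth estimate.
    """
    raw_bytes_per_day = 2 * 10**12           # 2 TB/day of raw JSON logs
    metric_points_per_min = 1_500_000        # 1.5M metric datapoints/minute
    audit_events_per_hour = 50_000           # 50K deployment/audit events/hour
    peak_events_per_sec = 80_000             # 80K log events/second at peak

    return {
        # Average log bandwidth over a full day, in MB/s (about 23.1)
        "avg_mb_per_sec": raw_bytes_per_day / SECONDS_PER_DAY / 10**6,
        # 1.5M per minute is 25,000 datapoints per second
        "metric_points_per_sec": metric_points_per_min / 60,
        # 50K per hour is roughly 13.9 events per second
        "audit_events_per_sec": audit_events_per_hour / 3600,
        # Peak log bandwidth under the assumed event size
        "peak_mb_per_sec": peak_events_per_sec * avg_event_bytes / 10**6,
        # Uncompressed raw volume held for each retention tier
        "hot_raw_tb": raw_bytes_per_day * 30 // 10**12,
        "archive_raw_tb": raw_bytes_per_day * 180 // 10**12,
    }


if __name__ == "__main__":
    for name, value in size_pipeline().items():
        print(f"{name}: {value:,.1f}")
```

Under these assumptions the average log rate is about 23 MB/s, and the two retention tiers hold roughly 60 TB and 360 TB of uncompressed raw data. Real figures depend on compression and on how much Splunk indexing adds, neither of which is stated here.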

Requirements

  1. Ingest logs, metrics, and operational events from Kubernetes, application services, CI/CD, and PostgreSQL into a centralized pipeline.
  2. Normalize records into a common schema with fields such as service, env, cluster, trace_id, severity, and event_time.
  3. Route valid telemetry to Splunk for observability use cases including incident triage, alerting, dashboarding, and root-cause analysis.
  4. Persist raw and rejected records to Amazon S3 for replay, compliance, and backfills.
  5. Implement data quality checks for malformed JSON, missing required fields, duplicate deployment events, and timestamp drift.
  6. Orchestrate batch backfills and replay workflows without interrupting real-time ingestion.
  7. Define monitoring, alerting, and failure recovery for all stages.
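Requirements 2, 4, and 5 can be illustrated with a small validation step that either accepts a record or returns a rejection reason, so rejected records can be routed to S3. This is a minimal sketch under several assumptions: all six schema fields are treated as required, `event_time` is an ISO 8601 string with an explicit offset, the drift tolerance is 5 minutes, and deployment events carry a hypothetical `deploy_id` field used for deduplication. None of these specifics come from the problem statement.

```python
import json
from datetime import datetime, timedelta, timezone
from typing import Optional, Set, Tuple

REQUIRED_FIELDS = ("service", "env", "cluster", "trace_id", "severity", "event_time")
MAX_DRIFT = timedelta(minutes=5)  # assumed tolerance, tune per source


def validate(raw: str, now: datetime, seen_deploy_ids: Set[str]) -> Tuple[str, Optional[dict]]:
    """Return (status, record). Status is "ok" or a rejection reason."""
    # Check 1: malformed JSON (also reject non-object payloads)
    try:
        record = json.loads(raw)
    except json.JSONDecodeError:
        return "malformed_json", None
    if not isinstance(record, dict):
        return "malformed_json", None

    # Check 2: missing or empty required fields
    missing = [f for f in REQUIRED_FIELDS if record.get(f) in (None, "")]
    if missing:
        return "missing_fields", record

    # Check 3: timestamp drift relative to the processing clock
    try:
        event_time = datetime.fromisoformat(record["event_time"])
    except (TypeError, ValueError):
        return "bad_timestamp", record
    if event_time.tzinfo is None:
        event_time = event_time.replace(tzinfo=timezone.utc)  # assume UTC if naive
    if abs(now - event_time) > MAX_DRIFT:
        return "timestamp_drift", record

    # Check 4: duplicate deployment events, keyed on the hypothetical deploy_id
    deploy_id = record.get("deploy_id")
    if deploy_id is not None:
        if deploy_id in seen_deploy_ids:
            return "duplicate_deploy_event", record
        seen_deploy_ids.add(deploy_id)

    return "ok", record
```

Records with status "ok" would continue to Splunk, while any other status would be written to a rejected prefix in S3 together with the reason. The in-memory `seen_deploy_ids` set stands in for whatever keyed state a real stream processor would provide.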

Constraints

  • AWS is the required cloud platform.
  • The team already uses Apache Airflow 2.x and Amazon S3.
  • Incremental budget is capped at $18K/month.
  • PII in logs must be masked before indexing into Splunk.
  • The design should avoid vendor lock-in for preprocessing so another sink could be added later.
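Two of the constraints above, masking PII before indexing and keeping preprocessing independent of the sink, can be sketched together. The example below is illustrative only: the regular expressions cover just email addresses and card-like digit runs (and will also match other 13 to 16 digit numbers), the `message` field name is hypothetical, and `InMemorySink` is a stand-in for real Splunk or S3 writers, which are not shown.

```python
import re
from typing import Iterable, List, Protocol

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# 13 to 16 digits, optionally separated by single spaces or hyphens
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")


def mask_pii(text: str) -> str:
    """Replace email addresses and card-like numbers with fixed tokens."""
    text = EMAIL_RE.sub("<email>", text)
    return CARD_RE.sub("<card>", text)


class Sink(Protocol):
    """Any destination that can accept a batch of already-masked records."""

    def send(self, records: List[dict]) -> None: ...


class InMemorySink:
    """Stand-in sink for the sketch; a real one would call Splunk or S3."""

    def __init__(self) -> None:
        self.records: List[dict] = []

    def send(self, records: List[dict]) -> None:
        self.records.extend(records)


def fan_out(records: Iterable[dict], sinks: Iterable[Sink]) -> None:
    """Mask the hypothetical 'message' field once, then deliver to every sink."""
    masked = [
        {**r, "message": mask_pii(str(r.get("message", "")))} for r in records
    ]
    for sink in sinks:
        sink.send(masked)


if __name__ == "__main__":
    sink = InMemorySink()
    fan_out([{"service": "cart", "message": "user jane@example.com paid"}], [sink])
    print(sink.records)
```

Because masking happens before `fan_out` hands records to any sink, adding another destination later means writing one more class with a `send` method. This sketch masks every sink's copy, so the raw archive in S3 would need its own unmasked path with appropriate access controls if compliance requires the original data.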
