Build Splunk Observability Log Pipeline

Context

PulseCart, a mid-sized e-commerce company, runs customer-facing services on Kubernetes and currently ships application logs, infrastructure metrics, and deployment events into separate tools. The operations team wants a unified observability pipeline centered on Splunk so they can search logs, correlate incidents, detect failures faster, and support postmortems without relying on manual log collection.

You are asked to design a production-grade pipeline that ingests operational telemetry from microservices, databases, and Kubernetes clusters into Splunk for monitoring and troubleshooting, while also preserving raw data in low-cost storage for replay and audit.

Scale Requirements

Services: 120 microservices across 3 Kubernetes clusters
Log volume: 2 TB/day raw JSON logs
Metrics/events: 1.5M metric datapoints/minute, 50K deployment/audit events/hour
Peak throughput: 80K log events/second during incidents
Latency target: searchable in Splunk within 60 seconds
Retention: 30 days hot in Splunk, 180 days archived in S3
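
The scale figures above translate into sustained rates and storage footprints with some simple arithmetic. The sketch below is a back-of-envelope calculation only. It assumes decimal units (1 TB = 10**12 bytes), ignores compression and replication, and reads the 180-day figure as 180 days of raw logs held in S3. The variable names are illustrative and not part of the problem statement.

```python
# Back-of-envelope sizing from the stated scale requirements.
# Assumptions: decimal units, no compression, no replication overhead.
SECONDS_PER_DAY = 24 * 60 * 60
TB = 10**12

raw_log_tb_per_day = 2
metric_points_per_min = 1_500_000
audit_events_per_hour = 50_000
hot_days, archive_days = 30, 180

# Average sustained log bandwidth in MB/s (peaks during incidents will be higher).
avg_log_mb_per_sec = raw_log_tb_per_day * TB / SECONDS_PER_DAY / 10**6

# Per-second rates for the lower-volume streams.
metric_points_per_sec = metric_points_per_min / 60
audit_events_per_sec = audit_events_per_hour / 3600

# Raw (uncompressed) volume held at each retention tier.
hot_raw_tb = raw_log_tb_per_day * hot_days
archive_raw_tb = raw_log_tb_per_day * archive_days

print(f"average log bandwidth: {avg_log_mb_per_sec:.2f} MB/s")
print(f"metric datapoints:     {metric_points_per_sec:.0f}/s")
print(f"deploy/audit events:   {audit_events_per_sec:.1f}/s")
print(f"hot tier (raw):        {hot_raw_tb} TB")
print(f"S3 archive (raw):      {archive_raw_tb} TB")
```

Under these assumptions the average log rate is roughly 23 MB/s, metrics arrive at 25,000 datapoints per second, and the raw footprint is about 60 TB hot and 360 TB archived. The 80K events/second peak cannot be converted to bandwidth without an average event size, which the problem statement does not give.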

Requirements

Ingest logs, metrics, and operational events from Kubernetes, application services, CI/CD, and PostgreSQL into a centralized pipeline.
Normalize records into a common schema with fields such as service, env, cluster, trace_id, severity, and event_time.
Route valid telemetry to Splunk for observability use cases including incident triage, alerting, dashboarding, and root-cause analysis.
Persist raw and rejected records to Amazon S3 for replay, compliance, and backfills.
Implement data quality checks for malformed JSON, missing required fields, duplicate deployment events, and timestamp drift.
Orchestrate batch backfills and replay workflows without interrupting real-time ingestion.
Define monitoring, alerting, and failure recovery for all stages.
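
The data quality requirements above can be sketched as a single validation step that either returns a normalized record or a rejection reason (rejected records would go to S3). This is a minimal illustration, not a reference design. The 5-minute drift tolerance, the `event_type` and `deploy_id` field names, and the in-memory duplicate set are all assumptions; a real pipeline would need bounded, shared state for deduplication and a different drift policy for replayed data.

```python
import json
from datetime import datetime, timedelta, timezone

# Common-schema fields named in the requirements.
REQUIRED_FIELDS = ("service", "env", "cluster", "trace_id", "severity", "event_time")
MAX_DRIFT = timedelta(minutes=5)  # assumed tolerance; not specified in the source


def validate(raw_line, now, seen_deploy_ids):
    """Return (record, None) if valid, or (None, reason) if the record is rejected."""
    try:
        record = json.loads(raw_line)
    except json.JSONDecodeError:
        return None, "malformed_json"
    if not isinstance(record, dict):
        return None, "malformed_json"

    missing = [f for f in REQUIRED_FIELDS if not record.get(f)]
    if missing:
        return None, "missing_fields:" + ",".join(missing)

    try:
        event_time = datetime.fromisoformat(record["event_time"])
    except (TypeError, ValueError):
        return None, "bad_timestamp"
    if event_time.tzinfo is None:
        # Assumption: timestamps without an offset are treated as UTC.
        event_time = event_time.replace(tzinfo=timezone.utc)
    if abs(now - event_time) > MAX_DRIFT:
        return None, "timestamp_drift"

    # Hypothetical field names for deployment events.
    if record.get("event_type") == "deployment":
        deploy_id = str(record.get("deploy_id"))
        if deploy_id in seen_deploy_ids:
            return None, "duplicate_deployment"
        seen_deploy_ids.add(deploy_id)

    # Normalize event_time to UTC ISO 8601.
    record["event_time"] = event_time.astimezone(timezone.utc).isoformat()
    return record, None
```

Passing `now` as a parameter keeps the function deterministic for testing and lets a backfill job supply the original ingestion time instead of the wall clock.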

Constraints

AWS is the required cloud platform.
The team already uses Apache Airflow 2.x and Amazon S3.
Incremental budget is capped at $18K/month.
PII in logs must be masked before indexing into Splunk.
The design should avoid vendor lock-in for preprocessing so another sink could be added later.
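
Two of the constraints above, masking PII before indexing and keeping preprocessing independent of any one sink, can be illustrated together. The sketch below is one possible shape under stated assumptions: the `SENSITIVE_KEYS` set, the email pattern, the `Sink` protocol, and `ListSink` are all hypothetical names introduced for illustration, and real PII detection would need broader, tuned rules. Whether the raw S3 archive should also be masked is a policy decision the problem statement leaves open.

```python
import re
from typing import Protocol

# Illustrative rules only; real masking needs a reviewed, tuned rule set.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SENSITIVE_KEYS = {"email", "phone", "card_number"}  # hypothetical field names


def mask_pii(value):
    """Recursively mask emails in strings and redact values under sensitive keys."""
    if isinstance(value, str):
        return EMAIL_RE.sub("<email>", value)
    if isinstance(value, dict):
        return {
            k: "<redacted>" if k in SENSITIVE_KEYS else mask_pii(v)
            for k, v in value.items()
        }
    if isinstance(value, list):
        return [mask_pii(v) for v in value]
    return value


class Sink(Protocol):
    """Any destination that accepts a normalized record (Splunk today, others later)."""

    def write(self, record: dict) -> None: ...


class ListSink:
    """In-memory stand-in for a real sink, used here only for demonstration."""

    def __init__(self):
        self.records = []

    def write(self, record):
        self.records.append(record)


def fan_out(record, sinks):
    """Mask once, then deliver the same masked record to every indexing sink."""
    masked = mask_pii(record)
    for sink in sinks:
        sink.write(masked)
```

Because masking happens before the fan-out and sinks share one small interface, adding another destination later would mean writing one more class with a `write` method rather than changing the preprocessing step.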

