Context
PulseCart, a mid-sized e-commerce company, runs customer-facing services on Kubernetes and currently ships application logs, infrastructure metrics, and deployment events into separate tools. The operations team wants a unified observability pipeline centered on Splunk so they can search logs, correlate incidents, detect failures faster, and support postmortems without relying on manual log collection.
You are asked to design a production-grade pipeline that ingests operational telemetry from microservices, databases, and Kubernetes clusters into Splunk for monitoring and troubleshooting, while also preserving raw data in low-cost storage for replay and audit.
Scale Requirements
- Services: 120 microservices across 3 Kubernetes clusters
- Log volume: 2 TB/day raw JSON logs
- Metrics/events: 1.5M metric datapoints/minute, 50K deployment/audit events/hour
- Peak throughput: 80K log events/second during incidents
- Latency target: searchable in Splunk within 60 seconds
- Retention: 30 days hot in Splunk, 180 days archived in S3
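As a rough sizing aid, the figures above can be converted into per-second rates and storage totals. This is a back-of-the-envelope sketch only: it assumes decimal terabytes (10^12 bytes), a flat daily volume, no compression, and that S3 holds every raw day for the full 180 days. None of these assumptions are stated in the requirements.

```python
# Back-of-the-envelope sizing from the stated scale requirements.
# Assumptions (not from the requirements): decimal TB, flat daily volume,
# no compression, S3 keeps each raw day for the full 180 days.
SECONDS_PER_DAY = 86_400
BYTES_PER_TB = 10**12
BYTES_PER_MB = 10**6

log_tb_per_day = 2
avg_log_mb_per_sec = log_tb_per_day * BYTES_PER_TB / SECONDS_PER_DAY / BYTES_PER_MB

metric_points_per_sec = 1_500_000 / 60   # 1.5M datapoints per minute
audit_events_per_sec = 50_000 / 3_600    # 50K events per hour

# Events emitted during one 60-second latency window at peak load.
peak_events_per_window = 80_000 * 60

hot_raw_tb = log_tb_per_day * 30         # 30 days hot in Splunk
archive_raw_tb = log_tb_per_day * 180    # 180 days archived in S3

print(f"average log rate:      {avg_log_mb_per_sec:.1f} MB/s")
print(f"metric datapoints:     {metric_points_per_sec:.0f}/s")
print(f"deploy/audit events:   {audit_events_per_sec:.1f}/s")
print(f"peak events per 60 s:  {peak_events_per_window}")
print(f"hot raw volume:        {hot_raw_tb} TB")
print(f"archive raw volume:    {archive_raw_tb} TB")
```

The average log rate (about 23 MB/s) is far below what the 80K events/second incident peak implies for most event sizes, so buffering and autoscaling should be sized for the peak, not the daily average.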
Requirements
- Ingest logs, metrics, and operational events from Kubernetes, application services, CI/CD, and PostgreSQL into a centralized pipeline.
- Normalize records into a common schema with fields such as service, env, cluster, trace_id, severity, and event_time.
- Route valid telemetry to Splunk for observability use cases including incident triage, alerting, dashboarding, and root-cause analysis.
- Persist raw and rejected records to Amazon S3 for replay, compliance, and backfills.
- Implement data quality checks for malformed JSON, missing required fields, duplicate deployment events, and timestamp drift.
- Orchestrate batch backfills and replay workflows without interrupting real-time ingestion.
- Define monitoring, alerting, and failure recovery for all stages.
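The data quality requirements above can be illustrated with a minimal validation sketch. It is not a prescribed implementation: the 5-minute drift tolerance, the `deploy_id` field used for deduplication, the in-memory `seen_deploy_ids` set, and the route names "splunk" and "rejected" are all hypothetical choices made for the example. A real pipeline would need a shared, expiring dedup store and a drift threshold agreed with the operations team.

```python
import json
from datetime import datetime, timedelta, timezone

# Required fields come from the common schema in the requirements.
REQUIRED_FIELDS = ("service", "env", "cluster", "trace_id", "severity", "event_time")
MAX_DRIFT = timedelta(minutes=5)  # assumed tolerance, not from the requirements


def reject(reason, raw_line):
    """Build a rejected-record envelope that keeps the raw line for replay."""
    return "rejected", {"reason": reason, "raw": raw_line}


def validate(raw_line, now, seen_deploy_ids):
    """Return (route, payload), where route is "splunk" or "rejected".

    `now` must be a timezone-aware datetime. `seen_deploy_ids` is a set
    that this function updates; `deploy_id` is a hypothetical field.
    """
    try:
        record = json.loads(raw_line)
    except json.JSONDecodeError:
        return reject("malformed_json", raw_line)
    if not isinstance(record, dict):
        return reject("malformed_json", raw_line)

    missing = [f for f in REQUIRED_FIELDS if record.get(f) in (None, "")]
    if missing:
        return reject("missing_fields:" + ",".join(missing), raw_line)

    try:
        event_time = datetime.fromisoformat(record["event_time"])
    except (TypeError, ValueError):
        return reject("bad_timestamp", raw_line)
    if event_time.tzinfo is None:
        # Assumption: naive timestamps are treated as UTC.
        event_time = event_time.replace(tzinfo=timezone.utc)
    if abs(now - event_time) > MAX_DRIFT:
        return reject("timestamp_drift", raw_line)

    deploy_id = record.get("deploy_id")
    if deploy_id is not None:
        key = str(deploy_id)
        if key in seen_deploy_ids:
            return reject("duplicate_deployment", raw_line)
        seen_deploy_ids.add(key)

    return "splunk", record
```

Returning the raw line alongside a reason code means rejected records can be written to S3 unchanged and replayed later once the upstream producer is fixed.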
Constraints
- AWS is the required cloud platform.
- The team already uses Apache Airflow 2.x and Amazon S3.
- Incremental budget is capped at $18K/month.
- PII in logs must be masked before indexing into Splunk.
- The design should avoid vendor lock-in for preprocessing so another sink could be added later.
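One way to read the last two constraints together is to mask PII in a vendor-neutral preprocessing step and deliver the result through a small sink interface, so that Splunk is just one implementation. The sketch below is illustrative only: the `Sink` protocol, `InMemorySink`, `mask_pii`, and `dispatch_masked` are hypothetical names, and the email-only regex stands in for whatever PII classes the team actually needs to cover (it does not handle nested fields, phone numbers, or card numbers).

```python
import re
from typing import Protocol

# Deliberately simple email pattern; real PII coverage would be broader.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
MASK = "<masked-email>"


def mask_pii(record: dict) -> dict:
    """Return a copy of the record with emails masked in top-level strings."""
    return {
        key: EMAIL_RE.sub(MASK, value) if isinstance(value, str) else value
        for key, value in record.items()
    }


class Sink(Protocol):
    """Hypothetical interface any downstream destination would implement."""

    def send(self, records: list) -> None: ...


class InMemorySink:
    """Stand-in sink for the example; a Splunk sink would have the same shape."""

    def __init__(self) -> None:
        self.received: list = []

    def send(self, records: list) -> None:
        self.received.extend(records)


def dispatch_masked(records: list, sinks: list) -> list:
    """Mask once, then deliver the same masked batch to every indexing sink."""
    masked = [mask_pii(r) for r in records]
    for sink in sinks:
        sink.send(masked)
    return masked


if __name__ == "__main__":
    splunk_like = InMemorySink()
    dispatch_masked(
        [{"service": "checkout", "message": "login by a.b@example.com failed"}],
        [splunk_like],
    )
    print(splunk_like.received[0]["message"])
```

Because masking happens before any sink is called, adding a second destination later means appending another `Sink` implementation to the list without touching the preprocessing logic.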