You own the ingestion pipeline for a B2B SaaS security product. It processes email security telemetry that feeds detection systems and customer-facing analytics. The current design works at steady state, but a new enterprise rollout is expected to increase inbound traffic 10x over the next quarter. Recent incidents showed delayed downstream updates, duplicate records during retries, and inconsistent counts between the raw and curated datasets. You need a minimal redesign that can absorb the traffic increase without adding a large operational footprint.

| Component | Status / Technology |
|---|---|
| Event Sources | Email gateway webhooks, API collectors, internal app events |
| Ingestion | Python services writing directly to PostgreSQL |
| Processing | Cron-based Python ETL every 15 minutes |
| Storage | PostgreSQL for raw and transformed data |
| Orchestration | Basic cron on Kubernetes |
| Serving | Detection features and internal dashboards |

Scale: 25K events/sec at peak today, with 250K events/sec expected at peak after the rollout. Average payload is 3-5 KB of JSON. Current freshness is 15-20 minutes, and the target is under 3 minutes for curated tables. Retention is 30 days hot and 1 year cold.
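
To make the scale figures concrete, the following sketch converts them into peak ingest bandwidth and the number of events that can arrive within one freshness window. It is a back-of-the-envelope calculation only: it assumes decimal units (1 KB = 1,000 bytes) and ignores compression, replication, and protocol overhead. The function names are illustrative.

```python
# Back-of-the-envelope sizing from the stated scale figures.
# Assumption: decimal units (1 KB = 1,000 bytes); no compression,
# replication, or protocol overhead is included.
BYTES_PER_KB = 1_000
BYTES_PER_MB = 1_000_000


def peak_bandwidth_mb_per_s(events_per_sec: int, payload_kb: float) -> float:
    """Raw payload bandwidth at peak, in MB/s."""
    return events_per_sec * payload_kb * BYTES_PER_KB / BYTES_PER_MB


def events_in_window(events_per_sec: int, window_seconds: int) -> int:
    """Events arriving during one freshness window at a sustained peak rate."""
    return events_per_sec * window_seconds


if __name__ == "__main__":
    for label, rate in (("today", 25_000), ("after rollout", 250_000)):
        low = peak_bandwidth_mb_per_s(rate, 3)
        high = peak_bandwidth_mb_per_s(rate, 5)
        print(f"{label}: {low:.0f}-{high:.0f} MB/s at peak")

    # The curated-table freshness target is under 3 minutes (180 seconds).
    backlog = events_in_window(250_000, 180)
    print(f"events per 3-minute window at peak: {backlog:,}")
```

Under these assumptions, peak payload bandwidth grows from roughly 75-125 MB/s to 750-1,250 MB/s, and up to 45 million events can arrive within a single 3-minute window at sustained peak. Retention sizing is left out on purpose: it depends on the average event rate, which the scenario does not state.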
How would you redesign this ingestion pipeline in the simplest production-ready way to handle 10x traffic while preserving data quality, replayability, and operational visibility? Walk through the architecture, scaling decisions, and the trade-offs you would make to keep the system intentionally minimal.
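
As a starting point for the data-quality part of the question, here is a minimal sketch of one common way to make retried writes idempotent: derive a deterministic key for each event and skip any write whose key has already been stored. This is an illustration, not a prescribed answer. The field names `source` and `source_event_id`, the `DedupingSink` class, and the in-memory dictionary standing in for a table are all assumptions made for the example.

```python
import hashlib
import json


def idempotency_key(event: dict) -> str:
    """Derive a stable key so a retried event maps to the same row.

    Assumption: producers may attach hypothetical `source` and
    `source_event_id` fields. When the ID is missing, fall back to a
    hash of the canonicalized payload.
    """
    source_event_id = event.get("source_event_id")
    if source_event_id is not None:
        basis = f"{event.get('source', '')}:{source_event_id}"
    else:
        # Sorted keys make the hash independent of key order in the JSON.
        basis = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(basis.encode("utf-8")).hexdigest()


class DedupingSink:
    """In-memory stand-in for a table with a unique key constraint."""

    def __init__(self) -> None:
        self.rows: dict[str, dict] = {}

    def write(self, event: dict) -> bool:
        """Store the event once; return False if it was a duplicate."""
        key = idempotency_key(event)
        if key in self.rows:
            return False
        self.rows[key] = event
        return True


if __name__ == "__main__":
    sink = DedupingSink()
    event = {"source": "gateway", "source_event_id": "abc-123", "verdict": "spam"}
    print(sink.write(event))        # first delivery is stored
    print(sink.write(dict(event)))  # retried delivery is skipped
    print(len(sink.rows))
```

In PostgreSQL, the same idea is usually expressed with a unique constraint on the key column and `INSERT ... ON CONFLICT DO NOTHING`, so retries become no-ops. Counting rejected duplicates separately from stored rows also gives a simple way to reconcile raw and curated counts.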