Context
PulseReach, a B2C marketing automation platform, currently ingests customer events from web SDKs, mobile apps, CRM updates, and email engagement logs into hourly S3 batches that are processed overnight. With end-to-end latency measured in hours, that architecture is too slow for use cases such as cart-abandonment campaigns, suppression-list updates, and near-real-time audience segmentation.
You need to design a high-throughput event pipeline that supports operational segmentation within minutes while still feeding the warehouse for historical analytics and model features.
Scale Requirements
- Throughput: 250K events/sec peak, 60K events/sec average
- Event types: page_view, product_view, add_to_cart, purchase, email_open, email_click, profile_update
- Event size: 1-3 KB JSON
- Daily volume: ~5B events/day, ~9-12 TB raw/day (sanity-checked in the sketch after this list)
- Latency target: segment membership updates in < 2 minutes, warehouse availability in < 5 minutes
- Retention: raw immutable events for 180 days; curated aggregates for 2 years
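These targets hang together arithmetically. The sketch below reproduces the daily-volume figures and derives a rough size for the durable buffer, assuming a 2 KB average event and a Kinesis-style buffer; the 2 KB midpoint and the choice of Kinesis are illustrative assumptions, while the per-shard write limits (1 MB/s, 1,000 records/s) are the standard Kinesis quotas.

```python
# Back-of-envelope check of the scale numbers above (pure arithmetic, no AWS calls).
AVG_EPS = 60_000        # average events/sec, from the brief
PEAK_EPS = 250_000      # peak events/sec, from the brief
AVG_EVENT_KB = 2.0      # assumed midpoint of the 1-3 KB JSON range

events_per_day = AVG_EPS * 86_400                     # ~5.2B, matching ~5B/day
raw_tb_per_day = events_per_day * AVG_EVENT_KB / 1e9  # ~10.4 TB, within 9-12 TB
raw_tb_retained = raw_tb_per_day * 180                # ~1.9 PB at 180-day retention

# Shard floor at peak: per-shard writes cap at 1 MB/s and 1,000 records/s,
# so whichever dimension saturates first sets the count.
shards_by_bytes = PEAK_EPS * AVG_EVENT_KB / 1_000     # 500 MB/s -> ~500 shards
shards_by_records = PEAK_EPS / 1_000                  # -> 250 shards

print(f"{events_per_day / 1e9:.1f}B events/day, {raw_tb_per_day:.1f} TB/day raw")
print(f"{raw_tb_retained / 1e3:.1f} PB raw at 180-day retention")
print(f"peak shard floor: {max(shards_by_bytes, shards_by_records):.0f}")
```

At roughly 500 shards at peak, the durable buffer alone is a meaningful line item against the budget cap below, so provisioned-versus-on-demand capacity is worth modeling early.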
Requirements
- Ingest events from browser/mobile SDKs, backend APIs, and SaaS connectors with durable buffering and replay.
- Validate schemas, deduplicate events, and enforce idempotent downstream writes (a dedup sketch follows this list).
- Enrich events with customer identity resolution and consent status before segmentation.
- Compute near-real-time segment membership (for example: "viewed product twice, no purchase in 24h, email-opted-in"; see the rule sketch after this list).
- Persist raw and curated data for analytics, backfills, and campaign auditability.
- Orchestrate batch backfills and dimension refreshes without disrupting streaming SLAs.
- Provide monitoring, alerting, and recovery for late data, schema drift, and downstream outages.
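A minimal model of the dedup requirement, assuming a producer-assigned event_id and a 24-hour dedup horizon; neither is fixed by the brief, and in production this state would live in the stream processor's keyed state or a shared store such as Redis rather than process memory:

```python
import time
from collections import OrderedDict

class TtlDeduper:
    """Drops replayed events whose event_id was seen within the TTL window."""

    def __init__(self, ttl_seconds: int = 24 * 3600):
        self.ttl = ttl_seconds
        self._seen = OrderedDict()  # event_id -> first-seen timestamp, oldest first

    def is_duplicate(self, event_id: str, now: float | None = None) -> bool:
        now = time.time() if now is None else now
        # Evict expired ids from the front; OrderedDict preserves insertion order.
        while self._seen:
            oldest_id, ts = next(iter(self._seen.items()))
            if now - ts < self.ttl:
                break
            self._seen.popitem(last=False)
        if event_id in self._seen:
            return True
        self._seen[event_id] = now
        return False
```

Dedup at ingest only narrows the duplicate window; downstream writes still need idempotency keys (see the sketch after the constraints) so that replays older than the TTL stay safe.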
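And one way to express the example rule ("viewed product twice, no purchase in 24h, email-opted-in") as keyed per-customer state plus a predicate. The field names (ts, type, email_opted_in) are assumptions, and the rule is simplified to count views of any product; a real pipeline would hold CustomerState in the stream processor's keyed state with windowed eviction:

```python
import time
from dataclasses import dataclass, field

WINDOW_SECONDS = 24 * 3600  # 24h lookback from the example rule

@dataclass
class CustomerState:
    product_views: list = field(default_factory=list)  # view timestamps
    last_purchase: float | None = None
    email_opted_in: bool = False

def apply_event(state: CustomerState, event: dict) -> None:
    """Fold one event into per-customer state (event shape is assumed)."""
    ts = event["ts"]
    if event["type"] == "product_view":
        state.product_views.append(ts)
    elif event["type"] == "purchase":
        state.last_purchase = ts
    elif event["type"] == "profile_update":
        state.email_opted_in = event.get("email_opted_in", state.email_opted_in)

def in_segment(state: CustomerState, now: float | None = None) -> bool:
    """Viewed product >= 2x in 24h, no purchase in 24h, email-opted-in."""
    now = time.time() if now is None else now
    recent_views = [t for t in state.product_views if now - t <= WINDOW_SECONDS]
    no_recent_purchase = (state.last_purchase is None
                          or now - state.last_purchase > WINDOW_SECONDS)
    return len(recent_views) >= 2 and no_recent_purchase and state.email_opted_in
```

Note that the "no purchase in 24h" clause can flip a customer into the segment with no new event arriving, so membership must be re-evaluated on timers as well as on events; this is part of what makes the sub-2-minute target a stream-processing problem rather than a micro-batch one.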
Constraints
- AWS-first environment with existing S3, Snowflake, and Airflow footprint
- Incremental budget cap of $35K/month
- Must support GDPR/CCPA deletion requests within 72 hours
- Small platform team: 5 data engineers, 1 SRE
- Campaign systems require exactly-once or effectively-once segment updates to avoid duplicate sends (see the conditional-write sketch after this list)
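On the last constraint, "effectively-once" is usually achieved with an idempotency key rather than true exactly-once delivery: the segment updater claims a key for each (customer, segment, transition) with a conditional write, and only the first claimer notifies the campaign system. A sketch with boto3 against a hypothetical DynamoDB table (the table name and key scheme are assumptions):

```python
import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.client("dynamodb")
TABLE = "segment_update_dedupe"  # hypothetical table: pk = idempotency key

def publish_segment_update(customer_id: str, segment_id: str, version: int) -> bool:
    """Return True only for the first delivery of this exact transition;
    replays and retries hit the condition failure and are dropped."""
    key = f"{customer_id}#{segment_id}#{version}"
    try:
        dynamodb.put_item(
            TableName=TABLE,
            Item={"pk": {"S": key}},
            ConditionExpression="attribute_not_exists(pk)",
        )
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # already delivered; safe to drop
        raise
    # first writer wins: notify the campaign system here
    return True
```

A TTL attribute on the table keeps the dedupe set from growing without bound; the TTL just has to exceed the maximum replay horizon.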