Context
PulseReach, a B2C marketing automation platform, currently ingests customer events from web SDKs, mobile apps, CRM updates, and email engagement logs into hourly S3 batches that are processed overnight. With end-to-end latency measured in hours, that architecture is too slow for use cases such as cart-abandonment campaigns, suppression-list updates, and near-real-time audience segmentation.
You need to design a high-throughput event pipeline that supports operational segmentation within minutes while still feeding the warehouse for historical analytics and model features.
Scale Requirements
- Throughput: 250K events/sec peak, 60K events/sec average
- Event types: page_view, product_view, add_to_cart, purchase, email_open, email_click, profile_update
- Event size: 1-3 KB JSON
- Daily volume: ~5B events/day, ~9-12 TB raw/day (sanity-checked in the sketch after this list)
- Latency target: segment membership updates in < 2 minutes, warehouse availability in < 5 minutes
- Retention: raw immutable events for 180 days; curated aggregates for 2 years
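These targets hang together arithmetically. The sketch below reproduces the daily-volume figures and derives a rough size for the durable buffer, assuming a 2 KB average event and a Kinesis-style buffer; the 2 KB midpoint and the choice of Kinesis are illustrative assumptions, while the per-shard write limits (1 MB/s, 1,000 records/s) are the standard Kinesis quotas.

```python
# Back-of-envelope check of the scale numbers above (pure arithmetic, no AWS calls).
AVG_EPS = 60_000        # average events/sec, from the brief
PEAK_EPS = 250_000      # peak events/sec, from the brief
AVG_EVENT_KB = 2.0      # assumed midpoint of the 1-3 KB JSON range

events_per_day = AVG_EPS * 86_400                     # ~5.2B, matching ~5B/day
raw_tb_per_day = events_per_day * AVG_EVENT_KB / 1e9  # ~10.4 TB, within 9-12 TB
raw_tb_retained = raw_tb_per_day * 180                # ~1.9 PB at 180-day retention

# Shard floor at peak: per-shard writes cap at 1 MB/s and 1,000 records/s,
# so whichever dimension saturates first sets the count.
shards_by_bytes = PEAK_EPS * AVG_EVENT_KB / 1_000     # 500 MB/s -> ~500 shards
shards_by_records = PEAK_EPS / 1_000                  # -> 250 shards

print(f"{events_per_day / 1e9:.1f}B events/day, {raw_tb_per_day:.1f} TB/day raw")
print(f"{raw_tb_retained / 1e3:.1f} PB raw at 180-day retention")
print(f"peak shard floor: {max(shards_by_bytes, shards_by_records):.0f}")
```

At roughly 500 shards at peak, the durable buffer alone is a meaningful line item against the budget cap below, so provisioned-versus-on-demand capacity is worth modeling early.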
Requirements
- Ingest events from browser/mobile SDKs, backend APIs, and SaaS connectors with durable buffering and replay.
- Validate schemas, deduplicate events, and enforce idempotent downstream writes (a dedup sketch follows this list).
- Enrich events with customer identity resolution and consent status before segmentation.
- Compute near-real-time segment membership (for example: "viewed product twice, no purchase in 24h, email-opted-in"; see the rule sketch after this list).
- Persist raw and curated data for analytics, backfills, and campaign auditability.
- Orchestrate batch backfills and dimension refreshes without disrupting streaming SLAs.
- Provide monitoring, alerting, and recovery for late data, schema drift, and downstream outages.
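A minimal model of the dedup requirement, assuming a producer-assigned event_id and a 24-hour dedup horizon; neither is fixed by the brief, and in production this state would live in the stream processor's keyed state or a shared store such as Redis rather than process memory:

```python
import time
from collections import OrderedDict

class TtlDeduper:
    """Drops replayed events whose event_id was seen within the TTL window."""

    def __init__(self, ttl_seconds: int = 24 * 3600):
        self.ttl = ttl_seconds
        self._seen = OrderedDict()  # event_id -> first-seen timestamp, oldest first

    def is_duplicate(self, event_id: str, now: float | None = None) -> bool:
        now = time.time() if now is None else now
        # Evict expired ids from the front; OrderedDict preserves insertion order.
        while self._seen:
            oldest_id, ts = next(iter(self._seen.items()))
            if now - ts < self.ttl:
                break
            self._seen.popitem(last=False)
        if event_id in self._seen:
            return True
        self._seen[event_id] = now
        return False
```

Dedup at ingest only narrows the duplicate window; downstream writes still need idempotency keys (see the sketch after the constraints) so that replays older than the TTL stay safe.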
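And one way to express the example rule ("viewed product twice, no purchase in 24h, email-opted-in") as keyed per-customer state plus a predicate. The field names (ts, type, email_opted_in) are assumptions, and the rule is simplified to count views of any product; a real pipeline would hold CustomerState in the stream processor's keyed state with windowed eviction:

```python
import time
from dataclasses import dataclass, field

WINDOW_SECONDS = 24 * 3600  # 24h lookback from the example rule

@dataclass
class CustomerState:
    product_views: list = field(default_factory=list)  # view timestamps
    last_purchase: float | None = None
    email_opted_in: bool = False

def apply_event(state: CustomerState, event: dict) -> None:
    """Fold one event into per-customer state (event shape is assumed)."""
    ts = event["ts"]
    if event["type"] == "product_view":
        state.product_views.append(ts)
    elif event["type"] == "purchase":
        state.last_purchase = ts
    elif event["type"] == "profile_update":
        state.email_opted_in = event.get("email_opted_in", state.email_opted_in)

def in_segment(state: CustomerState, now: float | None = None) -> bool:
    """Viewed product >= 2x in 24h, no purchase in 24h, email-opted-in."""
    now = time.time() if now is None else now
    recent_views = [t for t in state.product_views if now - t <= WINDOW_SECONDS]
    no_recent_purchase = (state.last_purchase is None
                          or now - state.last_purchase > WINDOW_SECONDS)
    return len(recent_views) >= 2 and no_recent_purchase and state.email_opted_in
```

Note that the "no purchase in 24h" clause can flip a customer into the segment with no new event arriving, so membership must be re-evaluated on timers as well as on events; this is part of what makes the sub-2-minute target a stream-processing problem rather than a micro-batch one.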
Constraints
- AWS-first environment with existing S3, Snowflake, and Airflow footprint
- Incremental budget cap of $35K/month
- Must support GDPR/CCPA deletion requests within 72 hours
- Small platform team: 5 data engineers, 1 SRE
- Campaign systems require exactly-once or effectively-once segment updates to avoid duplicate sends (see the conditional-write sketch after this list)
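On the last constraint, "effectively-once" is usually achieved with an idempotency key rather than true exactly-once delivery: the segment updater claims a key for each (customer, segment, transition) with a conditional write, and only the first claimer notifies the campaign system. A sketch with boto3 against a hypothetical DynamoDB table (the table name and key scheme are assumptions):

```python
import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.client("dynamodb")
TABLE = "segment_update_dedupe"  # hypothetical table: pk = idempotency key

def publish_segment_update(customer_id: str, segment_id: str, version: int) -> bool:
    """Return True only for the first delivery of this exact transition;
    replays and retries hit the condition failure and are dropped."""
    key = f"{customer_id}#{segment_id}#{version}"
    try:
        dynamodb.put_item(
            TableName=TABLE,
            Item={"pk": {"S": key}},
            ConditionExpression="attribute_not_exists(pk)",
        )
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # already delivered; safe to drop
        raise
    # first writer wins: notify the campaign system here
    return True
```

A TTL attribute on the table keeps the dedupe set from growing without bound; the TTL just has to exceed the maximum replay horizon.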