Context
CollabHub, a document collaboration platform, currently generates notifications from application databases using periodic ETL jobs and ad hoc service callbacks. Users frequently receive duplicate notifications (for example, when multiple edits are collapsed incorrectly) or stale updates that arrive after newer ones, especially during bursty multi-user editing sessions.
You are asked to design a data pipeline that produces collaboration notifications in near real time while preserving per-document ordering, preventing duplicates, and supporting replay/backfill when downstream systems fail.
Scale Requirements
- Active users: 18M monthly active users, 1.2M daily active collaborators
- Event throughput: 120K events/sec peak, 25K events/sec average
- Event types: comments, mentions, edits, reactions, permission changes, task assignments
- Payload size: 1-3 KB JSON per event
- Latency target: notification visible to user in < 3 seconds P95
- Ordering requirement: strict ordering per document_id and thread_id
- Retention: raw event log for 30 days, notification state for 180 days
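The figures above imply a rough ingest bandwidth and raw-log footprint. The following is a back-of-envelope sketch only: it assumes decimal units (1 KB = 1,000 bytes), no compression, no replication, and that the 25K events/sec average holds for the full 30-day window. The function names are illustrative and not part of the problem statement.

```python
# Back-of-envelope sizing from the stated scale figures.
# Assumptions (not from the source): decimal units, no compression, no replication.
PEAK_EVENTS_PER_SEC = 120_000
AVG_EVENTS_PER_SEC = 25_000
RAW_RETENTION_DAYS = 30
SECONDS_PER_DAY = 86_400


def peak_bandwidth_mb_per_sec(payload_kb: float) -> float:
    """Ingest bandwidth at peak throughput for a given payload size in KB."""
    return PEAK_EVENTS_PER_SEC * payload_kb / 1_000  # KB/s -> MB/s


def raw_log_size_tb(payload_kb: float) -> float:
    """Raw event log volume over the retention window at average throughput."""
    events = AVG_EVENTS_PER_SEC * SECONDS_PER_DAY * RAW_RETENTION_DAYS
    return events * payload_kb / 1e9  # KB -> TB


if __name__ == "__main__":
    for kb in (1, 3):  # payload range given in the scale requirements
        print(kb, "KB:", peak_bandwidth_mb_per_sec(kb), "MB/s peak,",
              raw_log_size_tb(kb), "TB raw log")
```

Under these assumptions the peak ingest rate falls between 120 and 360 MB/s and the 30-day raw log between roughly 65 and 194 TB, which is worth keeping in mind when weighing storage tiers against the budget constraint.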
Requirements
- Design an ingestion and stream-processing pipeline that guarantees idempotent notification generation under retries and duplicate source events.
- Ensure per-entity ordering for collaborative updates while allowing horizontal scale across many documents.
- Support notification aggregation rules such as "N new comments" or "Alice and 3 others edited this doc" without emitting stale summaries.
- Persist both raw events and materialized notification state for replay, debugging, and backfills.
- Define data quality checks for malformed events, missing sequence numbers, and clock skew.
- Describe orchestration for schema changes, reprocessing, and downstream delivery retries.
- Include monitoring, alerting, and failure recovery for lag, out-of-order rates, duplicate rates, and delivery failures.
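As one possible starting point for the ordering, idempotency, and aggregation requirements, the sketch below shows three small pieces: routing by document_id so each document's events stay on one partition, an in-memory sequencer that drops duplicates and releases events in order, and a summary builder for the "Alice and 3 others" style. The event fields (`event_id`, `seq`, `actor`), the class names, and the assumption that producers assign a gap-free per-document sequence number starting at 1 are all hypothetical. A real design would keep the sequencer state in a durable store rather than in process memory.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    event_id: str      # hypothetical: unique id assigned by the producer
    document_id: str
    seq: int           # hypothetical: per-document sequence number, starting at 1
    actor: str
    kind: str


def partition_for(document_id: str, num_partitions: int) -> int:
    """Stable hash so every event for a document lands on the same partition."""
    digest = hashlib.sha256(document_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def notification_key(event: Event) -> str:
    """Deterministic key: a retried event upserts the same notification row."""
    return f"{event.document_id}:{event.seq}"


@dataclass
class DocumentSequencer:
    """Per-document state: drops duplicates, buffers early arrivals."""
    next_seq: int = 1
    pending: dict = field(default_factory=dict)  # seq -> Event waiting on a gap

    def accept(self, event: Event) -> list:
        """Return the events that can now be released in sequence order."""
        if event.seq < self.next_seq or event.seq in self.pending:
            return []  # already released or already buffered: a duplicate
        self.pending[event.seq] = event
        ready = []
        while self.next_seq in self.pending:
            ready.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return ready


def summarize_editors(actors_in_order: list) -> str:
    """Build a summary such as 'Alice and 3 others edited this doc'."""
    distinct = list(dict.fromkeys(actors_in_order))  # de-duplicate, keep order
    if not distinct:
        return ""
    if len(distinct) == 1:
        return f"{distinct[0]} edited this doc"
    others = len(distinct) - 1
    noun = "other" if others == 1 else "others"
    return f"{distinct[0]} and {others} {noun} edited this doc"


if __name__ == "__main__":
    sequencer = DocumentSequencer()
    arrivals = [
        Event("e2", "doc-1", 2, "Bob", "edit"),
        Event("e1", "doc-1", 1, "Alice", "edit"),
        Event("e1", "doc-1", 1, "Alice", "edit"),  # retry of the same event
        Event("e3", "doc-1", 3, "Cara", "edit"),
    ]
    released = [e for a in arrivals for e in sequencer.accept(a)]
    print([e.seq for e in released])
    print(summarize_editors([e.actor for e in released]))
```

Because the summary is computed only from events the sequencer has released, it reflects a consistent prefix of the document's history. Tagging each emitted summary with the last released `seq` would let the delivery layer discard a summary that is older than one already shown. The sketch does not handle a permanently missing sequence number; a timeout or dead-letter path for stuck gaps is left to the design.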
Constraints
- Infrastructure must stay on AWS and integrate with existing PostgreSQL, Redis, and mobile/web push services.
- Incremental platform budget is $35K/month.
- PII in notification payloads must be encrypted at rest and deleted within 7 days of account deletion requests.
- The team has strong SQL/Python skills but limited experience operating large self-managed clusters.
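Note that both retention windows (30 days for raw events, 180 days for notification state) are longer than the 7-day PII deletion window, so expiry alone cannot satisfy that constraint and an explicit deletion path is needed. A minimal sketch of a deadline check that a monitoring job could run is shown here. The shape of the `requests` mapping and the function names are assumptions for illustration, not an existing CollabHub API.

```python
from datetime import datetime, timedelta, timezone

PII_DELETION_WINDOW = timedelta(days=7)  # from the stated constraint


def deletion_deadline(requested_at: datetime) -> datetime:
    """Latest time by which PII for a deletion request must be gone."""
    return requested_at + PII_DELETION_WINDOW


def overdue_requests(requests: dict, now: datetime) -> list:
    """Return user ids whose deletion request has passed its deadline.

    `requests` maps a hypothetical user id to the timezone-aware time the
    account deletion was requested; only still-open requests are expected.
    """
    return sorted(
        user_id
        for user_id, requested_at in requests.items()
        if now > deletion_deadline(requested_at)
    )


if __name__ == "__main__":
    now = datetime(2024, 1, 10, tzinfo=timezone.utc)
    open_requests = {
        "user-a": datetime(2024, 1, 1, tzinfo=timezone.utc),
        "user-b": datetime(2024, 1, 5, tzinfo=timezone.utc),
    }
    print(overdue_requests(open_requests, now))
```

A non-empty result would be a natural trigger for an alert, since it means at least one request has already breached the window.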