Context
TaskFlow, a collaborative work-management product similar to Asana, currently stores task, comment, and project updates in PostgreSQL and publishes websocket notifications directly from the application tier. This approach works for small teams, but it breaks down for enterprise workspaces where thousands of users edit the same project concurrently and downstream systems (search, analytics, notifications, audit logs) all need consistent, real-time updates.
You need to design a real-time data pipeline that captures product changes, propagates them to online consumers, and lands them in analytics storage without relying on synchronous fan-out from the application database.
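One way to meet the decoupling goal without touching the application's write path is change data capture from the PostgreSQL write-ahead log: logical decoding reads changes the database has already committed, so it adds essentially no write overhead (relevant to the 5% cap under Constraints). A minimal sketch using psycopg2's logical replication support with the wal2json output plugin follows; the DSN, slot name, table list, and publish_to_stream hand-off are illustrative, and RDS additionally requires the rds.logical_replication parameter to be enabled.

```python
import json

import psycopg2
import psycopg2.extras

DSN = "host=taskflow-db.example.internal dbname=taskflow user=cdc_reader"  # hypothetical
SLOT = "taskflow_cdc"

conn = psycopg2.connect(
    DSN, connection_factory=psycopg2.extras.LogicalReplicationConnection
)
cur = conn.cursor()

# Create the slot once; wal2json emits each transaction as a JSON document.
try:
    cur.create_replication_slot(SLOT, output_plugin="wal2json")
except psycopg2.errors.DuplicateObject:
    pass  # slot survives restarts; reuse it

cur.start_replication(
    slot_name=SLOT,
    decode=True,
    options={"add-tables": "public.tasks,public.comments,public.projects"},
)

def on_message(msg):
    change = json.loads(msg.payload)
    publish_to_stream(change)  # hypothetical hand-off to the event stream (e.g., MSK)
    # Acknowledge promptly: unacknowledged WAL accumulates on the primary's disk.
    msg.cursor.send_feedback(flush_lsn=msg.data_start)

cur.consume_stream(on_message)  # blocks, invoking on_message per decoded WAL message
```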
Scale Requirements
- Write throughput: 120K change events/sec peak, 25K/sec average
- Concurrent active users: 8M DAU, 600K concurrent websocket connections
- Event size: 1-4 KB JSON payloads
- Latency target: P95 < 2 seconds from DB commit to client-visible update; < 60 seconds to analytics warehouse
- Storage: ~8 TB/day of raw events retained for 30 days; curated warehouse data retained for 2 years
- Availability: 99.95% for event propagation
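As a back-of-envelope consistency check on these numbers (a sketch assuming payloads near the 4 KB upper bound of the stated range):

```python
avg_rate, peak_rate = 25_000, 120_000  # events/sec, from the targets above
payload = 4 * 1024                     # bytes; upper end of the 1-4 KB range

per_day = avg_rate * payload * 86_400  # ~8.8 TB/day, in line with the 8 TB/day figure
raw_30d = per_day * 30                 # ~265 TB of raw retention to budget for
peak_bw = peak_rate * payload          # ~492 MB/s of peak ingest bandwidth
print(f"{per_day/1e12:.1f} TB/day, {raw_30d/1e12:.0f} TB/30d, {peak_bw/1e6:.0f} MB/s peak")
```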
Requirements
- Capture task, comment, membership, and project state changes with ordering guarantees per workspace (see the partition-key sketch after this list).
- Support fan-out to multiple consumers: websocket push, search indexing, notifications, audit trail, and analytics.
- Ensure idempotent processing and deduplication for retries, replays, and CDC restarts (ledger sketch below).
- Handle schema evolution for event payloads without breaking downstream consumers (tolerant-reader sketch below).
- Provide replay/backfill capability for rebuilding search indexes or warehouse tables (offset-seek sketch below).
- Implement data quality checks for malformed events, missing keys, and out-of-order updates (validation sketch below).
- Design orchestration, monitoring, and on-call recovery procedures (consumer-lag sketch below).
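Sketches for several of these requirements follow; all are Python, assume a Kafka-compatible stream (e.g., Amazon MSK), and use illustrative broker, topic, and field names rather than anything prescribed above. For per-workspace ordering, the standard move is to key every event by workspace_id: all of a workspace's events then land on one partition, which Kafka delivers in order, while separate consumer groups (websocket push, search, notifications, audit, analytics) each read the full stream independently for fan-out.

```python
import json

from kafka import KafkaProducer

# Hypothetical brokers and topic. Keying by workspace_id pins each workspace to
# one partition; Kafka guarantees ordering only within a partition.
producer = KafkaProducer(
    bootstrap_servers=["msk-broker-1:9092"],
    key_serializer=str.encode,
    value_serializer=lambda v: json.dumps(v).encode(),
    acks="all",  # acknowledge only once fully replicated
    retries=5,   # retries can duplicate sends; the consumer-side dedup below absorbs them
)

def publish(event: dict) -> None:
    producer.send("taskflow.changes", key=event["workspace_id"], value=event)
```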
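For idempotency, one workable sketch gives every event a stable event_id (e.g., derived from the WAL position) and has each consumer record processed ids in its own store, applying the side effect and the ledger insert in one transaction; redelivery then becomes a no-op. The table and helper names are hypothetical.

```python
# Assumed consumer-side ledger:
#   CREATE TABLE processed_events (
#       event_id text PRIMARY KEY,
#       processed_at timestamptz DEFAULT now()
#   );

def handle(conn, event: dict) -> None:
    with conn:  # psycopg2: commit on success, roll back on error
        with conn.cursor() as cur:
            cur.execute(
                "INSERT INTO processed_events (event_id) VALUES (%s) "
                "ON CONFLICT (event_id) DO NOTHING",
                (event["event_id"],),
            )
            if cur.rowcount == 0:
                return  # duplicate delivery (retry, replay, or CDC restart); skip
            apply_side_effect(cur, event)  # hypothetical: index update, notification, etc.
```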
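For schema evolution, a common convention is additive-only changes plus "tolerant reader" consumers: new fields are optional with defaults, unknown fields are ignored, and removals go through a deprecation window. A registry (e.g., AWS Glue Schema Registry) can enforce compatibility at publish time; the consumer side might look like this, with field names purely illustrative.

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class TaskEvent:
    event_id: str
    workspace_id: str
    task_id: str
    status: str = "unknown"  # added in a later schema version; default keeps old events readable

def parse_task_event(payload: dict[str, Any]) -> TaskEvent:
    """Tolerant reader: drop unrecognized fields, default missing optional ones."""
    known = TaskEvent.__dataclass_fields__.keys()
    return TaskEvent(**{k: v for k, v in payload.items() if k in known})
```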
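Replay falls out of the log's retention: a rebuild job seeks each partition to the first offset at or after a chosen timestamp and re-consumes, and because writes are idempotent (ledger sketch above), nothing double-applies. A sketch with kafka-python:

```python
from kafka import KafkaConsumer, TopicPartition

def replay_from(consumer: KafkaConsumer, topic: str, start_ms: int) -> None:
    """Point every partition at the first offset at/after start_ms, then re-consume."""
    parts = [TopicPartition(topic, p) for p in consumer.partitions_for_topic(topic)]
    consumer.assign(parts)
    offsets = consumer.offsets_for_times({tp: start_ms for tp in parts})
    for tp, ot in offsets.items():
        if ot is None:
            consumer.seek_to_end(tp)  # no events after start_ms on this partition
        else:
            consumer.seek(tp, ot.offset)
```

For warehouse backfills older than the 30-day log window, the same rebuild can run against the raw S3 archive instead.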
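Data quality checks can run at the consumer boundary: verify required keys, track a monotonic per-entity sequence number (e.g., derived from the WAL position) to flag stale or out-of-order updates, and route rejects to a dead-letter topic instead of blocking the partition. Field and helper names here are assumptions.

```python
REQUIRED = {"event_id", "workspace_id", "entity_type", "entity_id", "seq", "payload"}

def validate(event: dict, last_seq: dict) -> str | None:
    """Return a rejection reason, or None if the event is acceptable."""
    missing = REQUIRED - event.keys()
    if missing:
        return f"missing keys: {sorted(missing)}"
    key = (event["workspace_id"], event["entity_type"], event["entity_id"])
    if event["seq"] <= last_seq.get(key, -1):
        return "stale or out-of-order update"  # superseded by a newer seq already seen
    last_seq[key] = event["seq"]
    return None

def process(event: dict, last_seq: dict, dlq) -> None:
    reason = validate(event, last_seq)
    if reason:
        dlq.publish(event, reason)  # hypothetical dead-letter hand-off for inspection
        return
    # ... normal handling
```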
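For monitoring, the primary health signal is consumer-group lag (log end offset minus committed offset) per consumer group, alerted against the 2 s and 60 s latency targets; a minimal probe, again with illustrative names:

```python
from kafka import KafkaConsumer, TopicPartition

def group_lag(group: str, topic: str, brokers: list[str]) -> int:
    """Total unprocessed events across partitions for one consumer group."""
    consumer = KafkaConsumer(bootstrap_servers=brokers, group_id=group)
    tps = [TopicPartition(topic, p) for p in consumer.partitions_for_topic(topic)]
    end = consumer.end_offsets(tps)
    lag = sum(end[tp] - (consumer.committed(tp) or 0) for tp in tps)
    consumer.close()
    return lag
```

On-call runbooks can then hang off this one number per group: rising lag on the websocket group points at fan-out capacity, on the warehouse group at the Snowflake load path.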
Constraints
- Primary infrastructure is AWS; existing systems already use Amazon RDS PostgreSQL, EKS, S3, and Snowflake.
- The engineering team prefers managed services where possible; the incremental budget is capped at $40K/month.
- Compliance requirements include SOC 2 auditability and GDPR deletion propagation within 72 hours (erasure-topic sketch after this list).
- The application database cannot tolerate more than 5% additional write overhead from change capture.
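For the GDPR constraint, a deletion has to reach every copy of the data: live consumers, the warehouse, and the raw S3 archive. One pattern is a dedicated erasure topic keyed by user, which every consumer group treats as a delete command; on a log-compacted topic, a null-value tombstone also purges the user's keyed records from the log itself once compaction runs. Topic and broker names are illustrative.

```python
import json

from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers=["msk-broker-1:9092"], acks="all")

def request_erasure(user_id: str) -> None:
    # Every consumer group (search, notifications, warehouse, archive compactor)
    # must apply this and acknowledge within the 72-hour SLA.
    evt = {"type": "gdpr_erasure", "user_id": user_id}
    producer.send("taskflow.erasures", key=user_id.encode(), value=json.dumps(evt).encode())
    # Null value = tombstone: with log compaction enabled, the broker eventually
    # drops earlier records for this key from the erasure topic itself.
    producer.send("taskflow.erasures", key=user_id.encode(), value=None)
```

Note that the main change topic is keyed by workspace_id, not user, so its copies age out via the 30-day retention while each consumer deletes from its own store.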