Context
Asana needs a pipeline that supports analytics and operational reporting for real-time collaboration across projects, tasks, comments, status updates, and notifications. Today, core product events reach the warehouse through batch loads from application databases, introducing hours of delay and making it hard to monitor collaboration health, power near-real-time dashboards, and investigate incidents affecting millions of users.
Design a scalable pipeline for Asana that captures collaboration events from web, mobile, and backend services, processes them in near real time, and publishes trusted datasets for analytics in the warehouse while preserving a replayable raw history.
Scale Requirements
- Users: 8M monthly active users, 1.5M daily active users
- Peak throughput: 250K events/second during weekday collaboration peaks
- Average event size: 1.5KB JSON
- Daily volume: ~20TB of compressed raw data
- Latency target: P95 event-to-queryable under 3 minutes
- Retention: 180 days raw, 3 years curated aggregates
- Availability target: 99.9% for ingestion and processing
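As a sanity check on these numbers, a quick back-of-envelope calculation ties peak bandwidth, daily ingest, and raw retention together. The average-to-peak ratio of ~62% is an assumption (not stated above), chosen because it makes sustained ingest land near the ~20TB/day figure before compression:

```python
# Back-of-envelope sizing from the stated scale numbers.
PEAK_EPS = 250_000        # peak events per second (stated)
AVG_EVENT_BYTES = 1_500   # average event size (stated)
AVG_TO_PEAK = 0.62        # ASSUMED diurnal utilization (not in the spec)

# Peak ingest bandwidth the pipeline must absorb.
peak_mb_per_s = PEAK_EPS * AVG_EVENT_BYTES / 1e6          # ~375 MB/s

# Daily raw volume at the assumed average utilization, before compression.
daily_tb = PEAK_EPS * AVG_TO_PEAK * AVG_EVENT_BYTES * 86_400 / 1e12  # ~20 TB

# Raw store footprint over the 180-day retention window.
raw_retention_pb = daily_tb * 180 / 1000                  # ~3.6 PB uncompressed
```

The ~3.6PB raw-retention figure is what makes columnar compression and tiered storage (e.g. S3 lifecycle policies) a budget question, not a nicety, under the $35K/month cap.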
Requirements
- Ingest task, comment, project membership, notification, and workspace activity events from Asana web, mobile, and service backends.
- Guarantee idempotent processing for duplicate delivery and client retries using stable event IDs.
- Validate schemas, quarantine malformed records, and support backward-compatible schema evolution.
- Build both a raw append-only event store and curated warehouse tables for collaboration metrics such as active editors, task update velocity, and notification fanout.
- Support replay and backfill over a 30-day window without corrupting downstream aggregates.
- Orchestrate streaming and batch dependencies, including periodic dimension refreshes and late-arriving event correction.
- Define monitoring, alerting, and on-call recovery procedures.
Constraints
- Assume Asana is standardized on AWS, Snowflake, dbt, and Apache Airflow.
- Incremental infrastructure budget is capped at $35K/month.
- PII must be minimized in raw topics, and deletions must propagate to curated tables within 72 hours.
- The team has 5 engineers, so operational complexity should stay moderate.
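The 72-hour deletion SLA above is easiest to operate as a monitored invariant. A hedged sketch, assuming a hypothetical request record shape (`user_id`, `requested_at`, `propagated`); the actual scrubbing would run as dbt/Snowflake delete jobs, with this check alerting when any request ages past the SLA unpropagated:

```python
from datetime import datetime, timedelta, timezone

# The 72-hour propagation SLA stated in the constraints.
SLA = timedelta(hours=72)


def overdue_deletions(requests: list[dict], now: datetime) -> list[str]:
    """Return user_ids whose deletion request has exceeded the 72h SLA
    without curated tables being confirmed scrubbed.

    Record shape is an ASSUMPTION for illustration:
    {"user_id": str, "requested_at": datetime, "propagated": bool}
    """
    return [
        r["user_id"]
        for r in requests
        if not r["propagated"] and now - r["requested_at"] > SLA
    ]
```

Wiring this into the alerting layer turns a compliance obligation into an on-call page, which suits a 5-engineer team better than manual audits.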