Context
CollabCloud is a document collaboration platform with 18M monthly active users. Its analytics and operational data platform currently relies on hourly batch ETL from PostgreSQL and application logs into Snowflake. Hourly batches cannot track live editing activity, presence, conflict events, or collaboration health as the product scales to millions of concurrent users.
The data engineering team needs a pipeline that captures document edits, cursor movements, comments, and presence updates in near real time for operational dashboards, product analytics, and downstream ML features such as anomaly detection and collaboration quality scoring.
Scale Requirements
- Users: 18M MAU, 2.5M DAU, up to 1.2M concurrent collaborators
- Throughput: 250K events/sec average, 900K events/sec peak during business hours
- Event size: 1-3 KB JSON payloads
- Latency target: P95 event-to-queryable latency under 90 seconds
- Storage: ~12 TB/day raw event volume, 180-day raw retention, 3-year curated retention
- Availability: 99.95% ingestion uptime
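The scale targets above can be sanity-checked with a back-of-envelope capacity sketch. All figures below come from the stated requirements; the 1.5 KB average event size and the 10x compression ratio for curated storage are assumptions, not stated numbers. Note that the sustained rate implied by 12 TB/day is well below the 250K events/sec average, which suggests that average is measured over business hours rather than a full day.

```python
# Capacity sketch for the stated scale targets.
# Assumptions (not in the requirements): 1.5 KB mean event size,
# 10x compression for curated columnar storage.

TB = 1e12  # decimal terabyte, in bytes

raw_per_day_tb = 12                     # stated raw event volume
raw_retention_days = 180                # stated raw retention
curated_retention_days = 3 * 365        # stated 3-year curated retention
compression_ratio = 10                  # assumed

raw_footprint_tb = raw_per_day_tb * raw_retention_days
curated_footprint_tb = raw_per_day_tb / compression_ratio * curated_retention_days

# Sustained event rate implied by 12 TB/day at an assumed 1.5 KB/event
implied_eps = raw_per_day_tb * TB / 86_400 / 1_500

print(f"raw footprint:     {raw_footprint_tb:,.0f} TB")    # 2,160 TB (~2.2 PB)
print(f"curated footprint: {curated_footprint_tb:,.0f} TB")  # 1,314 TB
print(f"implied sustained: {implied_eps:,.0f} events/sec")   # 92,593
```

The ~2.2 PB raw footprint is the number most likely to dominate cost at the stated $40K/month budget, which argues for aggressive tiering of raw events in S3.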
Requirements
- Ingest collaboration events from WebSocket gateways, REST APIs, and mobile clients with ordered processing per document_id.
- Support real-time transformations for deduplication, schema validation, enrichment, and sessionization.
- Produce both operational aggregates (active docs, edit conflicts, presence counts) and analytics-ready fact tables.
- Ensure idempotent processing and replay capability for backfills or downstream recovery.
- Orchestrate batch and streaming dependencies so late events are reconciled into warehouse tables.
- Implement monitoring for lag, data quality, cost, and end-to-end freshness.
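The ordering, deduplication, and idempotency requirements above can be combined in one operator: key events by document_id and track a per-document high-water mark so replays are harmless. The sketch below is a minimal in-memory illustration, not a prescribed implementation; the `document_id`, `seq`, and `event_id` field names are assumptions about the event schema.

```python
from collections import defaultdict


class DocumentEventProcessor:
    """Sketch of per-document ordered, idempotent event processing.

    Assumes each event carries a document_id, a monotonically
    increasing per-document sequence number (seq), and a globally
    unique event_id. In a real pipeline this state would live in a
    keyed state store (e.g. a stream processor's state backend),
    not in process memory.
    """

    def __init__(self):
        self.last_seq = defaultdict(int)  # high-water mark per document
        self.applied = []                 # stand-in for a downstream sink

    def process(self, event: dict) -> bool:
        doc = event["document_id"]
        seq = event["seq"]
        # Drop duplicates and replays at or below the high-water mark,
        # so re-running the operator during a backfill is safe.
        if seq <= self.last_seq[doc]:
            return False
        self.last_seq[doc] = seq
        self.applied.append((doc, seq, event["event_id"]))
        return True


# Replayed events are dropped; distinct documents progress independently.
p = DocumentEventProcessor()
p.process({"document_id": "d1", "seq": 1, "event_id": "a"})
p.process({"document_id": "d1", "seq": 2, "event_id": "b"})
p.process({"document_id": "d1", "seq": 2, "event_id": "b"})  # duplicate, dropped
p.process({"document_id": "d2", "seq": 1, "event_id": "c"})
```

Keying state by document_id also gives the ordered-processing guarantee for free, since all events for one document land on one partition.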
Constraints
- Existing stack is AWS-centric: EKS, S3, Snowflake, Airflow
- Team has strong Python/SQL skills but limited Flink expertise
- Incremental budget cap: $40K/month
- Compliance: SOC 2 and GDPR, including deletion requests within 72 hours
- Product cannot tolerate data loss for document mutation events
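The 72-hour GDPR deletion window is hard to meet by rewriting 180 days of immutable raw files in S3. One common pattern worth considering, sketched below under assumptions (the requirements do not prescribe it), is crypto-shredding: encrypt each user's raw events with a per-user key, so honoring a deletion request only requires destroying that key. Key management and the cipher itself are elided; this models only the key lifecycle.

```python
import os
from dataclasses import dataclass, field


@dataclass
class UserKeyVault:
    """Sketch of crypto-shredding for GDPR deletion requests.

    Each user's raw events are assumed to be encrypted with a
    per-user data key. Destroying the key renders the ciphertext
    unreadable, satisfying deletion without rewriting raw files.
    In practice the keys would live in a KMS-backed store with
    audit logging, not in process memory.
    """
    _keys: dict = field(default_factory=dict)

    def key_for(self, user_id: str) -> bytes:
        # Create the user's 256-bit data key on first use.
        return self._keys.setdefault(user_id, os.urandom(32))

    def shred(self, user_id: str) -> bool:
        # Irreversibly destroy the key; returns False if already shredded.
        return self._keys.pop(user_id, None) is not None

    def can_decrypt(self, user_id: str) -> bool:
        return user_id in self._keys
```

Curated tables in Snowflake would still need row-level deletes, but those are bounded and cheap compared to rewriting the raw tier.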