Context
CollabCloud is a document collaboration platform with 18M monthly active users. Its analytics and operational data platform currently relies on hourly batch ETL from PostgreSQL and application logs into Snowflake. Hourly batches cannot track live editing activity, presence, conflict events, or collaboration health as the product scales to millions of concurrent users.
The data engineering team needs a pipeline that captures document edits, cursor movements, comments, and presence updates in near real time for operational dashboards, product analytics, and downstream ML features such as anomaly detection and collaboration quality scoring.
Scale Requirements
- Users: 18M MAU, 2.5M DAU, up to 1.2M concurrent collaborators
- Throughput: 250K events/sec average, 900K events/sec peak during business hours
- Event size: 1-3 KB JSON payloads
- Latency target: P95 event-to-queryable latency under 90 seconds
- Storage: ~12 TB/day raw event volume, 180-day raw retention, 3-year curated retention
- Availability: 99.95% ingestion uptime
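The scale targets above can be sanity-checked with a back-of-envelope capacity sketch. All figures below come from the stated requirements; the 1.5 KB average event size and the 10x compression ratio for curated storage are assumptions, not stated numbers. Note that the sustained rate implied by 12 TB/day is well below the 250K events/sec average, which suggests that average is measured over business hours rather than a full day.

```python
# Capacity sketch for the stated scale targets.
# Assumptions (not in the requirements): 1.5 KB mean event size,
# 10x compression for curated columnar storage.

TB = 1e12  # decimal terabyte, in bytes

raw_per_day_tb = 12                     # stated raw event volume
raw_retention_days = 180                # stated raw retention
curated_retention_days = 3 * 365        # stated 3-year curated retention
compression_ratio = 10                  # assumed

raw_footprint_tb = raw_per_day_tb * raw_retention_days
curated_footprint_tb = raw_per_day_tb / compression_ratio * curated_retention_days

# Sustained event rate implied by 12 TB/day at an assumed 1.5 KB/event
implied_eps = raw_per_day_tb * TB / 86_400 / 1_500

print(f"raw footprint:     {raw_footprint_tb:,.0f} TB")    # 2,160 TB (~2.2 PB)
print(f"curated footprint: {curated_footprint_tb:,.0f} TB")  # 1,314 TB
print(f"implied sustained: {implied_eps:,.0f} events/sec")   # 92,593
```

The ~2.2 PB raw footprint is the number most likely to dominate cost at the stated $40K/month budget, which argues for aggressive tiering of raw events in S3.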
Requirements
- Ingest collaboration events from WebSocket gateways, REST APIs, and mobile clients with ordered processing per document_id.
- Support real-time transformations for deduplication, schema validation, enrichment, and sessionization.
- Produce both operational aggregates (active docs, edit conflicts, presence counts) and analytics-ready fact tables.
- Ensure idempotent processing and replay capability for backfills or downstream recovery.
- Orchestrate batch and streaming dependencies so late events are reconciled into warehouse tables.
- Implement monitoring for lag, data quality, cost, and end-to-end freshness.
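The ordering, deduplication, and idempotency requirements above can be combined in one operator: key events by document_id and track a per-document high-water mark so replays are harmless. The sketch below is a minimal in-memory illustration, not a prescribed implementation; the `document_id`, `seq`, and `event_id` field names are assumptions about the event schema.

```python
from collections import defaultdict


class DocumentEventProcessor:
    """Sketch of per-document ordered, idempotent event processing.

    Assumes each event carries a document_id, a monotonically
    increasing per-document sequence number (seq), and a globally
    unique event_id. In a real pipeline this state would live in a
    keyed state store (e.g. a stream processor's state backend),
    not in process memory.
    """

    def __init__(self):
        self.last_seq = defaultdict(int)  # high-water mark per document
        self.applied = []                 # stand-in for a downstream sink

    def process(self, event: dict) -> bool:
        doc = event["document_id"]
        seq = event["seq"]
        # Drop duplicates and replays at or below the high-water mark,
        # so re-running the operator during a backfill is safe.
        if seq <= self.last_seq[doc]:
            return False
        self.last_seq[doc] = seq
        self.applied.append((doc, seq, event["event_id"]))
        return True


# Replayed events are dropped; distinct documents progress independently.
p = DocumentEventProcessor()
p.process({"document_id": "d1", "seq": 1, "event_id": "a"})
p.process({"document_id": "d1", "seq": 2, "event_id": "b"})
p.process({"document_id": "d1", "seq": 2, "event_id": "b"})  # duplicate, dropped
p.process({"document_id": "d2", "seq": 1, "event_id": "c"})
```

Keying state by document_id also gives the ordered-processing guarantee for free, since all events for one document land on one partition.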
Constraints
- Existing stack is AWS-centric: EKS, S3, Snowflake, Airflow
- Team has strong Python/SQL skills but limited Flink expertise
- Incremental budget cap: $40K/month
- Compliance: SOC 2 and GDPR, including deletion requests within 72 hours
- Product cannot tolerate data loss for document mutation events
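The 72-hour GDPR deletion window is hard to meet by rewriting 180 days of immutable raw files in S3. One common pattern worth considering, sketched below under assumptions (the requirements do not prescribe it), is crypto-shredding: encrypt each user's raw events with a per-user key, so honoring a deletion request only requires destroying that key. Key management and the cipher itself are elided; this models only the key lifecycle.

```python
import os
from dataclasses import dataclass, field


@dataclass
class UserKeyVault:
    """Sketch of crypto-shredding for GDPR deletion requests.

    Each user's raw events are assumed to be encrypted with a
    per-user data key. Destroying the key renders the ciphertext
    unreadable, satisfying deletion without rewriting raw files.
    In practice the keys would live in a KMS-backed store with
    audit logging, not in process memory.
    """
    _keys: dict = field(default_factory=dict)

    def key_for(self, user_id: str) -> bytes:
        # Create the user's 256-bit data key on first use.
        return self._keys.setdefault(user_id, os.urandom(32))

    def shred(self, user_id: str) -> bool:
        # Irreversibly destroy the key; returns False if already shredded.
        return self._keys.pop(user_id, None) is not None

    def can_decrypt(self, user_id: str) -> bool:
        return user_id in self._keys
```

Curated tables in Snowflake would still need row-level deletes, but those are bounded and cheap compared to rewriting the raw tier.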