Context
DocFlow, a browser-based document collaboration product, currently stores document edits directly in PostgreSQL and rebuilds document state with periodic batch jobs for analytics and audit. This architecture cannot support real-time downstream consumers such as live presence, audit replay, search indexing, and conflict diagnostics, and it provides limited visibility into out-of-order edits and duplicate client retries.
You need to design a data pipeline that captures edit operations in real time, resolves or records conflicts consistently, and produces both low-latency operational views and durable historical data for replay and compliance.
Scale Requirements
- Active users: 12M monthly, 900K daily active, 120K concurrent editors at peak
- Write throughput: 250K edit operations/sec peak, 40K/sec average
- Event size: 0.8-2KB per operation (insert, delete, format, cursor, comment)
- Latency target: operation ingestion to materialized document state < 300 ms P95; analytics availability < 2 minutes
- Storage: 8TB/day raw event volume; retain raw ops for 1 year, snapshots for 90 days
- Consistency target: per-document operation ordering preserved; exactly-once downstream materialization where feasible
Requirements
- Build a streaming ingestion pipeline for edit operations from web and mobile clients.
- Preserve per-document ordering while handling retries, duplicates, and late/out-of-order events.
- Design conflict handling for concurrent edits using an approach such as OT or CRDT, and explain where that logic lives in the pipeline.
- Materialize current document state for low-latency reads and persist an immutable operation log for replay.
- Publish cleaned events to downstream consumers for search, analytics, and audit.
- Define data quality checks for schema validation, sequence gaps, duplicate operation IDs, and invalid version transitions.
- Describe orchestration, backfill, replay, and recovery strategies for corrupted state or processor failure.
Constraints
- Primary cloud is AWS; existing stack includes EKS, PostgreSQL, S3, and Airflow
- Incremental budget cap: $40K/month
- Must support tenant isolation and SOC 2 auditability
- User-visible data loss is unacceptable; temporary stale reads are acceptable for < 30 seconds during recovery