You are building the synchronization pipeline for an offline-first mobile app where users can create, edit, and delete records while disconnected. The current implementation pushes full-table snapshots when connectivity returns, which causes duplicate writes, lost updates, and inconsistent state across devices. A recent escalation showed that users editing the same entity from multiple devices can see stale data for hours, and support cannot reliably reconstruct what happened. You need a pipeline that keeps local storage and the remote backend in sync with predictable conflict handling and operational visibility.
| Component | Status |
|---|---|
| Mobile local store | Alten mobile app using SQLite/Room with pending_ops table |
| Sync transport | HTTPS sync API with periodic retries |
| Backend ingestion | Node.js service writing directly to PostgreSQL |
| Change propagation | Polling endpoint returning updated_at-based deltas |
| Orchestration | Alten backend cron jobs and Redis-backed work queues |
| Analytics/debugging | Basic application logs only |
Scale: 8M MAU, 1.2M DAU, ~25 syncable records/user, 15K writes/sec peak after reconnect events, 3 devices/user at P95, target sync convergence <30s after connectivity returns, 99.95% durability for acknowledged writes.
How would you redesign this as a resilient data pipeline so local mutations, server-side state, and downstream change feeds stay consistent under retries, out-of-order delivery, and multi-device conflicts? Explain the ingestion, processing, storage, and monitoring approach you would use to guarantee idempotent replay and observable reconciliation at this scale.