You are designing the synchronization pipeline for a mobile marketplace where drivers and customers can create or update entities while offline, then reconnect hours later on unstable networks. Recent incidents showed duplicated actions, out-of-order updates, and mismatches between what the mobile app displays and what backend systems persist. Product and operations teams now want a resilient sync design that supports eventual consistency, conflict handling, auditability, and near-real-time downstream visibility for operational dashboards. The pipeline must preserve user intent while preventing duplicate writes when the client retries aggressively after reconnect.
| Component | Status / Technology |
|---|---|
| Mobile event capture | iOS/Android app with local SQLite queue |
| API layer | Sync REST endpoints behind API gateway |
| Operational store | PostgreSQL primary with read replicas |
| Async messaging | Apache Kafka used for backend domain events |
| Analytics pipeline | Airflow batch jobs loading warehouse tables hourly |
| Monitoring | Basic API latency and DB CPU dashboards |
Scale: 12M monthly active devices; 1.8M daily active devices; peak reconnect bursts of 45K requests/sec after network recovery; up to 150 queued mutations per device; payloads of 1-8 KB.
Targets: operational state visible within 2 seconds; warehouse freshness under 10 minutes.
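
As a rough sizing aid, the scale figures above can be turned into back-of-envelope bounds. This is only a sketch: it assumes one mutation per request, decimal units (1 MB = 1000 KB), and a worst case in which every daily active device reconnects with a full queue of maximum-size payloads. None of those assumptions is stated in the prompt, and the variable names are illustrative.

```python
# Figures taken directly from the scale line in the prompt.
DAILY_ACTIVE_DEVICES = 1_800_000
PEAK_REQUESTS_PER_SEC = 45_000
MAX_QUEUED_MUTATIONS_PER_DEVICE = 150
PAYLOAD_KB_MIN, PAYLOAD_KB_MAX = 1, 8

# ASSUMPTION: one mutation per request; decimal units (1 MB = 1000 KB).
peak_ingress_mb_s_min = PEAK_REQUESTS_PER_SEC * PAYLOAD_KB_MIN / 1000
peak_ingress_mb_s_max = PEAK_REQUESTS_PER_SEC * PAYLOAD_KB_MAX / 1000

# ASSUMPTION (upper bound): every daily active device holds a full queue.
worst_case_backlog_mutations = (
    DAILY_ACTIVE_DEVICES * MAX_QUEUED_MUTATIONS_PER_DEVICE
)

# ASSUMPTION: every queued payload is the maximum 8 KB; 1 TB = 1e9 KB.
worst_case_backlog_tb = worst_case_backlog_mutations * PAYLOAD_KB_MAX / 1e9

# Time to drain that backlog if the system sustained the stated peak rate.
drain_seconds = worst_case_backlog_mutations / PEAK_REQUESTS_PER_SEC

print(f"peak ingress: {peak_ingress_mb_s_min:.0f}-{peak_ingress_mb_s_max:.0f} MB/s")
print(f"worst-case backlog: {worst_case_backlog_mutations:,} mutations")
print(f"worst-case backlog size: {worst_case_backlog_tb:.2f} TB")
print(f"drain time at peak rate: {drain_seconds / 60:.0f} minutes")
```

Under these assumptions the peak ingress is 45-360 MB/s and the theoretical backlog ceiling is 270 million mutations (about 2.16 TB), which would take 100 minutes to drain at the peak rate. Real reconnect bursts are likely far smaller, so these numbers are ceilings rather than estimates.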
How would you design the end-to-end data synchronization pipeline so offline mobile mutations can be replayed safely, ordered correctly where needed, reconciled on conflict, and propagated to both operational systems and downstream analytical stores without double-processing?
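
To make the "replayed safely" and "without double-processing" requirements concrete, here is a minimal in-memory sketch of idempotent replay. It is one possible ingredient of an answer, not a full design. All names (`Mutation`, `SyncStore`, `mutation_id`, `seq`) are hypothetical, and the sketch assumes that the client attaches a unique idempotency key and a per-device sequence number to every queued mutation. Cross-device conflict reconciliation is deliberately left out.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Mutation:
    mutation_id: str  # ASSUMPTION: client-generated idempotency key
    device_id: str
    entity_id: str
    seq: int          # ASSUMPTION: per-device counter, starts at 1
    payload: dict


@dataclass
class SyncStore:
    """In-memory stand-in for the operational store (illustrative only)."""
    applied: dict = field(default_factory=dict)   # mutation_id -> outcome
    last_seq: dict = field(default_factory=dict)  # (device, entity) -> seq
    entities: dict = field(default_factory=dict)  # entity_id -> state
    outbox: list = field(default_factory=list)    # events awaiting publish

    def apply(self, m: Mutation) -> str:
        # A retried mutation is recognized by its key and never re-applied.
        if m.mutation_id in self.applied:
            return "duplicate"

        # A new mutation that is older than what this device already
        # wrote to the entity is recorded but does not change state.
        key = (m.device_id, m.entity_id)
        if m.seq <= self.last_seq.get(key, 0):
            self.applied[m.mutation_id] = "stale"
            return "stale"

        # State change, dedup record, and outbox entry happen together;
        # a real store would need to make these three writes atomic.
        self.entities.setdefault(m.entity_id, {}).update(m.payload)
        self.last_seq[key] = m.seq
        self.applied[m.mutation_id] = "applied"
        self.outbox.append(m.mutation_id)
        return "applied"


store = SyncStore()
m1 = Mutation("m-1", "dev-A", "order-7", 1, {"status": "picked_up"})
m2 = Mutation("m-2", "dev-A", "order-7", 2, {"status": "delivered"})
m3 = Mutation("m-3", "dev-A", "order-7", 1, {"status": "cancelled"})

# m2 is retried, then an older-sequence mutation arrives late.
results = [store.apply(m) for m in (m1, m2, m2, m3)]
print(results)
print(store.entities["order-7"], store.outbox)
```

In this run the retried `m-2` is reported as a duplicate, the late `m-3` is marked stale, and only `m-1` and `m-2` reach the outbox, so a downstream consumer would see each accepted mutation once. Whether stale mutations should be dropped, merged, or surfaced to the user is a product decision the prompt leaves open.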