Context
AURORA maintains large-scale mapping datasets used by routing, search, and on-road intelligence products. Today, road graph, POI, address, and telemetry-derived map updates arrive from multiple partners as bulk files and event streams, but versioning is inconsistent and downstream teams cannot reliably reproduce a prior map state.
Design a pipeline on the AURORA data platform to ingest, validate, diff, and version-control mapping data so product teams can query both the latest map and any historical snapshot.
Scale Requirements
- Sources: 25 external providers + 3 internal AURORA producers
- Volume: 8-12 TB/day raw compressed data
- Entities: ~25B road segments, 2B POIs, 4B address points; 500M change events per day
- Peak ingest: 150K change events/sec streaming, plus hourly batch drops up to 800 GB/file
- Latency: streaming updates visible in the versioned bronze layer within 2 minutes; in the curated silver/gold layers within 15 minutes
- Retention: full raw history for 2 years; reproducible map snapshots indefinitely
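The figures above imply a few derived numbers that are useful for sizing. The following back-of-envelope sketch computes them; it assumes 365-day years and decimal units (1 PB = 1000 TB), neither of which the brief specifies, and the function names are illustrative only.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def average_rate(events_per_day: int) -> float:
    """Average events per second implied by a daily event count."""
    return events_per_day / SECONDS_PER_DAY

def raw_retention_tb(tb_per_day: float, retention_days: int) -> float:
    """Raw storage accumulated over the retention window, ignoring tiering or recompression."""
    return tb_per_day * retention_days

# Inputs taken from the scale requirements.
DAILY_CHANGE_EVENTS = 500_000_000
PEAK_EVENTS_PER_SEC = 150_000
RETENTION_DAYS = 2 * 365  # assumption: 365-day years

avg_events_per_sec = average_rate(DAILY_CHANGE_EVENTS)
peak_to_average = PEAK_EVENTS_PER_SEC / avg_events_per_sec
raw_low_pb = raw_retention_tb(8, RETENTION_DAYS) / 1000   # assumption: 1 PB = 1000 TB
raw_high_pb = raw_retention_tb(12, RETENTION_DAYS) / 1000

print(f"average rate: {avg_events_per_sec:.0f} events/sec")         # 5787
print(f"peak-to-average ratio: {peak_to_average:.2f}x")             # 25.92x
print(f"raw history at 2 years: {raw_low_pb:.2f}-{raw_high_pb:.2f} PB")  # 5.84-8.76 PB
```

The roughly 26x gap between peak and average throughput suggests the streaming path should be sized for bursts rather than for the daily mean, and the multi-petabyte raw history is likely to dominate any storage cost estimate.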
Requirements
- Build a unified ingestion layer for batch files and streaming change events into AURORA Pipelines.
- Create a versioning model that supports append-only history, point-in-time reconstruction, and efficient diff generation between map releases.
- Enforce schema validation, geospatial integrity checks, deduplication, and idempotent replay.
- Support late-arriving corrections, backfills for provider reissues, and rollback to a prior published map version.
- Expose curated tables for downstream consumers: latest state, change log, release snapshot, and provider quality metrics.
- Define orchestration, monitoring, and failure recovery for both hourly and continuous workloads.
- Explain partitioning, storage format, and indexing choices for large geospatial entities.
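To make the versioning, idempotent replay, and diff requirements concrete, here is a minimal in-memory sketch of an append-only change log. It is one possible model, not a prescribed design: the `Change` record, its `event_id` and `version` fields, and the helper names are all hypothetical, and a real implementation would sit on partitioned table storage rather than Python lists.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Change:
    event_id: str            # hypothetical unique id per change event, used for deduplication
    entity_id: str           # road segment, POI, or address point identifier
    version: int             # hypothetical monotonically increasing release number
    payload: Optional[dict]  # None marks a delete (tombstone)

def append_idempotent(log: list[Change], seen: set[str], change: Change) -> bool:
    """Append a change unless its event_id was already applied; replays become no-ops."""
    if change.event_id in seen:
        return False
    seen.add(change.event_id)
    log.append(change)
    return True

def state_at(log: list[Change], version: int) -> dict[str, dict]:
    """Point-in-time reconstruction: replay the log up to and including `version`."""
    state: dict[str, dict] = {}
    # sorted() is stable, so changes within one version keep their log order.
    for change in sorted(log, key=lambda c: c.version):
        if change.version > version:
            break
        if change.payload is None:
            state.pop(change.entity_id, None)
        else:
            state[change.entity_id] = change.payload
    return state

def diff(log: list[Change], old: int, new: int) -> dict[str, list[str]]:
    """Entity ids added, removed, or changed between two versions."""
    before, after = state_at(log, old), state_at(log, new)
    return {
        "added": sorted(after.keys() - before.keys()),
        "removed": sorted(before.keys() - after.keys()),
        "changed": sorted(k for k in before.keys() & after.keys() if before[k] != after[k]),
    }

# Example: two versions, with one event replayed by mistake.
log: list[Change] = []
seen: set[str] = set()
events = [
    Change("e1", "seg-1", 1, {"speed_limit": 50}),
    Change("e2", "poi-9", 1, {"name": "Cafe"}),
    Change("e3", "seg-1", 2, {"speed_limit": 60}),
    Change("e4", "poi-9", 2, None),
    Change("e5", "addr-3", 2, {"street": "Main St"}),
    Change("e3", "seg-1", 2, {"speed_limit": 60}),  # duplicate delivery
]
applied = [append_idempotent(log, seen, e) for e in events]
release_diff = diff(log, 1, 2)
```

Because history is never rewritten, a late-arriving correction or provider reissue can be modeled as new log entries, and rollback amounts to pointing consumers at an earlier version number. Replaying the full log for every query would not scale to billions of entities, so periodic materialized snapshots are the usual complement to this pattern.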
Constraints
- Primary cloud is AWS; prefer S3, Amazon MSK, EMR/Spark, and Snowflake already used by AURORA.
- Incremental budget target: <$60K/month excluding existing Snowflake spend.
- Must support auditability for every published map release.
- Provider schemas evolve frequently; breaking changes cannot block unrelated feeds.
- Downstream routing systems require deterministic release artifacts and zero partial publishes.
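One common way to approach deterministic release artifacts and zero partial publishes is a content-hashed manifest plus a single pointer that is swapped atomically once all data files are in place. The sketch below illustrates the idea on a local filesystem using `os.replace`; the manifest layout, the `CURRENT.json` pointer name, and the function names are assumptions for illustration, and an S3-based version would need its own equivalent of the atomic pointer write.

```python
from __future__ import annotations

import hashlib
import json
import os
import tempfile

def build_manifest(version: str, files: dict[str, bytes]) -> dict:
    """List every release file with its SHA-256, sorted by path so ordering is deterministic."""
    entries = [
        {"path": path, "sha256": hashlib.sha256(files[path]).hexdigest()}
        for path in sorted(files)
    ]
    return {"version": version, "files": entries}

def manifest_digest(manifest: dict) -> str:
    """Hash a canonical JSON encoding so identical content always yields the same digest."""
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()

def publish(pointer_path: str, manifest: dict) -> str:
    """Write the pointer to a temp file, then rename it over the live pointer in one step."""
    digest = manifest_digest(manifest)
    directory = os.path.dirname(os.path.abspath(pointer_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as tmp:
        json.dump({"version": manifest["version"], "manifest_sha256": digest}, tmp)
    # Readers see either the previous pointer or the new one, never a partial write.
    os.replace(tmp_path, pointer_path)
    return digest

def read_pointer(pointer_path: str) -> dict:
    with open(pointer_path) as handle:
        return json.load(handle)

# Example: publish release r2, then roll back by republishing r1's manifest.
with tempfile.TemporaryDirectory() as workdir:
    pointer = os.path.join(workdir, "CURRENT.json")  # hypothetical pointer file name
    r1 = build_manifest("r1", {"roads.parquet": b"roads-v1", "pois.parquet": b"pois-v1"})
    r2 = build_manifest("r2", {"roads.parquet": b"roads-v2", "pois.parquet": b"pois-v1"})
    publish(pointer, r1)
    publish(pointer, r2)
    live_after_release = read_pointer(pointer)["version"]
    publish(pointer, r1)  # rollback is just another pointer swap
    live_after_rollback = read_pointer(pointer)["version"]
```

Storing the manifest digest alongside each published version also gives the audit trail a stable identifier: two releases built from identical inputs produce the same digest, and any divergence is detectable by rehashing.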