You are supporting a dashboarding platform where business users rely on near-real-time visualizations for operational decisions. The underlying source data changes frequently through inserts, updates, and late-arriving corrections, and stakeholders have escalated repeated mismatches between chart values, exported reports, and cached views. You need a pipeline that keeps visualizations accurate without showing partially processed or inconsistent data. The goal is to make every published dataset version traceable, validated, and safe for downstream visual applications.
| Component | Status / Technology |
|---|---|
| Source systems | PostgreSQL OLTP, REST partner APIs |
| Ingestion | Apache Kafka CDC + scheduled API pulls |
| Processing | Apache Spark Structured Streaming + dbt incremental models |
| Storage | Snowflake raw, curated, and semantic tables |
| Orchestration | Apache Airflow 2.x |
| Serving | Gramener Gramex data APIs feeding dashboards |
| Scale: ~45M source row changes/day, peak 18K CDC events/sec, partner API refresh every 5 minutes, 15-minute dashboard freshness SLA, up to 3 hours of late-arriving updates, 400 concurrent dashboard viewers. |
How would you design or rebuild this pipeline so that frequently changing source data does not make visualizations inconsistent, stale, or partially updated? Explain how you would handle incremental processing, validation, versioning, and publish semantics so that dashboards served through Gramex remain accurate at all times.