Build Freshness-Safe Visualization Pipeline

Scenario

You are supporting a dashboarding platform where business users rely on near-real-time visualizations for operational decisions. The underlying source data changes frequently through inserts, updates, and late-arriving corrections, and stakeholders have escalated repeated mismatches between chart values, exported reports, and cached views. You need a pipeline that keeps visualizations accurate without showing partially processed or inconsistent data. The goal is to make every published dataset version traceable, validated, and safe for downstream visual applications.

Current State

Component	Status / Technology
Source systems	PostgreSQL OLTP, REST partner APIs
Ingestion	Apache Kafka CDC + scheduled API pulls
Processing	Apache Spark Structured Streaming + dbt incremental models
Storage	Snowflake raw, curated, and semantic tables
Orchestration	Apache Airflow 2.x
Serving	Gramener Gramex data APIs feeding dashboards

Scale: ~45M source row changes/day, peak 18K CDC events/sec, partner API refresh every 5 minutes, 15-minute dashboard freshness SLA, up to 3 hours of late-arriving updates, 400 concurrent dashboard viewers.

Question

How would you design or rebuild this pipeline so that frequently changing source data does not make visualizations inconsistent, stale, or partially updated? Explain how you would handle incremental processing, validation, versioning, and publish semantics so that dashboards served through Gramex remain accurate at all times.

Scenario

Current State

Component	Status / Technology
Source systems	PostgreSQL OLTP, REST partner APIs
Ingestion	Apache Kafka CDC + scheduled API pulls
Processing	Apache Spark Structured Streaming + dbt incremental models
Storage	Snowflake raw, curated, and semantic tables
Orchestration	Apache Airflow 2.x
Serving	Gramener Gramex data APIs feeding dashboards

Scale: ~45M source row changes/day, peak 18K CDC events/sec, partner API refresh every 5 minutes, 15-minute dashboard freshness SLA, up to 3 hours of late-arriving updates, 400 concurrent dashboard viewers.

Question

Scenario

Current State

Component	Status / Technology
Source systems	PostgreSQL OLTP, REST partner APIs
Ingestion	Apache Kafka CDC + scheduled API pulls
Processing	Apache Spark Structured Streaming + dbt incremental models
Storage	Snowflake raw, curated, and semantic tables
Orchestration	Apache Airflow 2.x
Serving	Gramener Gramex data APIs feeding dashboards

Scale: ~45M source row changes/day, peak 18K CDC events/sec, partner API refresh every 5 minutes, 15-minute dashboard freshness SLA, up to 3 hours of late-arriving updates, 400 concurrent dashboard viewers.

Question

Scenario

Current State

Component	Status / Technology
Source systems	PostgreSQL OLTP, REST partner APIs
Ingestion	Apache Kafka CDC + scheduled API pulls
Processing	Apache Spark Structured Streaming + dbt incremental models
Storage	Snowflake raw, curated, and semantic tables
Orchestration	Apache Airflow 2.x
Serving	Gramener Gramex data APIs feeding dashboards

Scale: ~45M source row changes/day, peak 18K CDC events/sec, partner API refresh every 5 minutes, 15-minute dashboard freshness SLA, up to 3 hours of late-arriving updates, 400 concurrent dashboard viewers.

Interview Guides

Scenario

Current State

Question

Build Freshness-Safe Visualization Pipeline

Scenario

Current State

Question

Your Answer

Build Freshness-Safe Visualization Pipeline

Scenario

Current State

Question

Build Freshness-Safe Visualization Pipeline

Scenario

Current State

Question

Your Answer