You are responsible for the pipeline feeding executive and client-facing Power BI dashboards for a subscription product. The dashboards refresh every few minutes, but source data changes frequently because orders, refunds, CRM updates, and product usage events arrive out of order and are occasionally corrected after initial ingestion. Stakeholders have escalated repeated mismatches between dashboard totals, finance extracts, and operational reports, and an audit found no consistent way to explain which data version a visual was built from. You need to redesign the pipeline so the visual layer remains trustworthy even while upstream data is volatile.
| Component | Status / Technology |
|---|---|
| Operational sources | Azure SQL Database, Dynamics 365, application event APIs |
| Ingestion | Azure Data Factory copy jobs every 15 min + Event Hubs for usage events |
| Processing | Mixed PySpark notebooks in Azure Databricks, limited replay support |
| Storage | Azure Data Lake Storage Gen2 + Azure Synapse Analytics serving tables |
| Semantic layer | Power BI datasets with scheduled refresh |
| Orchestration | Azure Data Factory triggers and ad hoc notebook runs |
| Scale: ~180M usage events/day, 12M CRM/order mutations/day, peak 25K events/sec, 15-minute freshness target for operational visuals, daily finance reconciliation, 2 years of retained history. |
How would you redesign this pipeline so downstream visuals in Power BI remain explainable, consistent, and auditable as source records are inserted, updated, deleted, and replayed throughout the day? Describe the end-to-end pipeline approach you would use to preserve trust while still meeting freshness requirements.