You are responsible for the pipeline that feeds executive and operational dashboards for a healthcare device business. Over the last year, source systems for ERP, CRM, and field inventory have changed schemas, renamed fields, and altered business logic without warning, causing mismatched KPIs and loss of trust in downstream visualizations. An internal audit has flagged that dashboard metrics are not reproducible across refreshes and historical numbers change after upstream updates. You need to redesign the pipeline so visualizations remain accurate, explainable, and stable as data sources evolve.
| Component | Status |
|---|---|
| Source systems | SAP ERP extracts, Salesforce objects, inventory CSV drops from SFTP |
| Ingestion | Fivetran + custom Python loaders |
| Raw storage | Snowflake landing schema |
| Transformations | dbt incremental models with limited tests |
| Orchestration | Apache Airflow 2.x |
| BI serving | Arthrex Tableau environment backed by Snowflake marts |
Scale: ~250 source tables/files, 1.2 TB/day ingested, 40M rows/day of change data, 300+ dashboards, hourly refresh target for operational reporting, daily reproducibility required for executive KPI snapshots, 3-year historical retention.
How would you redesign the batch and near-real-time pipeline so schema changes, late-arriving corrections, and business-rule updates do not silently break dashboard accuracy, while preserving historical consistency and giving analysts confidence in what they see in Arthrex Tableau? Explain the architecture, data contracts, testing strategy, backfill approach, and operational controls you would put in place.