You need a pipeline that catches two classes of issues: refresh failures and unexpected data behavior. The design should detect missed or failed refreshes quickly, score time-series anomalies on key metrics, and route alerts with enough context for triage while keeping noise under control.
- Refresh job status events: started, succeeded, failed, timed out
- Freshness signals: last successful load timestamp, schedule adherence
- Data quality metrics: row counts, null rates, duplicate rates
- Business metric anomalies: KPI spikes, drops, flatlines

- Low-latency detection without alert storms
- Replay-safe processing and idempotent notifications
- Different cadences across datasets
- Need for historical baselines for anomaly detection
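To make these requirements concrete, here is a minimal Python sketch of three of the pieces: a freshness check that respects per-dataset cadence, an anomaly score computed against a historical baseline, and a deduplicating notifier keyed so that replayed events do not re-alert. Everything named here (`DatasetSchedule`, `is_stale`, `robust_z`, `alert_key`, `Notifier`) is hypothetical and chosen for illustration. The median/MAD score is one common choice, not something the design above prescribes, and the in-memory set stands in for whatever durable store a real pipeline would use.

```python
from __future__ import annotations

import hashlib
import math
import statistics
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class DatasetSchedule:
    """Hypothetical per-dataset config, so each dataset can have its own cadence."""
    name: str
    cadence: timedelta  # expected interval between successful loads
    grace: timedelta    # slack allowed before the dataset counts as stale


def is_stale(schedule: DatasetSchedule, last_success: datetime, now: datetime) -> bool:
    """Freshness check: True if the last successful load is older than cadence + grace."""
    return now - last_success > schedule.cadence + schedule.grace


def robust_z(history: list[float], value: float, min_points: int = 5) -> float:
    """Score a value against its baseline using the median and MAD.

    Returns 0.0 when there is too little history to judge. If the baseline is
    perfectly flat, any deviation scores as +/- infinity.
    """
    if len(history) < min_points:
        return 0.0
    med = statistics.median(history)
    mad = statistics.median([abs(x - med) for x in history])
    if mad == 0:
        return 0.0 if value == med else math.copysign(math.inf, value - med)
    # 0.6745 rescales MAD so the score is comparable to a standard z-score.
    return 0.6745 * (value - med) / mad


def alert_key(dataset: str, check: str, window_start: datetime) -> str:
    """Deterministic key: the same dataset, check, and window always hash the same."""
    raw = f"{dataset}|{check}|{window_start.isoformat()}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


class Notifier:
    """Illustrative in-memory dedup; assumes a durable key store in practice."""

    def __init__(self) -> None:
        self._sent: set[str] = set()
        self.outbox: list[str] = []

    def notify(self, key: str, message: str) -> bool:
        """Record the message once per key; return False for a replayed key."""
        if key in self._sent:
            return False
        self._sent.add(key)
        self.outbox.append(message)
        return True
```

A brief usage sketch, with made-up values: a dataset on an hourly cadence is checked for staleness, a row count is scored against recent history, and the alert is sent through the notifier so that reprocessing the same window is a no-op.

```python
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
orders = DatasetSchedule("orders", timedelta(hours=1), timedelta(minutes=15))
notifier = Notifier()

if is_stale(orders, last_success=now - timedelta(hours=2), now=now):
    key = alert_key(orders.name, "freshness", now.replace(minute=0))
    notifier.notify(key, "orders: no successful load within cadence + grace")

score = robust_z([100.0, 102.0, 98.0, 101.0, 99.0, 100.0], 150.0)
if abs(score) > 3.5:  # threshold is an assumption; tune per metric
    key = alert_key(orders.name, "row_count", now.replace(minute=0))
    notifier.notify(key, f"orders: row count anomaly, score={score:.1f}")
```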