Context
InsightMart, a mid-sized retail analytics company, serves executive dashboards in Tableau backed by Snowflake. Today, sales, inventory, and customer data are loaded through ad hoc Python scripts and daily SQL jobs, causing inconsistent metrics across dashboards and frequent trust issues when finance and operations compare numbers.
Your task is to design a production-grade batch data pipeline that ensures the data visualized in BI dashboards is accurate, reliable, and auditable before it reaches analysts and business users.
Scale Requirements
- Sources: PostgreSQL OLTP (orders, customers), Shopify API, CSV files from 120 stores via S3
- Volume: 250M order records, 40M customer records, 15GB new raw data/day
- Load cadence: Hourly for orders and inventory, daily for customer and store reference data
- Latency target: Source-to-dashboard freshness under 90 minutes for hourly datasets
- Retention: 2 years raw data, 5 years curated aggregates
- Availability target: 99.9% successful scheduled pipeline runs per month
Requirements
- Ingest data from PostgreSQL, Shopify API, and S3 CSV drops into a centralized raw layer.
- Validate schema, null rates, duplicate keys, referential integrity, and freshness before publishing dashboard tables.
- Build standardized transformed models for facts and dimensions used by Tableau dashboards.
- Support idempotent reruns, backfills for missed partitions, and auditability of every load.
- Quarantine bad records without blocking all downstream processing unless critical quality thresholds fail.
- Expose data quality status and pipeline health to engineers and analytics stakeholders.
- Ensure metric consistency so dashboards use certified tables rather than source-specific extracts.
Constraints
- Existing stack is AWS + Snowflake; avoid introducing heavy operational overhead.
- Team has 3 data engineers and 1 analytics engineer.
- Incremental cloud/tooling budget is capped at $18K/month.
- Finance dashboards are SOX-sensitive, so lineage and reproducibility are required.
- Historical source corrections may arrive up to 7 days late and must be reconciled cleanly.