Context
Northstar Retail’s operations analysts currently use Microsoft Excel to inspect daily CSV exports from ERP, CRM, and warehouse systems, then build manual quality reports with pivots, formulas, and conditional formatting. The process works for ad hoc analysis, but it is error-prone, takes 3-4 analyst hours per day, and cannot scale as source volumes grow.
You are asked to design a production data pipeline that preserves Excel-based reporting for business users while automating ingestion, validation, transformation, and report delivery.
Scale Requirements
- Sources: 3 operational systems exporting CSV/XLSX files to SharePoint and SFTP
- Volume: 8 million rows/day total, ~12 GB raw/day
- Batch cadence: Hourly ingestion, daily executive quality report by 7:00 AM local time
- Latency target: New source files available in reporting tables within 15 minutes
- Retention: 2 years raw files, 5 years aggregated quality metrics
- Users: 40 analysts consuming Excel workbooks and Power BI dashboards
Requirements
- Build a batch ETL pipeline that ingests CSV/XLSX files from SharePoint and SFTP into a centralized data platform.
- Standardize schemas across sources, including date parsing, column normalization, and type enforcement.
- Implement data quality checks for nulls, duplicates, invalid product codes, negative quantities, and row-count anomalies.
- Store raw and curated datasets separately and maintain end-to-end auditability from any report output back to its originating source file.
- Generate analyst-friendly outputs consumable in Microsoft Excel, including a daily quality summary and exception-level detail extract.
- Orchestrate hourly loads and daily report refresh with dependency management and retry logic.
- Design monitoring for pipeline failures, SLA misses, and quality threshold breaches.
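The schema-standardization requirement can be sketched with pandas. The canonical column names, source aliases, and types below are illustrative assumptions, not the real source-to-target mapping, which would come from profiling the three systems' exports.

```python
import pandas as pd

# Hypothetical canonical schema for a curated order feed; real names and
# types would come from a source-to-target mapping agreed with the analysts.
CANONICAL = {
    "order_id": "string",
    "product_code": "string",
    "quantity": "Int64",      # nullable integer, survives coerced bad values
    "unit_price": "Float64",
}

# Map each source system's header variants onto the canonical names.
COLUMN_ALIASES = {
    "OrderID": "order_id",
    "ProdCode": "product_code",
    "Qty": "quantity",
    "UnitPrice": "unit_price",
}

def standardize(df: pd.DataFrame) -> pd.DataFrame:
    """Normalize column names, parse dates, and enforce canonical types."""
    # Normalize headers: strip whitespace, apply aliases, snake_case the rest.
    df = df.rename(columns=lambda c: c.strip()).rename(columns=COLUMN_ALIASES)
    df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
    # Parse the order date; unparseable values become NaT for quality checks.
    df["order_date"] = pd.to_datetime(df.get("order_date"), errors="coerce")
    # Enforce types column by column, coercing bad values to NA rather than
    # failing the whole file.
    for col, dtype in CANONICAL.items():
        if col in df.columns:
            if dtype in ("Int64", "Float64"):
                df[col] = pd.to_numeric(df[col], errors="coerce").astype(dtype)
            else:
                df[col] = df[col].astype(dtype)
    return df
```

Coercing rather than raising keeps a single malformed cell from blocking the hourly load; the coerced NAs then surface in the quality checks instead.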
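The listed quality rules might be expressed as a small rule set returning per-rule violation counts; the row-count tolerance and required-column list below are placeholder assumptions to be tuned with the analysts.

```python
import pandas as pd

# Hypothetical thresholds; actual values would be tuned with the analysts.
ROW_COUNT_TOLERANCE = 0.5   # flag if today's rows deviate >50% from trailing mean
REQUIRED_COLUMNS = ["order_id", "product_code", "quantity"]

def run_quality_checks(df: pd.DataFrame, valid_codes: set,
                       trailing_mean_rows: float) -> dict:
    """Return per-rule violation counts plus a row-count anomaly flag."""
    results = {
        # Rows missing any required field.
        "null_required": int(df[REQUIRED_COLUMNS].isna().any(axis=1).sum()),
        # Repeated business keys.
        "duplicate_order_id": int(df.duplicated(subset="order_id").sum()),
        # Codes absent from the reference product list.
        "invalid_product_code": int((~df["product_code"].isin(valid_codes)).sum()),
        # Negative quantities are never valid in these feeds.
        "negative_quantity": int((df["quantity"] < 0).sum()),
    }
    # Row-count anomaly: compare today's volume to a trailing daily mean.
    deviation = abs(len(df) - trailing_mean_rows) / max(trailing_mean_rows, 1)
    results["row_count_anomaly"] = deviation > ROW_COUNT_TOLERANCE
    return results
```

Keeping the rules data-driven like this makes it easy to emit both the daily summary (counts per rule) and the exception-level extract (the offending rows) from the same pass.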
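For the Excel deliverable, a minimal sketch (assuming `openpyxl` is installed alongside pandas) writes the daily quality summary and the exception-level detail as two sheets of one workbook:

```python
import pandas as pd

def write_quality_workbook(summary: pd.DataFrame, exceptions: pd.DataFrame,
                           path: str) -> None:
    """Write a two-sheet workbook: daily summary plus exception detail."""
    # engine="openpyxl" is an assumption; it is pandas' default xlsx writer
    # when openpyxl is installed.
    with pd.ExcelWriter(path, engine="openpyxl") as writer:
        summary.to_excel(writer, sheet_name="Quality Summary", index=False)
        exceptions.to_excel(writer, sheet_name="Exceptions", index=False)
```

Conditional formatting and pivot-ready layouts can be layered on via openpyxl after the data lands, so analysts keep their familiar Excel workflow without hand-building the report.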
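Retry logic for the hourly loads could follow a simple exponential-backoff pattern. In practice a managed orchestrator (e.g. Azure Data Factory) provides retries and dependency management natively; the sketch below, with an assumed policy of 3 attempts and a 30-second base delay, only illustrates the behavior:

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def with_retries(task: Callable[[], T], attempts: int = 3,
                 base_delay: float = 30.0) -> T:
    """Run a pipeline task, retrying transient failures with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return task()
        except Exception:
            if attempt == attempts:
                raise          # exhausted: surface the failure to monitoring
            # Back off 30s, 60s, 120s, ... before the next attempt.
            time.sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError("unreachable")
```

The final raise-on-exhaustion matters for the monitoring requirement: a swallowed failure would silently miss the 7:00 AM report SLA.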
Constraints
- Existing stack is Azure-based; prefer managed services over self-hosted infrastructure.
- Budget is limited to a small data engineering team and moderate cloud spend.
- Business users must continue using Excel as a primary interface.
- PII exists in customer files; access must be role-based and auditable.
- Some source files arrive late or are re-sent with corrected records, so the pipeline must support idempotent reprocessing and backfills.
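One way to satisfy the idempotent-reprocessing constraint is to derive the curated output path deterministically from the file's identity, so a corrected resend overwrites the earlier load instead of appending duplicates. The file-naming convention and lineage columns below are assumptions for illustration; production would write Parquet rather than CSV.

```python
import hashlib
from pathlib import Path

import pandas as pd

def load_idempotently(source_file: Path, curated_dir: Path) -> Path:
    """Load one source file into the curated zone; safe to re-run.

    The output path is derived from the file's identity (assumed naming
    convention: system + business date, e.g. erp_orders_2024-05-01.csv),
    so reprocessing a corrected resend replaces the earlier load."""
    df = pd.read_csv(source_file)
    # ... schema standardization and quality checks would run here ...
    # Lineage columns let any curated row be audited back to its raw file.
    df["_source_file"] = source_file.name
    df["_source_hash"] = hashlib.sha256(source_file.read_bytes()).hexdigest()[:12]
    # Full overwrite of the file's partition makes re-runs and backfills safe.
    out = curated_dir / f"{source_file.stem}.curated.csv"
    df.to_csv(out, index=False)
    return out
```

Because the write is a full overwrite keyed to the source file, backfilling a late or corrected file is the same operation as a normal load, with no special dedup path.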