Context
Northstar Retail’s operations analysts currently use Microsoft Excel to inspect daily CSV exports from ERP, CRM, and warehouse systems, then build manual quality reports with pivots, formulas, and conditional formatting. The process works for ad hoc analysis, but it is error-prone, takes 3-4 analyst hours per day, and cannot scale as source volumes grow.
You are asked to design a production data pipeline that preserves Excel-based reporting for business users while automating ingestion, validation, transformation, and report delivery.
Scale Requirements
- Sources: 3 operational systems exporting CSV/XLSX files to SharePoint and SFTP
- Volume: 8 million rows/day total, ~12 GB raw/day
- Batch cadence: Hourly ingestion, daily executive quality report by 7:00 AM local time
- Latency target: New source files available in reporting tables within 15 minutes
- Retention: 2 years raw files, 5 years aggregated quality metrics
- Users: 40 analysts consuming Excel workbooks and Power BI dashboards
Requirements
- Build a batch ETL pipeline that ingests CSV/XLSX files from SharePoint and SFTP into a centralized data platform.
- Standardize schemas across sources, including date parsing, column normalization, and type enforcement.
- Implement data quality checks for nulls, duplicates, invalid product codes, negative quantities, and row-count anomalies.
- Store raw and curated datasets separately and maintain end-to-end auditability from any report output back to its originating source file.
- Generate analyst-friendly outputs consumable in Microsoft Excel, including a daily quality summary and exception-level detail extract.
- Orchestrate hourly loads and daily report refresh with dependency management and retry logic.
- Design monitoring for pipeline failures, SLA misses, and quality threshold breaches.
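The schema-standardization requirement can be sketched with pandas. The canonical column names, source aliases, and types below are illustrative assumptions, not the real source-to-target mapping, which would come from profiling the three systems' exports.

```python
import pandas as pd

# Hypothetical canonical schema for a curated order feed; real names and
# types would come from a source-to-target mapping agreed with the analysts.
CANONICAL = {
    "order_id": "string",
    "product_code": "string",
    "quantity": "Int64",      # nullable integer, survives coerced bad values
    "unit_price": "Float64",
}

# Map each source system's header variants onto the canonical names.
COLUMN_ALIASES = {
    "OrderID": "order_id",
    "ProdCode": "product_code",
    "Qty": "quantity",
    "UnitPrice": "unit_price",
}

def standardize(df: pd.DataFrame) -> pd.DataFrame:
    """Normalize column names, parse dates, and enforce canonical types."""
    # Normalize headers: strip whitespace, apply aliases, snake_case the rest.
    df = df.rename(columns=lambda c: c.strip()).rename(columns=COLUMN_ALIASES)
    df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
    # Parse the order date; unparseable values become NaT for quality checks.
    df["order_date"] = pd.to_datetime(df.get("order_date"), errors="coerce")
    # Enforce types column by column, coercing bad values to NA rather than
    # failing the whole file.
    for col, dtype in CANONICAL.items():
        if col in df.columns:
            if dtype in ("Int64", "Float64"):
                df[col] = pd.to_numeric(df[col], errors="coerce").astype(dtype)
            else:
                df[col] = df[col].astype(dtype)
    return df
```

Coercing rather than raising keeps a single malformed cell from blocking the hourly load; the coerced NAs then surface in the quality checks instead.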
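The listed quality rules might be expressed as a small rule set returning per-rule violation counts; the row-count tolerance and required-column list below are placeholder assumptions to be tuned with the analysts.

```python
import pandas as pd

# Hypothetical thresholds; actual values would be tuned with the analysts.
ROW_COUNT_TOLERANCE = 0.5   # flag if today's rows deviate >50% from trailing mean
REQUIRED_COLUMNS = ["order_id", "product_code", "quantity"]

def run_quality_checks(df: pd.DataFrame, valid_codes: set,
                       trailing_mean_rows: float) -> dict:
    """Return per-rule violation counts plus a row-count anomaly flag."""
    results = {
        # Rows missing any required field.
        "null_required": int(df[REQUIRED_COLUMNS].isna().any(axis=1).sum()),
        # Repeated business keys.
        "duplicate_order_id": int(df.duplicated(subset="order_id").sum()),
        # Codes absent from the reference product list.
        "invalid_product_code": int((~df["product_code"].isin(valid_codes)).sum()),
        # Negative quantities are never valid in these feeds.
        "negative_quantity": int((df["quantity"] < 0).sum()),
    }
    # Row-count anomaly: compare today's volume to a trailing daily mean.
    deviation = abs(len(df) - trailing_mean_rows) / max(trailing_mean_rows, 1)
    results["row_count_anomaly"] = deviation > ROW_COUNT_TOLERANCE
    return results
```

Keeping the rules data-driven like this makes it easy to emit both the daily summary (counts per rule) and the exception-level extract (the offending rows) from the same pass.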
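For the Excel deliverable, a minimal sketch (assuming `openpyxl` is installed alongside pandas) writes the daily quality summary and the exception-level detail as two sheets of one workbook:

```python
import pandas as pd

def write_quality_workbook(summary: pd.DataFrame, exceptions: pd.DataFrame,
                           path: str) -> None:
    """Write a two-sheet workbook: daily summary plus exception detail."""
    # engine="openpyxl" is an assumption; it is pandas' default xlsx writer
    # when openpyxl is installed.
    with pd.ExcelWriter(path, engine="openpyxl") as writer:
        summary.to_excel(writer, sheet_name="Quality Summary", index=False)
        exceptions.to_excel(writer, sheet_name="Exceptions", index=False)
```

Conditional formatting and pivot-ready layouts can be layered on via openpyxl after the data lands, so analysts keep their familiar Excel workflow without hand-building the report.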
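Retry logic for the hourly loads could follow a simple exponential-backoff pattern. In practice a managed orchestrator (e.g. Azure Data Factory) provides retries and dependency management natively; the sketch below, with an assumed policy of 3 attempts and a 30-second base delay, only illustrates the behavior:

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def with_retries(task: Callable[[], T], attempts: int = 3,
                 base_delay: float = 30.0) -> T:
    """Run a pipeline task, retrying transient failures with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return task()
        except Exception:
            if attempt == attempts:
                raise          # exhausted: surface the failure to monitoring
            # Back off 30s, 60s, 120s, ... before the next attempt.
            time.sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError("unreachable")
```

The final raise-on-exhaustion matters for the monitoring requirement: a swallowed failure would silently miss the 7:00 AM report SLA.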
Constraints
- Existing stack is Azure-based; prefer managed services over self-hosted infrastructure.
- Budget is limited to a small data engineering team and moderate cloud spend.
- Business users must continue using Excel as a primary interface.
- PII exists in customer files; access must be role-based and auditable.
- Some source files arrive late or are re-sent with corrected records, so the pipeline must support idempotent reprocessing and backfills.
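One way to satisfy the idempotent-reprocessing constraint is to derive the curated output path deterministically from the file's identity, so a corrected resend overwrites the earlier load instead of appending duplicates. The file-naming convention and lineage columns below are assumptions for illustration; production would write Parquet rather than CSV.

```python
import hashlib
from pathlib import Path

import pandas as pd

def load_idempotently(source_file: Path, curated_dir: Path) -> Path:
    """Load one source file into the curated zone; safe to re-run.

    The output path is derived from the file's identity (assumed naming
    convention: system + business date, e.g. erp_orders_2024-05-01.csv),
    so reprocessing a corrected resend replaces the earlier load."""
    df = pd.read_csv(source_file)
    # ... schema standardization and quality checks would run here ...
    # Lineage columns let any curated row be audited back to its raw file.
    df["_source_file"] = source_file.name
    df["_source_hash"] = hashlib.sha256(source_file.read_bytes()).hexdigest()[:12]
    # Full overwrite of the file's partition makes re-runs and backfills safe.
    out = curated_dir / f"{source_file.stem}.curated.csv"
    df.to_csv(out, index=False)
    return out
```

Because the write is a full overwrite keyed to the source file, backfilling a late or corrected file is the same operation as a normal load, with no special dedup path.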