Context
Northstar Retail receives customer and order data from its e-commerce app, CRM, and payment platform. Today, analysts manually clean CSV extracts before loading them into Snowflake, a process that produces inconsistent metrics, duplicate records, and one-day reporting delays.
You need to design a production-grade batch pipeline that standardizes, validates, and prepares this data for downstream analysis. The company wants a repeatable process that improves data quality while remaining simple enough for a small data team to operate.
Scale Requirements
- Sources: Shopify exports, Salesforce CRM dumps, Stripe transaction files, and internal PostgreSQL tables
- Volume: ~120 GB/day raw data, ~45 million records/day across all sources
- Batch frequency: Hourly ingestion, with support for daily backfills covering up to 2 years of history
- Latency target: Data available in analytics tables within 30 minutes of file arrival
- Retention: Raw data for 1 year, curated tables for 5 years
- Data quality target: <0.5% invalid records in curated tables
Requirements
- Design an ingestion and cleaning pipeline for structured CSV/JSON data from multiple systems.
- Standardize schemas, data types, timestamps, currencies, and null handling across sources (see the standardization sketch after this list).
- Detect and handle duplicates, malformed rows, missing required fields, and referential integrity issues.
- Create raw, cleaned, and analytics-ready layers in Snowflake.
- Support idempotent reruns and historical backfills without creating duplicate outputs (see the MERGE sketch after this list).
- Orchestrate hourly loads with dependency management and quality checks.
- Provide monitoring for freshness, row-count anomalies, and failed validations (see the checks sketch after this list).
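One way to read the standardization requirement is a single set of per-row rules applied to every source before anything reaches Snowflake. A minimal sketch in Python, assuming illustrative column names, null markers, and a static currency table; none of these reflect Northstar's actual schemas or rates:

```python
# Minimal standardization sketch; NULL_TOKENS, USD_RATES, and the order
# column names are illustrative assumptions, not Northstar's real schema.
from datetime import datetime, timezone
from decimal import Decimal

NULL_TOKENS = {"", "NULL", "null", "N/A", "n/a", "-"}        # assumed null markers
USD_RATES = {"USD": Decimal("1.0"), "EUR": Decimal("1.08")}  # illustrative static rates

def normalize_nulls(value):
    """Map the assorted null markers used by the sources to a real None."""
    return None if value is None or str(value).strip() in NULL_TOKENS else value

def to_utc(ts_string, fmt="%Y-%m-%dT%H:%M:%S%z"):
    """Parse a source timestamp (offset required by the format) and convert to UTC."""
    return datetime.strptime(ts_string, fmt).astimezone(timezone.utc)

def to_usd(amount, currency):
    """Convert an amount to USD using the illustrative static rate table."""
    return Decimal(str(amount)) * USD_RATES[currency.upper()]

def standardize_order(raw: dict) -> dict:
    """Apply one consistent set of typing, timestamp, currency, and null rules
    to an order row regardless of which source produced it."""
    row = {k.lower().strip(): normalize_nulls(v) for k, v in raw.items()}
    return {
        "order_id": str(row["order_id"]),                       # required field; KeyError surfaces as a reject
        "ordered_at": to_utc(row["ordered_at"]),
        "amount_usd": to_usd(row["amount"], row.get("currency") or "USD"),
        "customer_email": (row.get("email") or "").lower() or None,
    }
```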
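For the idempotency requirement, one common pattern is a key-based MERGE from a staging table into the curated table, so a rerun or backfill overwrites existing rows instead of duplicating them. A minimal sketch using the Snowflake Python connector; the table and column names (STAGING.ORDERS_CLEAN, CURATED.ORDERS) are hypothetical:

```python
# Idempotent upsert sketch: rerunning the same batch updates matched keys
# rather than inserting them again. Table and column names are assumptions.
import snowflake.connector

MERGE_SQL = """
MERGE INTO CURATED.ORDERS AS tgt
USING STAGING.ORDERS_CLEAN AS src
  ON tgt.ORDER_ID = src.ORDER_ID
WHEN MATCHED THEN UPDATE SET
  tgt.ORDERED_AT  = src.ORDERED_AT,
  tgt.AMOUNT_USD  = src.AMOUNT_USD,
  tgt.SOURCE_FILE = src.SOURCE_FILE,
  tgt.LOADED_AT   = CURRENT_TIMESTAMP()
WHEN NOT MATCHED THEN INSERT
  (ORDER_ID, ORDERED_AT, AMOUNT_USD, SOURCE_FILE, LOADED_AT)
VALUES
  (src.ORDER_ID, src.ORDERED_AT, src.AMOUNT_USD, src.SOURCE_FILE, CURRENT_TIMESTAMP())
"""

def run_idempotent_load(conn_params: dict) -> None:
    """Load the staged batch into the curated layer; safe to rerun."""
    conn = snowflake.connector.connect(**conn_params)
    try:
        conn.cursor().execute(MERGE_SQL)
    finally:
        conn.close()
```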
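The monitoring requirement could be expressed as per-batch checks against recorded load metrics. A minimal sketch with illustrative thresholds: the 90-minute staleness window and the ±50% row-count tolerance are assumptions to be tuned, and only the 0.5% invalid-record ratio comes from the scale requirements above:

```python
# Per-batch quality and freshness checks; thresholds other than the 0.5%
# invalid-record target are illustrative defaults.
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass
class BatchMetrics:
    source: str
    loaded_at: datetime      # when the batch landed in the curated layer
    row_count: int           # rows loaded in this batch
    invalid_count: int       # rows routed to the reject/quarantine path

def check_freshness(m: BatchMetrics, max_age: timedelta = timedelta(minutes=90)) -> bool:
    """Flag a source whose latest batch is older than the allowed staleness window."""
    return datetime.now(timezone.utc) - m.loaded_at <= max_age

def check_invalid_ratio(m: BatchMetrics, max_ratio: float = 0.005) -> bool:
    """Enforce the <0.5% invalid-record target per batch."""
    return m.row_count == 0 or m.invalid_count / m.row_count <= max_ratio

def check_row_count(m: BatchMetrics, trailing_avg: float, tolerance: float = 0.5) -> bool:
    """Flag batches deviating more than +/-50% from the trailing hourly average."""
    return abs(m.row_count - trailing_avg) <= tolerance * trailing_avg
```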
Constraints
- Team: 3 data engineers, no dedicated platform team
- Infrastructure: AWS already in use; Snowflake is the analytics warehouse
- Budget: Prefer managed services over large always-on clusters
- Compliance: PII includes email, phone, and billing address; access must be auditable
- Operational requirement: Analysts must be able to trace every curated record back to its raw source file (see the lineage sketch below).
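For the traceability requirement, one approach is to stamp every raw row with the object key of the file it came from at ingest time and carry that column through each layer to the curated tables. A minimal sketch assuming raw files land in S3; the bucket layout and the `_source_file` column name are illustrative:

```python
# Lineage sketch: every row keeps a pointer to the exact S3 object it came
# from, so curated records can be traced back to their raw source file.
import csv
import io

import boto3

def read_raw_with_lineage(bucket: str, key: str):
    """Yield each raw CSV row tagged with its originating S3 object key."""
    body = boto3.client("s3").get_object(Bucket=bucket, Key=key)["Body"].read()
    for row in csv.DictReader(io.StringIO(body.decode("utf-8"))):
        row["_source_file"] = f"s3://{bucket}/{key}"  # carried through every layer
        yield row
```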