Context
Northstar Retail, a mid-sized e-commerce company, loads order, payment, and refund data from PostgreSQL, Stripe, and 3PL-partner CSV files into Snowflake for finance and operations reporting. The current pipeline is a set of ad hoc Python cron jobs, and teams regularly find row-count mismatches, duplicate orders, and delayed backfills.
You need to design a production-grade batch data pipeline that improves data accuracy and reliability while keeping the stack simple for a small data team.
Scale Requirements
- Sources: PostgreSQL OLTP (~150 tables), Stripe API, daily CSV drops from S3
- Volume: 25M order records/year, 8M payment events/year, 2-5 GB of new raw data/day
- Freshness: Core finance tables available in Snowflake within 30 minutes of source updates
- Batch cadence: Every 15 minutes for database/API loads; daily for 3PL files
- Retention: Raw data for 1 year, curated finance marts for 7 years
Requirements
- Design ingestion from PostgreSQL (CDC or incremental extracts), the Stripe API, and S3 CSV drops into a raw layer (an incremental-extract sketch follows this list).
- Ensure accuracy and reliability through schema validation, deduplication, idempotent loads, reconciliation checks, and audit logging (see the load and reconciliation sketches below).
- Build transformations for canonical tables such as orders, payments, refunds, and order_financials.
- Support backfills for a specific date range without creating duplicates or corrupting downstream tables (see the backfill sketch below).
- Define orchestration, dependency management, and recovery steps for partial failures (see the DAG sketch below).
- Expose data quality status and pipeline health to analysts and on-call engineers.
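
The sketches below illustrate one possible shape for each requirement; every table, column, stage, and service name is an assumption for illustration, not part of the brief. First, a minimal watermark-based incremental extract from PostgreSQL, assuming source tables carry an updated_at column and psycopg2 as the client. A real design might use CDC (e.g. AWS DMS or Debezium) for high-churn tables, which also helps keep read load off the OLTP database during peak hours.

```python
"""Sketch: watermark-based incremental extract from PostgreSQL.

Assumes source tables have an updated_at column and that the last
successfully loaded watermark is tracked per table elsewhere (e.g. a
small control table). All names are illustrative.
"""
from datetime import datetime, timezone

import psycopg2
import psycopg2.extras


def extract_incremental(conn, table: str, last_watermark: datetime):
    """Pull only rows touched since the previous run, capped at 'now'."""
    high_watermark = datetime.now(timezone.utc)
    with conn.cursor(cursor_factory=psycopg2.extras.RealDictCursor) as cur:
        # Table name comes from a fixed allow-list in practice, never user input.
        cur.execute(
            f"SELECT * FROM {table} "
            "WHERE updated_at > %s AND updated_at <= %s",
            (last_watermark, high_watermark),
        )
        rows = cur.fetchall()
    return rows, high_watermark
```

The extracted rows would be written to S3 as the raw layer, and the new watermark recorded only after that write succeeds, so a failed run is simply re-extracted on the next attempt.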
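For idempotent, de-duplicated loads, one common pattern is to land each batch in a staging table and MERGE into the target on the business key, so re-running the same batch converges to the same state instead of duplicating rows. A minimal sketch using snowflake-connector-python; database, schema, stage, and column names are assumptions.

```python
"""Sketch: idempotent load into Snowflake via staging table + MERGE.

Re-running a batch cannot duplicate orders because the MERGE matches on
the business key and keeps only the newest staged version of each row.
Object names are illustrative.
"""
import snowflake.connector

MERGE_SQL = """
MERGE INTO RAW.ORDERS AS tgt
USING (
    -- keep only the latest version of each order within the staged batch
    SELECT * FROM RAW.ORDERS_STAGE
    QUALIFY ROW_NUMBER() OVER (
        PARTITION BY order_id ORDER BY updated_at DESC
    ) = 1
) AS src
ON tgt.order_id = src.order_id
WHEN MATCHED AND src.updated_at > tgt.updated_at THEN UPDATE SET
    status = src.status,
    total_amount = src.total_amount,
    updated_at = src.updated_at
WHEN NOT MATCHED THEN INSERT (order_id, status, total_amount, updated_at)
    VALUES (src.order_id, src.status, src.total_amount, src.updated_at)
"""


def load_batch(conn, staged_file: str) -> None:
    """COPY one staged file into the staging table, then MERGE to the target."""
    cur = conn.cursor()
    try:
        cur.execute("BEGIN")
        cur.execute("TRUNCATE TABLE RAW.ORDERS_STAGE")
        # File-format options omitted; @raw_stage is an assumed named stage.
        cur.execute(f"COPY INTO RAW.ORDERS_STAGE FROM @raw_stage/{staged_file}")
        cur.execute(MERGE_SQL)
        cur.execute("COMMIT")
    except Exception:
        cur.execute("ROLLBACK")
        raise
    finally:
        cur.close()
```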
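Reconciliation can be as simple as comparing row counts and amount sums between source and warehouse for each load window and writing the result to an audit table; that same table doubles as the data quality status surface analysts and on-call engineers can query. A sketch with illustrative table names (OPS.LOAD_AUDIT is an assumed audit table).

```python
"""Sketch: source-vs-warehouse reconciliation for one load window.

Counts and sums are compared for the window just loaded, and the outcome
is recorded in an audit table so pipeline health stays queryable.
Names are illustrative.
"""
from datetime import datetime


def reconcile_orders(pg_conn, sf_conn, window_start: datetime, window_end: datetime) -> bool:
    pg_cur = pg_conn.cursor()
    pg_cur.execute(
        "SELECT COUNT(*), COALESCE(SUM(total_amount), 0) FROM orders "
        "WHERE updated_at > %s AND updated_at <= %s",
        (window_start, window_end),
    )
    src_count, src_sum = pg_cur.fetchone()

    sf_cur = sf_conn.cursor()
    sf_cur.execute(
        "SELECT COUNT(*), COALESCE(SUM(total_amount), 0) FROM RAW.ORDERS "
        "WHERE updated_at > %s AND updated_at <= %s",
        (window_start, window_end),
    )
    tgt_count, tgt_sum = sf_cur.fetchone()

    passed = (src_count == tgt_count) and (src_sum == tgt_sum)
    sf_cur.execute(
        "INSERT INTO OPS.LOAD_AUDIT "
        "(table_name, window_start, window_end, src_rows, tgt_rows, "
        " src_sum, tgt_sum, passed, checked_at) "
        "VALUES (%s, %s, %s, %s, %s, %s, %s, %s, CURRENT_TIMESTAMP())",
        ("orders", window_start, window_end, src_count, tgt_count,
         src_sum, tgt_sum, passed),
    )
    return passed
```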
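Backfills stay duplicate-free when each run is scoped to an explicit date range and either re-MERGEs on keys or replaces the affected window atomically before reloading from the immutable raw layer. A sketch of the delete-and-reload variant, wrapped in one transaction so a failed run leaves the curated table untouched; names are illustrative.

```python
"""Sketch: backfill a date range without duplicating rows.

The affected window is deleted and rebuilt from the raw layer inside a
single transaction, so reruns of the same range converge to the same
state. Names are illustrative.
"""


def backfill_orders(sf_conn, start_date: str, end_date: str) -> None:
    cur = sf_conn.cursor()
    try:
        cur.execute("BEGIN")
        # Remove only the window being backfilled from the curated table.
        cur.execute(
            "DELETE FROM FINANCE.ORDERS WHERE order_date BETWEEN %s AND %s",
            (start_date, end_date),
        )
        # Rebuild that window from the raw layer.
        cur.execute(
            "INSERT INTO FINANCE.ORDERS "
            "SELECT * FROM RAW.ORDERS WHERE order_date BETWEEN %s AND %s",
            (start_date, end_date),
        )
        cur.execute("COMMIT")
    except Exception:
        cur.execute("ROLLBACK")
        raise
    finally:
        cur.close()
```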
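For orchestration on the existing AWS stack, a managed Airflow deployment (Amazon MWAA) is one low-ops option for a three-person team. A minimal DAG sketch, assuming Airflow 2.4+, showing dependency ordering, retries for transient failures, and a reconciliation gate so partial failures stop before the finance marts are rebuilt; all task callables are placeholders.

```python
"""Sketch: Airflow DAG for the 15-minute PostgreSQL/Stripe load path.

Retries absorb transient failures; the reconciliation task gates the
downstream transformation, so a bad load never reaches the marts.
Task bodies are placeholders.
"""
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_postgres(**context): ...
def extract_stripe(**context): ...
def load_raw(**context): ...
def reconcile(**context): ...
def build_finance_marts(**context): ...


with DAG(
    dag_id="northstar_core_loads",
    start_date=datetime(2024, 1, 1),
    schedule="*/15 * * * *",  # matches the 15-minute cadence requirement
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
) as dag:
    t_pg = PythonOperator(task_id="extract_postgres", python_callable=extract_postgres)
    t_stripe = PythonOperator(task_id="extract_stripe", python_callable=extract_stripe)
    t_load = PythonOperator(task_id="load_raw", python_callable=load_raw)
    t_check = PythonOperator(task_id="reconcile", python_callable=reconcile)
    t_marts = PythonOperator(task_id="build_finance_marts", python_callable=build_finance_marts)

    [t_pg, t_stripe] >> t_load >> t_check >> t_marts
```

Recovery from a partial failure then amounts to clearing and re-running the failed task: because loads are idempotent and the reconciliation gate sits before the marts, reruns cannot double-count finance data.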
Constraints
- Existing stack is AWS + Snowflake; prefer managed services over self-hosted systems.
- Team size: 3 data engineers; limited operational bandwidth.
- Finance data must be reproducible for month-end close and SOX-style audit reviews.
- Budget target: under $15K/month incremental infrastructure cost.
- Source systems cannot tolerate heavy read load during peak business hours.