Context
Northstar Retail, a mid-sized e-commerce company, loads order, payment, and refund data from PostgreSQL, Stripe, and 3PL-partner CSV files into Snowflake for finance and operations reporting. The current pipeline is a set of ad hoc Python cron jobs, and teams regularly find row-count mismatches, duplicate orders, and delayed backfills.
You need to design a production-grade batch data pipeline that improves data accuracy and reliability while keeping the stack simple for a small data team.
Scale Requirements
- Sources: PostgreSQL OLTP (~150 tables), Stripe API, daily CSV drops from S3
- Volume: 25M order records/year, 8M payment events/year, 2-5 GB of new raw data/day
- Freshness: Core finance tables available in Snowflake within 30 minutes of source updates
- Batch cadence: Every 15 minutes for database/API loads; daily for 3PL files
- Retention: Raw data for 1 year, curated finance marts for 7 years
Requirements
- Design ingestion from PostgreSQL (CDC or incremental extracts), the Stripe API, and S3 CSV drops into a raw layer (an incremental-extract sketch follows this list).
- Ensure accuracy and reliability through schema validation, deduplication, idempotent loads, reconciliation checks, and audit logging (see the load and reconciliation sketches below).
- Build transformations for canonical tables such as orders, payments, refunds, and order_financials.
- Support backfills for a specific date range without creating duplicates or corrupting downstream tables (see the backfill sketch below).
- Define orchestration, dependency management, and recovery steps for partial failures (see the DAG sketch below).
- Expose data quality status and pipeline health to analysts and on-call engineers.
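
The sketches below illustrate one possible shape for each requirement; every table, column, stage, and service name is an assumption for illustration, not part of the brief. First, a minimal watermark-based incremental extract from PostgreSQL, assuming source tables carry an updated_at column and psycopg2 as the client. A real design might use CDC (e.g. AWS DMS or Debezium) for high-churn tables, which also helps keep read load off the OLTP database during peak hours.

```python
"""Sketch: watermark-based incremental extract from PostgreSQL.

Assumes source tables have an updated_at column and that the last
successfully loaded watermark is tracked per table elsewhere (e.g. a
small control table). All names are illustrative.
"""
from datetime import datetime, timezone

import psycopg2
import psycopg2.extras


def extract_incremental(conn, table: str, last_watermark: datetime):
    """Pull only rows touched since the previous run, capped at 'now'."""
    high_watermark = datetime.now(timezone.utc)
    with conn.cursor(cursor_factory=psycopg2.extras.RealDictCursor) as cur:
        # Table name comes from a fixed allow-list in practice, never user input.
        cur.execute(
            f"SELECT * FROM {table} "
            "WHERE updated_at > %s AND updated_at <= %s",
            (last_watermark, high_watermark),
        )
        rows = cur.fetchall()
    return rows, high_watermark
```

The extracted rows would be written to S3 as the raw layer, and the new watermark recorded only after that write succeeds, so a failed run is simply re-extracted on the next attempt.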
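For idempotent, de-duplicated loads, one common pattern is to land each batch in a staging table and MERGE into the target on the business key, so re-running the same batch converges to the same state instead of duplicating rows. A minimal sketch using snowflake-connector-python; database, schema, stage, and column names are assumptions.

```python
"""Sketch: idempotent load into Snowflake via staging table + MERGE.

Re-running a batch cannot duplicate orders because the MERGE matches on
the business key and keeps only the newest staged version of each row.
Object names are illustrative.
"""
import snowflake.connector

MERGE_SQL = """
MERGE INTO RAW.ORDERS AS tgt
USING (
    -- keep only the latest version of each order within the staged batch
    SELECT * FROM RAW.ORDERS_STAGE
    QUALIFY ROW_NUMBER() OVER (
        PARTITION BY order_id ORDER BY updated_at DESC
    ) = 1
) AS src
ON tgt.order_id = src.order_id
WHEN MATCHED AND src.updated_at > tgt.updated_at THEN UPDATE SET
    status = src.status,
    total_amount = src.total_amount,
    updated_at = src.updated_at
WHEN NOT MATCHED THEN INSERT (order_id, status, total_amount, updated_at)
    VALUES (src.order_id, src.status, src.total_amount, src.updated_at)
"""


def load_batch(conn, staged_file: str) -> None:
    """COPY one staged file into the staging table, then MERGE to the target."""
    cur = conn.cursor()
    try:
        cur.execute("BEGIN")
        cur.execute("TRUNCATE TABLE RAW.ORDERS_STAGE")
        # File-format options omitted; @raw_stage is an assumed named stage.
        cur.execute(f"COPY INTO RAW.ORDERS_STAGE FROM @raw_stage/{staged_file}")
        cur.execute(MERGE_SQL)
        cur.execute("COMMIT")
    except Exception:
        cur.execute("ROLLBACK")
        raise
    finally:
        cur.close()
```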
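Reconciliation can be as simple as comparing row counts and amount sums between source and warehouse for each load window and writing the result to an audit table; that same table doubles as the data quality status surface analysts and on-call engineers can query. A sketch with illustrative table names (OPS.LOAD_AUDIT is an assumed audit table).

```python
"""Sketch: source-vs-warehouse reconciliation for one load window.

Counts and sums are compared for the window just loaded, and the outcome
is recorded in an audit table so pipeline health stays queryable.
Names are illustrative.
"""
from datetime import datetime


def reconcile_orders(pg_conn, sf_conn, window_start: datetime, window_end: datetime) -> bool:
    pg_cur = pg_conn.cursor()
    pg_cur.execute(
        "SELECT COUNT(*), COALESCE(SUM(total_amount), 0) FROM orders "
        "WHERE updated_at > %s AND updated_at <= %s",
        (window_start, window_end),
    )
    src_count, src_sum = pg_cur.fetchone()

    sf_cur = sf_conn.cursor()
    sf_cur.execute(
        "SELECT COUNT(*), COALESCE(SUM(total_amount), 0) FROM RAW.ORDERS "
        "WHERE updated_at > %s AND updated_at <= %s",
        (window_start, window_end),
    )
    tgt_count, tgt_sum = sf_cur.fetchone()

    passed = (src_count == tgt_count) and (src_sum == tgt_sum)
    sf_cur.execute(
        "INSERT INTO OPS.LOAD_AUDIT "
        "(table_name, window_start, window_end, src_rows, tgt_rows, "
        " src_sum, tgt_sum, passed, checked_at) "
        "VALUES (%s, %s, %s, %s, %s, %s, %s, %s, CURRENT_TIMESTAMP())",
        ("orders", window_start, window_end, src_count, tgt_count,
         src_sum, tgt_sum, passed),
    )
    return passed
```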
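Backfills stay duplicate-free when each run is scoped to an explicit date range and either re-MERGEs on keys or replaces the affected window atomically before reloading from the immutable raw layer. A sketch of the delete-and-reload variant, wrapped in one transaction so a failed run leaves the curated table untouched; names are illustrative.

```python
"""Sketch: backfill a date range without duplicating rows.

The affected window is deleted and rebuilt from the raw layer inside a
single transaction, so reruns of the same range converge to the same
state. Names are illustrative.
"""


def backfill_orders(sf_conn, start_date: str, end_date: str) -> None:
    cur = sf_conn.cursor()
    try:
        cur.execute("BEGIN")
        # Remove only the window being backfilled from the curated table.
        cur.execute(
            "DELETE FROM FINANCE.ORDERS WHERE order_date BETWEEN %s AND %s",
            (start_date, end_date),
        )
        # Rebuild that window from the raw layer.
        cur.execute(
            "INSERT INTO FINANCE.ORDERS "
            "SELECT * FROM RAW.ORDERS WHERE order_date BETWEEN %s AND %s",
            (start_date, end_date),
        )
        cur.execute("COMMIT")
    except Exception:
        cur.execute("ROLLBACK")
        raise
    finally:
        cur.close()
```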
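For orchestration on the existing AWS stack, a managed Airflow deployment (Amazon MWAA) is one low-ops option for a three-person team. A minimal DAG sketch, assuming Airflow 2.4+, showing dependency ordering, retries for transient failures, and a reconciliation gate so partial failures stop before the finance marts are rebuilt; all task callables are placeholders.

```python
"""Sketch: Airflow DAG for the 15-minute PostgreSQL/Stripe load path.

Retries absorb transient failures; the reconciliation task gates the
downstream transformation, so a bad load never reaches the marts.
Task bodies are placeholders.
"""
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_postgres(**context): ...
def extract_stripe(**context): ...
def load_raw(**context): ...
def reconcile(**context): ...
def build_finance_marts(**context): ...


with DAG(
    dag_id="northstar_core_loads",
    start_date=datetime(2024, 1, 1),
    schedule="*/15 * * * *",  # matches the 15-minute cadence requirement
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
) as dag:
    t_pg = PythonOperator(task_id="extract_postgres", python_callable=extract_postgres)
    t_stripe = PythonOperator(task_id="extract_stripe", python_callable=extract_stripe)
    t_load = PythonOperator(task_id="load_raw", python_callable=load_raw)
    t_check = PythonOperator(task_id="reconcile", python_callable=reconcile)
    t_marts = PythonOperator(task_id="build_finance_marts", python_callable=build_finance_marts)

    [t_pg, t_stripe] >> t_load >> t_check >> t_marts
```

Recovery from a partial failure then amounts to clearing and re-running the failed task: because loads are idempotent and the reconciliation gate sits before the marts, reruns cannot double-count finance data.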
Constraints
- Existing stack is AWS + Snowflake; prefer managed services over self-hosted systems.
- Team size: 3 data engineers; limited operational bandwidth.
- Finance data must be reproducible for month-end close and SOX-style audit reviews.
- Budget target: under $15K/month incremental infrastructure cost.
- Source systems cannot tolerate heavy read load during peak business hours.