Context
Northstar Health, a mid-sized healthcare SaaS company, generates operational and finance reports from PostgreSQL application databases, Salesforce, and Stripe. Today, nightly ETL jobs load data into Snowflake, but report discrepancies, duplicate records, and silent schema changes have reduced stakeholder trust.
You are asked to redesign the reporting pipeline so reports are accurate, auditable, and resilient to upstream data issues while keeping daily business reporting available by 7:00 AM.
Scale Requirements
- Sources: 3 primary systems: PostgreSQL (~250 tables), Salesforce (~40 objects), Stripe (~15M transactions total)
- Daily volume: ~120M changed rows/day, ~450 GB raw ingest/day
- Latency target: Critical finance and operations tables available in Snowflake within 30 minutes of source extraction; executive dashboards refreshed by 7:00 AM UTC
- Retention: Raw immutable history for 1 year; curated reporting tables for 5 years
- Data quality SLA: <0.1% invalid records in curated tables; 100% reconciliation coverage for finance metrics
Requirements
- Design an ETL/ELT pipeline that ingests incremental changes from PostgreSQL, Salesforce, and Stripe into Snowflake.
- Ensure data integrity and accuracy using schema validation, deduplication, referential integrity checks, reconciliation against source totals, and audit trails.
- Support idempotent reruns and backfills for any date range without creating duplicate facts.
- Build curated reporting models for revenue, subscriptions, invoices, and customer activity.
- Add orchestration, dependency management, and data quality gates so failed validations block downstream report publication.
- Define monitoring, alerting, and failure recovery procedures for extraction failures, late-arriving data, and schema drift.
Constraints
- Must run primarily on AWS using existing Snowflake and Airflow investments
- Team size: 3 data engineers, 1 analytics engineer
- Compliance: SOC 2 and HIPAA-adjacent controls; auditability is mandatory
- Budget: Prefer managed services; avoid introducing more than one new major platform
- Reports consumed by finance leadership require reproducible numbers and row-level lineage for material metrics