Context
HealthPulse, a digital health insurer, is launching a predictive analytics project to forecast member churn and care-risk scores. Today, source data from claims, CRM, eligibility, and mobile engagement systems lands in Snowflake through ad hoc batch jobs, with inconsistent schemas, limited lineage, and no formal data quality gates.
You need to design a governed data pipeline that scopes the project correctly from day one: define trusted datasets, enforce quality checks before model consumption, and provide auditable lineage for regulated healthcare analytics.
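The trusted-dataset separation described above can be framed in Snowflake from day one. A minimal sketch, assuming illustrative database and schema names (RAW, VALIDATED, CURATED, FEATURES are not prescribed by the brief):

```sql
-- Illustrative layer separation: one schema per trust level.
-- All object names here are assumptions, not requirements.
CREATE DATABASE IF NOT EXISTS HEALTHPULSE_ANALYTICS;

-- RAW: immutable landing zone. The 90-day Time Travel setting below aids
-- recovery/audit; the brief's 90-day raw retention itself would be
-- enforced by a separate purge job, not shown here.
CREATE SCHEMA IF NOT EXISTS HEALTHPULSE_ANALYTICS.RAW
  DATA_RETENTION_TIME_IN_DAYS = 90;

-- VALIDATED: schema-enforced, quality-gated copies of RAW.
CREATE SCHEMA IF NOT EXISTS HEALTHPULSE_ANALYTICS.VALIDATED;

-- CURATED: conformed, deduplicated business entities
-- (7-year retention handled by table-level policies, not shown).
CREATE SCHEMA IF NOT EXISTS HEALTHPULSE_ANALYTICS.CURATED;

-- FEATURES: certified, versioned model-ready tables.
CREATE SCHEMA IF NOT EXISTS HEALTHPULSE_ANALYTICS.FEATURES;
```

Keeping each trust level in its own schema makes access grants, retention policies, and lineage boundaries explicit rather than convention-based.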
Scale Requirements
- Sources: 12 upstream systems (FHIR APIs, PostgreSQL databases, SFTP CSV drops, and a Kafka event stream)
- Volume: 1.2 TB/day raw ingest, ~8 billion records total historical backfill
- Freshness: critical features available within 30 minutes; non-critical dimensions within 6 hours
- Batch windows: nightly backfill up to 4 hours; incremental loads every 15 minutes
- Retention: 7 years for curated data, 90 days for raw landing zone
- Consumers: 3 ML models, 40 analysts, 6 downstream BI dashboards
Requirements
- Design an ingestion and transformation pipeline that supports both historical backfill and incremental processing.
- Define how raw, validated, curated, and feature-ready datasets are separated and versioned.
- Implement data quality controls for schema drift, null rates, referential integrity, duplicate records, and freshness.
- Enforce governance: lineage, ownership, PII tagging, access control, and auditability for HIPAA-sensitive fields.
- Specify how model-ready tables are certified before use by data scientists.
- Describe orchestration, retry behavior, and rollback for failed loads or bad upstream data.
- Include monitoring for pipeline health, data quality regressions, and SLA breaches.
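The quality controls listed above map naturally onto SQL checks run before a dataset is promoted (and onto dbt generic tests such as `unique`, `not_null`, and `relationships`). A minimal sketch, assuming hypothetical VALIDATED.STG_CLAIMS and CURATED.DIM_MEMBER tables with CLAIM_ID, MEMBER_ID, and LOADED_AT columns; thresholds are illustrative:

```sql
-- 1. Null-rate check: gate fails if MEMBER_ID is null above a threshold.
SELECT COUNT_IF(MEMBER_ID IS NULL) / NULLIF(COUNT(*), 0) AS member_id_null_rate
FROM VALIDATED.STG_CLAIMS;  -- compare against e.g. 0.005 in the orchestrator

-- 2. Duplicate check: CLAIM_ID must be unique; any rows returned = failure.
SELECT CLAIM_ID, COUNT(*) AS n
FROM VALIDATED.STG_CLAIMS
GROUP BY CLAIM_ID
HAVING COUNT(*) > 1;

-- 3. Referential integrity: every claim must reference a known member.
SELECT c.CLAIM_ID
FROM VALIDATED.STG_CLAIMS c
LEFT JOIN CURATED.DIM_MEMBER m ON c.MEMBER_ID = m.MEMBER_ID
WHERE m.MEMBER_ID IS NULL;

-- 4. Freshness: newest load must sit within the 30-minute SLA.
SELECT DATEDIFF('minute', MAX(LOADED_AT), CURRENT_TIMESTAMP()) <= 30 AS is_fresh
FROM VALIDATED.STG_CLAIMS;
```

Running these as blocking steps in the orchestrator (rather than as after-the-fact reports) is what turns them into gates: a failed check halts promotion from VALIDATED to CURATED instead of merely raising an alert.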
Constraints
- Cloud stack is AWS + Snowflake; avoid introducing more than one new major platform.
- Team has strong SQL/dbt skills but limited streaming expertise.
- Budget cap is $40K/month incremental infrastructure spend.
- HIPAA compliance requires field-level access controls, immutable audit logs, and reproducible training datasets.
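The field-level access control constraint can be met with Snowflake's native object tagging and dynamic data masking, which also avoids adding a new platform. A minimal sketch, assuming an illustrative GOVERNANCE schema, a PHI_READER role, and an SSN column on the member dimension (all hypothetical names):

```sql
-- Tag taxonomy for PII/PHI classification; feeds lineage and audit tooling.
CREATE TAG IF NOT EXISTS GOVERNANCE.PII_TAG
  ALLOWED_VALUES 'phi', 'pii', 'none';

-- Dynamic masking: only the approved role sees cleartext.
CREATE MASKING POLICY IF NOT EXISTS GOVERNANCE.MASK_PHI AS (val STRING)
  RETURNS STRING ->
  CASE
    WHEN CURRENT_ROLE() IN ('PHI_READER') THEN val
    ELSE '***MASKED***'
  END;

-- Attach the policy and the classification tag to a sensitive column.
ALTER TABLE CURATED.DIM_MEMBER
  MODIFY COLUMN SSN SET MASKING POLICY GOVERNANCE.MASK_PHI;
ALTER TABLE CURATED.DIM_MEMBER
  MODIFY COLUMN SSN SET TAG GOVERNANCE.PII_TAG = 'phi';
```

Because masking is evaluated at query time by role, the same curated tables can serve analysts, BI dashboards, and ML pipelines without maintaining parallel redacted copies; Snowflake's ACCESS_HISTORY views then provide the immutable audit trail of who queried which tagged columns.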