Context
Northstar Health, a multi-region healthcare analytics company, currently lands CSV extracts from transactional systems directly into Snowflake using nightly batch jobs. The platform now needs a governed data lake architecture to support raw retention, replayable ingestion, lower-cost storage, and mixed workloads for BI, data science, and compliance audits.
You are asked to design a modern AWS-based data lake that separates raw, curated, and serving layers while preserving lineage and data quality across batch and near-real-time sources.
Scale Requirements
- Sources: 120 operational databases, 40 SaaS APIs, and 15 event streams
- Ingestion volume: 12 TB/day batch data + 80K events/sec streaming peak
- File/object count: ~9 million new objects/day
- Latency targets: batch data available in curated zone within 2 hours; streaming data queryable within 10 minutes
- Retention: raw zone for 7 years, curated zone for 2 years, serving aggregates indefinite
- Consumers: 300 BI users, 40 data scientists, 25 downstream applications
Requirements
- Design lake zones for raw/bronze, clean/silver, and business-ready/gold datasets.
- Support both CDC/batch ingestion and streaming ingestion with replay capability.
- Enforce schema evolution, partitioning strategy, deduplication, and idempotent loads.
- Implement data quality checks for completeness, freshness, null rates, and referential integrity.
- Orchestrate transformations and backfills without impacting production SLAs.
- Expose curated data to Snowflake and Athena while maintaining a central metadata catalog.
- Provide lineage, auditability, and access controls for PHI-sensitive datasets.
Constraints
- Must run primarily on AWS using managed services where possible.
- Incremental budget cap: $60K/month excluding Snowflake compute.
- Compliance: HIPAA and regional data residency requirements.
- Team size: 5 data engineers, 1 platform engineer; operational simplicity matters.
- Existing consumers depend on nightly warehouse tables and cannot tolerate breaking schema changes.