Context
Northstar Health, a mid-market healthcare SaaS company, wants to start an AI and analytics engagement for customer reporting, churn prediction, and support copilots. Today, data comes from PostgreSQL OLTP databases, Salesforce, Zendesk, and application event logs, but the platform was built for ad hoc BI and not for reliable downstream ML or LLM workloads.
You are asked to evaluate whether the current data platform is ready for these workloads and, if it is not, to design the minimum target-state pipeline needed. Focus on ingestion reliability, data quality, lineage, governance, and serving patterns for both analytics and AI use cases.
Scale Requirements
- Sources: 12 PostgreSQL databases, Salesforce, Zendesk, S3 file drops, and Kafka event streams
- Volume: ~2.5 TB/day raw data, 180M CDC row changes/day, 40K events/sec peak
- Latency targets: <15 minutes for analytics tables, <5 minutes for operational AI features; daily backfills must cover up to 2 years of history
- Storage: 900 TB of historical data retained in object storage, 3 years of warehouse retention
- Consumers: 80 BI users, 12 data scientists, 4 production AI services
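The volume figures above imply a few derived rates that are useful before sizing ingestion. The following is a back-of-envelope sketch rather than a sizing model: it assumes decimal units (1 TB = 1,000,000 MB) and an even spread of load over 24 hours, which real traffic will not follow. The function names are illustrative.

```python
# Figures taken directly from the scale requirements.
RAW_TB_PER_DAY = 2.5
CDC_ROWS_PER_DAY = 180_000_000
RETAINED_TB = 900
SECONDS_PER_DAY = 86_400


def avg_cdc_rows_per_sec() -> float:
    """Average CDC change rate, assuming changes are spread evenly over the day."""
    return CDC_ROWS_PER_DAY / SECONDS_PER_DAY


def avg_raw_mb_per_sec() -> float:
    """Average raw ingest rate in MB/s, assuming 1 TB = 1,000,000 MB."""
    return RAW_TB_PER_DAY * 1_000_000 / SECONDS_PER_DAY


def days_of_raw_at_current_rate() -> float:
    """How many days of raw data 900 TB holds at today's rate, uncompressed."""
    return RETAINED_TB / RAW_TB_PER_DAY


if __name__ == "__main__":
    print(f"avg CDC rows/sec: {avg_cdc_rows_per_sec():.0f}")
    print(f"avg raw MB/sec:   {avg_raw_mb_per_sec():.2f}")
    print(f"days covered:     {days_of_raw_at_current_rate():.0f}")
```

At today's rate, 900 TB corresponds to roughly 360 days of uncompressed raw data, so a 2-year backfill window presumably relies on compression or on lower historical volumes. That is an inference from the arithmetic, not something the requirements state.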
Requirements
- Define the readiness criteria you would assess across ingestion, transformation, modeling, quality, governance, and observability.
- Design a pipeline that supports both batch and near-real-time data products for analytics and AI features.
- Show how you would validate source completeness, schema stability, freshness, and business-rule correctness before data is used by models.
- Include orchestration for incremental loads, dependency management, retries, and backfills.
- Explain how PII handling, access control, and auditability would work for regulated healthcare data.
- Describe how curated datasets would be exposed to BI, feature generation, and LLM/RAG indexing workflows.
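The freshness and completeness checks named in the requirements can be sketched as a small gate that runs before a batch is published to downstream models. This is a minimal illustration under stated assumptions: `BatchStats`, `check_batch`, and the 99.9% completeness threshold are hypothetical, and in practice the row counts would come from the source system and the warehouse respectively.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List


@dataclass
class BatchStats:
    """Per-table load statistics (hypothetical shape)."""
    table: str
    source_row_count: int        # rows reported by the source for the window
    loaded_row_count: int        # rows that actually landed in the warehouse
    latest_event_time: datetime  # newest record timestamp in the loaded data


def check_batch(
    stats: BatchStats,
    now: datetime,
    max_lag: timedelta,
    min_completeness: float = 0.999,  # assumed threshold, not from the requirements
) -> List[str]:
    """Return a list of human-readable failures; an empty list means the batch passes."""
    failures: List[str] = []

    # Freshness: the newest loaded record must be within the latency target.
    lag = now - stats.latest_event_time
    if lag > max_lag:
        failures.append(f"{stats.table}: stale by {lag - max_lag}")

    # Completeness: loaded rows as a fraction of what the source reported.
    if stats.source_row_count > 0:
        ratio = stats.loaded_row_count / stats.source_row_count
        if ratio < min_completeness:
            failures.append(f"{stats.table}: only {ratio:.1%} of source rows loaded")

    return failures


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    stats = BatchStats("orders", 1_000, 900, now - timedelta(minutes=30))
    # 15 minutes is the analytics-table latency target from the scale requirements.
    for failure in check_batch(stats, now, max_lag=timedelta(minutes=15)):
        print(failure)
```

Returning a list of failures rather than raising on the first one lets an orchestrator log every problem for a table in a single run before deciding whether to block publication.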
Constraints
- Existing stack is AWS-centric and already uses S3, Airflow, and Snowflake
- Team has 5 data engineers and 1 platform engineer; operational complexity must stay moderate
- Incremental budget cap is $35K/month
- Compliance requirements include HIPAA, row-level access controls, and deletion/audit workflows
- Source teams frequently introduce undocumented schema changes
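Because undocumented schema changes are a stated constraint, one option is to compare each source table's observed columns against a stored contract before loading. The sketch below assumes both sides are available as simple column-name-to-type mappings (for example, read from a database's `information_schema.columns` view); `diff_schema` and `is_breaking` are hypothetical helper names, and treating added columns as non-breaking is a policy assumption, not a rule from the requirements.

```python
from typing import Dict, List


def diff_schema(
    expected: Dict[str, str], observed: Dict[str, str]
) -> Dict[str, List[str]]:
    """Compare a stored column contract with the columns seen at load time."""
    expected_cols, observed_cols = set(expected), set(observed)
    return {
        # Columns the source added without updating the contract.
        "added": sorted(observed_cols - expected_cols),
        # Columns the contract expects but the source no longer provides.
        "removed": sorted(expected_cols - observed_cols),
        # Columns present on both sides whose declared type changed.
        "retyped": sorted(
            col for col in expected_cols & observed_cols
            if expected[col] != observed[col]
        ),
    }


def is_breaking(diff: Dict[str, List[str]]) -> bool:
    """Assumed policy: removals and type changes block the load; additions only warn."""
    return bool(diff["removed"] or diff["retyped"])


if __name__ == "__main__":
    contract = {"id": "bigint", "email": "text"}
    seen = {"id": "bigint", "email": "varchar", "plan": "text"}
    drift = diff_schema(contract, seen)
    print(drift, "breaking" if is_breaking(drift) else "non-breaking")
```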