Context
NovaTrials runs a continuous clinical outcomes study across 120 research sites. Site coordinators currently upload CSV extracts from electronic data capture (EDC) systems and lab vendors, and an analyst manually cleans, joins, and refreshes weekly reporting tables in Snowflake; each refresh takes 6-8 hours and frequently introduces versioning errors.
You need to design an automated reporting pipeline that produces validated study metrics for operations and medical teams with same-day freshness, while preserving auditability and reproducibility.
Scale Requirements
- Sources: 120 site-level EDC exports, 3 lab vendors, 1 patient engagement platform
- Volume: ~15M new records/month, ~250 GB of compressed raw data/month
- Batch cadence: Hourly ingestion, daily reporting refresh by 6:00 AM UTC
- Latency target: Source file arrival to curated reporting tables in < 90 minutes
- Retention: 7 years for raw and curated datasets
- Data quality target: < 0.5% invalid records in curated tables
Requirements
- Design an ingestion layer for CSV, JSON, and SFTP-delivered files from external study partners.
- Build an idempotent ETL pipeline that standardizes schemas, deduplicates records, and handles late-arriving corrections.
- Model study entities such as patient enrollment, visits, adverse events, and lab results into analytics-ready tables.
- Orchestrate hourly loads and a daily reporting build with dependency management and backfill support.
- Implement data quality checks for missing patient IDs, invalid visit dates, duplicate lab results, and referential integrity across study tables.
- Support reprocessing of historical files without double-counting metrics.
- Provide monitoring, alerting, and an audit trail for every file, transformation run, and published report snapshot.
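The idempotent ETL requirement above can be sketched as a merge-by-key pattern: each record is keyed by its natural business key, and the version with the latest source revision timestamp wins, so replaying a batch is a no-op and late-arriving corrections overwrite cleanly. This is a minimal in-memory illustration, assuming hypothetical field names (`patient_id`, `visit_id`, `record_type`, `revised_at`); in the real pipeline the same logic would be a Snowflake MERGE keyed on the same columns.

```python
from datetime import datetime, timezone

def merge_records(existing: dict, incoming: list[dict]) -> dict:
    """Idempotent upsert: key each record by its natural business key and
    keep the version with the latest source revision timestamp.
    Replaying the same batch leaves the result unchanged (no double-counting)."""
    merged = dict(existing)
    for rec in incoming:
        key = (rec["patient_id"], rec["visit_id"], rec["record_type"])
        current = merged.get(key)
        # A late-arriving correction carries a newer revised_at and replaces the row.
        if current is None or rec["revised_at"] > current["revised_at"]:
            merged[key] = rec
    return merged
```

Because the comparison is on `revised_at` rather than load order, the result is the same whether a correction arrives an hour or a month after the original record.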
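The data quality bullet can be made concrete with a row-level validator that counts each failure mode and reports the overall invalid fraction for comparison against the <0.5% target. This is an illustrative sketch with assumed column names (`patient_id`, `visit_date`, `test_code`), not a prescribed schema; in practice these checks would run as warehouse assertions before publishing the curated tables.

```python
from datetime import date

def run_quality_checks(rows: list[dict], study_start: date) -> dict:
    """Flag rows with missing patient IDs, out-of-range visit dates, or
    duplicate lab results; return per-check counts and the invalid fraction."""
    seen_labs = set()
    failures = {"missing_patient_id": 0, "invalid_visit_date": 0, "duplicate_lab_result": 0}
    invalid_rows = 0
    for row in rows:
        bad = False
        if not row.get("patient_id"):
            failures["missing_patient_id"] += 1
            bad = True
        visit = row.get("visit_date")
        # A visit before study start or in the future is treated as invalid.
        if visit is None or visit < study_start or visit > date.today():
            failures["invalid_visit_date"] += 1
            bad = True
        lab_key = (row.get("patient_id"), row.get("visit_date"), row.get("test_code"))
        if lab_key in seen_labs:
            failures["duplicate_lab_result"] += 1
            bad = True
        seen_labs.add(lab_key)
        if bad:
            invalid_rows += 1
    return {
        "failures": failures,
        "invalid_fraction": invalid_rows / len(rows) if rows else 0.0,
    }
```

Returning counts per check (rather than a single pass/fail) lets alerting distinguish a systemic schema problem at one site from scattered single-row errors.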
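The audit-trail and reprocessing requirements combine naturally in a file load ledger: every arriving file is recorded by content hash, so a historical file replayed during a backfill is recognized and skipped rather than double-counted. The sketch below is an in-memory stand-in under assumed names (`LoadLedger`, `register`); a production ledger would be an append-only Snowflake table or S3-backed manifest with the same hash-keyed semantics.

```python
import hashlib
from datetime import datetime, timezone

class LoadLedger:
    """Append-only record of ingested files, keyed by SHA-256 of the file
    content. Serves two purposes: an immutable load history for audit, and
    a guard so reprocessed files are skipped instead of double-counted."""

    def __init__(self):
        self._entries = {}  # sha256 hex digest -> ledger entry

    def register(self, filename: str, content: bytes) -> bool:
        """Return True if the file is new and should be loaded; False if
        this exact content was already ingested (e.g. a vendor re-send)."""
        digest = hashlib.sha256(content).hexdigest()
        if digest in self._entries:
            return False
        self._entries[digest] = {
            "filename": filename,
            "sha256": digest,
            "ingested_at": datetime.now(timezone.utc),
        }
        return True
```

Keying on content rather than filename matters here: vendors commonly re-deliver identical files under new names, and a renamed re-send must still be treated as already loaded.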
Constraints
- Existing stack is AWS + Snowflake; avoid introducing more than one new major platform.
- Team size is 3 data engineers and 1 analytics engineer.
- Study data is regulated: HIPAA controls, encrypted storage, role-based access, and immutable load history are required.
- Budget allows moderate warehouse scaling, but compute-heavy full reloads should be avoided.