Context
GSK needs a unified pipeline to collect and analyze clinical trial, pharmacovigilance, and manufacturing telemetry data for near-real-time operational reporting and downstream analytics. Today, several teams load CSV extracts and API pulls into separate stores on different schedules, creating 12–24-hour delays, inconsistent schemas, and weak lineage across GSK’s analytics estate.
You are asked to design a modern pipeline that standardizes ingestion, validation, transformation, and serving into GSK’s enterprise analytics platform while supporting both batch and streaming use cases.
Scale Requirements
- Sources: ~120 upstream systems (EDC exports, lab APIs, safety systems, manufacturing sensors)
- Throughput: 40K events/second peak streaming telemetry; 8 TB/day batch ingest
- Latency: < 2 minutes for streaming data to become queryable; < 45 minutes end-to-end for batch loads
- Storage: 3 PB historical retention in raw zone; curated data retained for 7+ years
- Concurrency: 300+ daily scheduled workflows, 50+ concurrent transformations
Requirements
- Design ingestion for mixed sources: SFTP file drops, REST APIs, and Kafka-based event streams.
- Build raw, validated, and curated layers in a lakehouse architecture used by GSK analytics teams.
- Enforce schema validation, deduplication, idempotent reprocessing, and lineage for regulated datasets.
- Support both incremental ELT for analytical models and stream processing for operational monitoring.
- Orchestrate dependencies across ingestion, quality checks, transformations, and downstream publishing.
- Provide monitoring for freshness, volume anomalies, failed loads, and data quality SLA breaches.
- Describe how you would backfill 6 months of historical data without disrupting current loads.
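One way to meet the deduplication and idempotent-reprocessing requirement (and to tolerate sources that resend events) is a keyed, versioned upsert: replaying the same batch is a no-op, and a newer version wins regardless of arrival order. A minimal Python sketch; `Event` and its fields are illustrative, not a GSK schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    event_id: str   # natural key from the source system
    version: int    # monotonically increasing per event_id
    payload: str

def upsert(store: dict[str, Event], incoming: list[Event]) -> dict[str, Event]:
    """Idempotent keyed upsert: duplicates and replays leave the store
    unchanged, and a higher version always supersedes a lower one."""
    for ev in incoming:
        current = store.get(ev.event_id)
        if current is None or ev.version > current.version:
            store[ev.event_id] = ev
    return store
```

In a lakehouse this same pattern is typically expressed as a `MERGE INTO` on the validated layer keyed on the natural ID, which keeps reprocessing safe for regulated datasets.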
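For the backfill question, a common approach is to split the 6-month window into bounded date partitions and run them as separate, throttled jobs in a dedicated queue so they never compete with current loads. A sketch of the chunking step (function name and default batch size are assumptions, not prescribed):

```python
from datetime import date, timedelta

def backfill_partitions(start: date, end: date,
                        batch_days: int = 7) -> list[tuple[date, date]]:
    """Split an inclusive historical window into bounded chunks so each
    backfill job touches a small, retryable slice of the raw zone."""
    chunks = []
    cursor = start
    while cursor <= end:
        chunk_end = min(cursor + timedelta(days=batch_days - 1), end)
        chunks.append((cursor, chunk_end))
        cursor = chunk_end + timedelta(days=1)
    return chunks
```

Because writes are idempotent per partition, a failed chunk can simply be rerun without touching the rest of the window.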
Constraints
- Must align to GxP/GDPR controls, auditability, and role-based access requirements.
- Prefer GSK-standard cloud data platform components over introducing many new tools.
- Team size is 5 data engineers and 1 platform engineer; operational simplicity matters.
- Budget allows moderate autoscaling, but always-on oversized clusters are discouraged.
- Some source systems are unreliable and may resend files or events out of order.
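For SFTP sources that resend files, one simple guard is a content-hash manifest checked at ingestion time: a resent file with identical bytes hashes to the same digest and is skipped, so replays are safe. An illustrative sketch (not a GSK-standard component; in practice the manifest would live in a durable store, not memory):

```python
import hashlib
from pathlib import Path

def should_ingest(path: Path, manifest: set[str]) -> bool:
    """Return True only the first time a file's content is seen.
    Resent files with identical bytes produce the same SHA-256
    digest and are ignored, making file-drop ingestion idempotent."""
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    if digest in manifest:
        return False
    manifest.add(digest)
    return True
```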