Context
GSK needs a unified pipeline to collect and analyze clinical trial, pharmacovigilance, and manufacturing telemetry data for near-real-time operational reporting and downstream analytics. Today, several teams load CSV extracts and API pulls into separate stores on different schedules, creating 12–24-hour delays, inconsistent schemas, and weak lineage across GSK’s analytics estate.
You are asked to design a modern pipeline that standardizes ingestion, validation, transformation, and serving into GSK’s enterprise analytics platform while supporting both batch and streaming use cases.
Scale Requirements
- Sources: ~120 upstream systems (EDC exports, lab APIs, safety systems, manufacturing sensors)
- Throughput: 40K events/second peak streaming telemetry; 8 TB/day batch ingest
- Latency: < 2 minutes for streaming data to become queryable; < 45 minutes end-to-end for batch loads
- Storage: 3 PB historical retention in raw zone; curated data retained for 7+ years
- Concurrency: 300+ daily scheduled workflows, 50+ concurrent transformations
Requirements
- Design ingestion for mixed sources: SFTP file drops, REST APIs, and Kafka-based event streams.
- Build raw, validated, and curated layers in a lakehouse architecture used by GSK analytics teams.
- Enforce schema validation, deduplication, idempotent reprocessing, and lineage for regulated datasets.
- Support both incremental ELT for analytical models and stream processing for operational monitoring.
- Orchestrate dependencies across ingestion, quality checks, transformations, and downstream publishing.
- Provide monitoring for freshness, volume anomalies, failed loads, and data quality SLA breaches.
- Describe how you would backfill 6 months of historical data without disrupting current loads.
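One way to meet the deduplication and idempotent-reprocessing requirement (and to tolerate sources that resend events) is a keyed, versioned upsert: replaying the same batch is a no-op, and a newer version wins regardless of arrival order. A minimal Python sketch; `Event` and its fields are illustrative, not a GSK schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    event_id: str   # natural key from the source system
    version: int    # monotonically increasing per event_id
    payload: str

def upsert(store: dict[str, Event], incoming: list[Event]) -> dict[str, Event]:
    """Idempotent keyed upsert: duplicates and replays leave the store
    unchanged, and a higher version always supersedes a lower one."""
    for ev in incoming:
        current = store.get(ev.event_id)
        if current is None or ev.version > current.version:
            store[ev.event_id] = ev
    return store
```

In a lakehouse this same pattern is typically expressed as a `MERGE INTO` on the validated layer keyed on the natural ID, which keeps reprocessing safe for regulated datasets.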
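For the backfill question, a common approach is to split the 6-month window into bounded date partitions and run them as separate, throttled jobs in a dedicated queue so they never compete with current loads. A sketch of the chunking step (function name and default batch size are assumptions, not prescribed):

```python
from datetime import date, timedelta

def backfill_partitions(start: date, end: date,
                        batch_days: int = 7) -> list[tuple[date, date]]:
    """Split an inclusive historical window into bounded chunks so each
    backfill job touches a small, retryable slice of the raw zone."""
    chunks = []
    cursor = start
    while cursor <= end:
        chunk_end = min(cursor + timedelta(days=batch_days - 1), end)
        chunks.append((cursor, chunk_end))
        cursor = chunk_end + timedelta(days=1)
    return chunks
```

Because writes are idempotent per partition, a failed chunk can simply be rerun without touching the rest of the window.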
Constraints
- Must align to GxP/GDPR controls, auditability, and role-based access requirements.
- Prefer GSK-standard cloud data platform components over introducing many new tools.
- Team size is 5 data engineers and 1 platform engineer; operational simplicity matters.
- Budget allows moderate autoscaling, but always-on oversized clusters are discouraged.
- Some source systems are unreliable and may resend files or events out of order.
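For SFTP sources that resend files, one simple guard is a content-hash manifest checked at ingestion time: a resent file with identical bytes hashes to the same digest and is skipped, so replays are safe. An illustrative sketch (not a GSK-standard component; in practice the manifest would live in a durable store, not memory):

```python
import hashlib
from pathlib import Path

def should_ingest(path: Path, manifest: set[str]) -> bool:
    """Return True only the first time a file's content is seen.
    Resent files with identical bytes produce the same SHA-256
    digest and are ignored, making file-drop ingestion idempotent."""
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    if digest in manifest:
        return False
    manifest.add(digest)
    return True
```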