Business Context
A digital health research organization runs dozens of IRB-approved studies across mobile apps and clinics, collecting data from surveys, wearable sensors, and EHR extracts. At peak, the platform ingests millions of sensor events per day, and study teams must justify their data collection methodology in quarterly governance reviews (cost, participant burden, and data quality).
The research ops team wants a standardized SQL report that summarizes, per study, which collection methods were used, how much data was collected, and whether a method was the primary source of data for that study.
Task
Write a SQL query that produces a per-study breakdown of data collection methodologies for January 2025, using the event-level collection logs.
Requirements
Return one row per (study_id, method) for events whose collected_at falls in [2025-01-01, 2025-02-01).
Your output must include:
study_id
study_name
method (e.g., survey, wearable, ehr)
events_collected = number of collection events in the month
unique_participants = distinct participants contributing events via that method
pct_of_study_events = events_collected / total_events_in_study_in_month (as a decimal)
method_rank_in_study = rank methods within each study by events_collected (1 = most events). Break ties by method ascending.
Order the final results by study_id, then method_rank_in_study, then method.
Table Definitions
studies
| column | type | description |
|---|
| study_id | INT | Primary key for the study |
| study_name | VARCHAR(200) | Human-readable study name |
| therapeutic_area | VARCHAR(100) | Area (e.g., cardiology, diabetes) |
| start_date | DATE | Study start date |
participants
| column | type | description |
|---|
| participant_id | INT | Primary key for participant |
| study_id | INT | FK to studies.study_id |
| enrolled_at | TIMESTAMP | Enrollment timestamp |
| status | VARCHAR(30) | e.g., active, withdrawn |
collection_events
| column | type | description |
|---|
| event_id | BIGINT | Primary key for collection event |
| participant_id | INT | FK to participants.participant_id |
| method | VARCHAR(30) | Data collection method (survey, wearable, ehr) |
| collected_at | TIMESTAMP | When the data was collected/ingested |
| payload_bytes | BIGINT | Size of the collected payload |
| is_valid | BOOLEAN | Whether the event passed validation |
Sample Data
studies
| study_id | study_name | therapeutic_area | start_date |
|---|
| 101 | Heart Health Remote Monitoring | Cardiology | 2024-11-15 |
| 102 | Diabetes Lifestyle Coaching | Endocrinology | 2024-12-01 |
participants
| participant_id | study_id | enrolled_at | status |
|---|
| 1 | 101 | 2024-12-20 09:00:00 | active |
| 2 | 101 | 2024-12-22 10:30:00 | active |
| 3 | 102 | 2025-01-05 08:15:00 | active |
| 4 | 102 | 2025-01-10 11:00:00 | withdrawn |
collection_events
| event_id | participant_id | method | collected_at | payload_bytes | is_valid |
|---|
| 9001 | 1 | wearable | 2025-01-03 12:00:00 | 2048 | true |
| 9002 | 1 | survey | 2025-01-03 12:05:00 | 512 | true |
| 9003 | 2 | wearable | 2025-01-04 09:00:00 | 4096 | true |
| 9004 | 3 | ehr | 2025-01-15 14:00:00 | 10240 | true |
| 9005 | 3 | survey | 2025-01-16 09:30:00 | 512 | true |
Expected Output (for sample data)
| study_id | study_name | method | events_collected | unique_participants | pct_of_study_events | method_rank_in_study |
|---|
| 101 | Heart Health Remote Monitoring | wearable | 2 | 2 | 0.6666666667 | 1 |
| 101 | Heart Health Remote Monitoring | survey | 1 | 1 | 0.3333333333 | 2 |
| 102 | Diabetes Lifestyle Coaching | ehr | 1 | 1 | 0.5000000000 | 1 |
| 102 | Diabetes Lifestyle Coaching | survey | 1 | 1 | 0.5000000000 | 2 |