You’re a data engineer at a large e-commerce marketplace (~20M registered customers, millions of daily sessions). Multiple upstream systems (web signup, mobile app, call center) write to a shared customer profile store. Due to intermittent API timeouts and partial updates, the customer_profiles table often contains missing values (NULL or empty strings) for key fields like email and phone.
Marketing and fraud teams rely on these fields for order confirmations, account recovery, and risk checks. If missing data isn’t cleaned consistently, the business sees higher support costs, lower deliverability, and weaker fraud detection.
Write a SQL query that produces a cleaned view of customer contact data by imputing missing values using the most recent non-missing value available for the same customer.
customer_id.customer_idclean_emailclean_phonelast_updated_at (the latest profile update timestamp for that customer)updated_at.customer_id ascending.customer_profiles| Column | Type | Description |
|---|---|---|
| profile_event_id | BIGINT | Primary key for each profile update event |
| customer_id | BIGINT | Customer identifier (many events per customer) |
| VARCHAR(255) | Email captured in this update (may be NULL/empty) | |
| phone | VARCHAR(50) | Phone captured in this update (may be NULL/empty) |
| updated_at | TIMESTAMP | When this profile update event was written |
customer_profiles| profile_event_id | customer_id | phone | updated_at | |
|---|---|---|---|---|
| 9001 | 101 | 415-555-0101 | 2025-01-03 10:05:00 | |
| 9002 | 101 | ava.chen@example.com | 2025-01-10 09:20:00 | |
| 9003 | 102 | NULL | 2025-01-02 08:00:00 | |
| 9004 | 102 | 212-555-0199 | 2025-01-05 12:30:00 | |
| 9005 | 103 | mia.patel@example.com | 646-555-0144 | 2025-01-04 14:10:00 |
| customer_id | clean_email | clean_phone | last_updated_at |
|---|---|---|---|
| 101 | ava.chen@example.com | 415-555-0101 | 2025-01-10 09:20:00 |
| 102 | NULL | 212-555-0199 | 2025-01-05 12:30:00 |
| 103 | mia.patel@example.com | 646-555-0144 | 2025-01-04 14:10:00 |