
Uber relies on accurate driver location data to power dispatch, ETA estimation, and marketplace monitoring. In practice, GPS pings can be incomplete, delayed, or malformed, so analysts need a clear SQL-based approach to handling missing values.
Explain how you would handle missing values in a driver location dataset stored in PostgreSQL. Your answer should cover:
NULL latitude/longitude, empty strings in source fields, or obviously invalid coordinates.The interviewer is not looking for advanced modeling or geospatial interpolation. Focus on practical SQL data manipulation: profiling the data, standardizing missing values, creating quality flags, and explaining trade-offs between excluding bad rows and filling defaults.
Missing data is not always stored as NULL. In raw ingestion tables, it may appear as empty strings, placeholder text like 'unknown', or impossible coordinate values. A good SQL approach first normalizes these cases into a consistent representation so later logic is reliable.
SELECT
CASE
WHEN latitude IS NULL OR longitude IS NULL THEN 'missing'
WHEN latitude NOT BETWEEN -90 AND 90 THEN 'invalid'
WHEN longitude NOT BETWEEN -180 AND 180 THEN 'invalid'
ELSE 'valid'
END AS location_status
FROM driver_location_events;
Before fixing anything, measure how much data is affected. SQL aggregations help calculate counts and percentages of missing or invalid rows by city, date, or event source so the team can assess severity and prioritize remediation.
SELECT
city_id,
COUNT(*) AS total_rows,
SUM(CASE WHEN latitude IS NULL OR longitude IS NULL THEN 1 ELSE 0 END) AS missing_rows
FROM driver_location_events
GROUP BY city_id;
There is no single correct treatment for missing values. For metrics that require trustworthy coordinates, filtering may be best; for monitoring pipelines, flagging preserves row counts; for some reporting use cases, imputation may be acceptable if clearly documented.
SELECT
driver_id,
event_ts,
COALESCE(latitude, 0) AS latitude_filled,
COALESCE(longitude, 0) AS longitude_filled
FROM driver_location_events;
The right SQL logic depends on how the data will be used. Dispatch and ETA systems are sensitive to bad coordinates and usually require strict filtering, while operational dashboards may keep rows and add a data quality label to avoid hiding ingestion issues.
SELECT
driver_id,
event_ts,
CASE
WHEN latitude IS NULL OR longitude IS NULL THEN 'exclude_from_dispatch'
ELSE 'usable'
END AS downstream_action
FROM driver_location_events;