





Outliers can heavily affect averages, thresholds, and business reporting. In SQL workflows, analysts are often expected to identify unusual values and decide whether to remove, cap, or flag them before further analysis.
Explain how you would detect and handle outliers in a dataset using SQL. Your answer should cover:
The interviewer expects a practical conceptual answer: describe the methods, trade-offs, and how you would implement them in SQL at a high level. You do not need advanced statistical theory, but you should show sound judgment about data quality and business context.
The simplest approach is to define business rules that mark values outside an acceptable range as outliers. This works well when the domain already provides natural limits, such as negative prices, impossible ages, or transaction amounts above a known cap.
SELECT order_id, amount,
CASE WHEN amount < 0 OR amount > 10000 THEN 'outlier' ELSE 'normal' END AS outlier_flag
FROM orders;
A common statistical approach is to compare each value to the dataset mean and standard deviation. Values far from the mean, often more than 2 or 3 standard deviations away, can be flagged as potential outliers, though this method is sensitive to skewed data.
SELECT value,
CASE
WHEN ABS(value - avg_value) > 3 * stddev_value THEN 'outlier'
ELSE 'normal'
END AS outlier_flag
FROM (
SELECT value,
AVG(value) OVER () AS avg_value,
STDDEV_SAMP(value) OVER () AS stddev_value
FROM measurements
) t;
Percentile-based methods are often more robust than mean-based methods because they are less affected by extreme values. A common technique is the interquartile range (IQR), where values below Q1 - 1.5IQR or above Q3 + 1.5IQR are flagged.
SELECT value,
CASE
WHEN value < q1 - 1.5 * (q3 - q1) OR value > q3 + 1.5 * (q3 - q1) THEN 'outlier'
ELSE 'normal'
END AS outlier_flag
FROM (
SELECT value,
PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY value) OVER () AS q1,
PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY value) OVER () AS q3
FROM measurements
) t;
Detecting an outlier is only the first step; the next decision is whether to remove it, cap it, or keep it with a flag. The correct choice depends on whether the value is a data error, a rare but valid event, or a signal that deserves separate analysis.
SELECT customer_id,
amount,
CASE WHEN amount > 5000 THEN 5000 ELSE amount END AS capped_amount,
CASE WHEN amount > 5000 THEN true ELSE false END AS was_capped
FROM orders;
Outlier handling should be documented and reproducible because it changes downstream metrics. In interviews, it is important to explain that you would preserve the raw value, add a flag, and make the transformation explicit rather than silently deleting rows.
SELECT id,
raw_amount,
CASE WHEN raw_amount > 5000 THEN true ELSE false END AS is_outlier,
CASE WHEN raw_amount > 5000 THEN 5000 ELSE raw_amount END AS adjusted_amount
FROM transactions;