
Large analytical tables can make even correct SQL perform poorly if the query scans unnecessary data, joins inefficiently, or aggregates at the wrong stage. Interviewers want to hear how you reason about both SQL structure and execution behavior.
You are asked to explain how you would optimize a query that is running slowly on a multi-terabyte analytics table, such as a Google Ads impression or click fact table. Describe how you would diagnose the bottleneck, rewrite the SQL to reduce work, and decide whether to change table design, indexing, partitioning, or pre-aggregation. You should also explain how you would validate that the optimized query is still correct.
Go beyond generic advice like "add an index" or "use partitioning." A strong answer should cover query-shape improvements, join and aggregation strategy, filtering selectivity, execution-plan analysis, and trade-offs between one-time SQL rewrites and longer-term data-model changes.
The first step is to inspect how PostgreSQL executes the query using EXPLAIN or EXPLAIN ANALYZE. This shows whether the main cost comes from sequential scans, expensive joins, sorts, hash aggregation, or poor row-count estimates.
EXPLAIN ANALYZE
SELECT campaign_id, COUNT(*)
FROM ad_impressions
WHERE event_date >= DATE '2024-01-01'
GROUP BY campaign_id;
Filtering and pre-aggregating as early as possible usually lowers the amount of data carried into joins, sorts, and window functions. On very large tables, pushing selective predicates into the earliest possible stage can change the execution plan dramatically.
WITH filtered AS (
SELECT campaign_id, user_id
FROM ad_impressions
WHERE event_date >= DATE '2024-01-01'
AND event_date < DATE '2024-02-01'
)
SELECT campaign_id, COUNT(*)
FROM filtered
GROUP BY campaign_id;
Slow queries often come from joining large fact tables before reducing them, or from joining on non-selective keys that multiply rows unexpectedly. A good optimization answer should explain when to aggregate before joining and how to avoid unnecessary many-to-many expansion.
WITH clicks_by_campaign AS (
SELECT campaign_id, COUNT(*) AS click_count
FROM ad_clicks
WHERE click_date >= DATE '2024-01-01'
GROUP BY campaign_id
)
SELECT c.campaign_name, cb.click_count
FROM clicks_by_campaign cb
JOIN campaigns c
ON cb.campaign_id = c.campaign_id;
When SQL rewrites are not enough, physical design matters. Partitioning helps prune large date-based tables, while well-chosen indexes support selective filters, joins, and ordered access paths.
CREATE INDEX idx_ad_impressions_event_date_campaign
ON ad_impressions (event_date, campaign_id);
Optimization is not complete until you verify both speed and correctness. You should compare row counts, aggregates, and edge cases between the old and new query, and measure performance improvements with the same filters and realistic data volume.
WITH old_result AS (
SELECT campaign_id, COUNT(*) AS cnt
FROM ad_impressions
WHERE event_date >= DATE '2024-01-01'
GROUP BY campaign_id
),
new_result AS (
SELECT campaign_id, COUNT(*) AS cnt
FROM ad_impressions
WHERE event_date >= DATE '2024-01-01'
GROUP BY campaign_id
)
SELECT *
FROM old_result
EXCEPT
SELECT *
FROM new_result;