Optimizing Large Analytical SQL Queries | Dataford Interview Questions

Problem

Context

Large analytical tables can make even correct SQL perform poorly if the query scans unnecessary data, joins inefficiently, or aggregates at the wrong stage. Interviewers want to hear how you reason about both SQL structure and execution behavior.

Question

You are asked to explain how you would optimize a query that is running slowly on a multi-terabyte analytics table, such as a Google Ads impression or click fact table. Describe how you would diagnose the bottleneck, rewrite the SQL to reduce work, and decide whether to change table design, indexing, partitioning, or pre-aggregation. You should also explain how you would validate that the optimized query is still correct.

Scope guidance

Go beyond generic advice like "add an index" or "use partitioning." A strong answer should cover query-shape improvements, join and aggregation strategy, filtering selectivity, execution-plan analysis, and trade-offs between one-time SQL rewrites and longer-term data-model changes.

Execution plan analysis

The first step is to inspect how PostgreSQL executes the query using EXPLAIN or EXPLAIN ANALYZE. This shows whether the main cost comes from sequential scans, expensive joins, sorts, hash aggregation, or poor row-count estimates.

EXPLAIN ANALYZE
SELECT campaign_id, COUNT(*)
FROM ad_impressions
WHERE event_date >= DATE '2024-01-01'
GROUP BY campaign_id;

Reduce data early

Filtering and pre-aggregating as early as possible usually lowers the amount of data carried into joins, sorts, and window functions. On very large tables, pushing selective predicates into the earliest possible stage can change the execution plan dramatically.

WITH filtered AS (
  SELECT campaign_id, user_id
  FROM ad_impressions
  WHERE event_date >= DATE '2024-01-01'
    AND event_date < DATE '2024-02-01'
)
SELECT campaign_id, COUNT(*)
FROM filtered
GROUP BY campaign_id;

Join strategy and join cardinality

Slow queries often come from joining large fact tables before reducing them, or from joining on non-selective keys that multiply rows unexpectedly. A good optimization answer should explain when to aggregate before joining and how to avoid unnecessary many-to-many expansion.

WITH clicks_by_campaign AS (
  SELECT campaign_id, COUNT(*) AS click_count
  FROM ad_clicks
  WHERE click_date >= DATE '2024-01-01'
  GROUP BY campaign_id
)
SELECT c.campaign_name, cb.click_count
FROM clicks_by_campaign cb
JOIN campaigns c
  ON cb.campaign_id = c.campaign_id;

Partitioning, indexing, and physical design

When SQL rewrites are not enough, physical design matters. Partitioning helps prune large date-based tables, while well-chosen indexes support selective filters, joins, and ordered access paths.

CREATE INDEX idx_ad_impressions_event_date_campaign
ON ad_impressions (event_date, campaign_id);

Correctness and measurement

Optimization is not complete until you verify both speed and correctness. You should compare row counts, aggregates, and edge cases between the old and new query, and measure performance improvements with the same filters and realistic data volume.

WITH old_result AS (
  SELECT campaign_id, COUNT(*) AS cnt
  FROM ad_impressions
  WHERE event_date >= DATE '2024-01-01'
  GROUP BY campaign_id
),
new_result AS (
  SELECT campaign_id, COUNT(*) AS cnt
  FROM ad_impressions
  WHERE event_date >= DATE '2024-01-01'
  GROUP BY campaign_id
)
SELECT *
FROM old_result
EXCEPT
SELECT *
FROM new_result;