
In analytics work at Google, many datasets are not at the same grain. For example, a Google Ads campaign can map to multiple labels, and a conversion can be associated with multiple touchpoints. If you join these tables directly, your counts and sums can inflate.
Explain how you would handle a many-to-many join in SQL without inflating metrics. Your answer should cover:
The interviewer is not looking for obscure syntax. They want a clear explanation of table grain, deduplication strategy, and when to pre-aggregate or use an intermediate mapping table before joining. Use simple PostgreSQL examples where helpful, and frame your answer around common analytics reporting problems.
Before joining, identify the unit of each table: one row per customer, one row per session, one row per campaign-label pair, and so on. Many-to-many problems usually happen when neither side is unique on the join key, so the join multiplies rows.
SELECT campaign_id, COUNT(*) AS row_count
FROM campaign_labels
GROUP BY campaign_id
HAVING COUNT(*) > 1;
If both tables contain multiple rows for the same key, a direct join creates all combinations of matching rows. That can overstate metrics like revenue, conversions, impressions, or active users.
SELECT *
FROM ad_clicks c
JOIN conversion_touchpoints t
ON c.user_id = t.user_id;
A common fix is to aggregate each table to the reporting grain before joining. This reduces duplication because the join happens after each side has been collapsed to one row per key.
WITH clicks AS (
SELECT campaign_id, COUNT(*) AS total_clicks
FROM ad_clicks
GROUP BY campaign_id
), labels AS (
SELECT campaign_id, COUNT(DISTINCT label_id) AS label_count
FROM campaign_labels
GROUP BY campaign_id
)
SELECT c.campaign_id, c.total_clicks, l.label_count
FROM clicks c
LEFT JOIN labels l
ON c.campaign_id = l.campaign_id;
Sometimes you need the relationship itself, not just aggregated metrics. In that case, use a deduplicated bridge table or assign one canonical record per entity before joining, rather than joining raw many-to-many tables directly.
WITH ranked_labels AS (
SELECT campaign_id, label_id,
ROW_NUMBER() OVER (PARTITION BY campaign_id ORDER BY label_id) AS rn
FROM campaign_labels
)
SELECT campaign_id, label_id
FROM ranked_labels
WHERE rn = 1;
After rewriting the query, compare totals against trusted source tables at the base grain. If campaign revenue after the join exceeds the original campaign revenue total, the join is still duplicating data.
SELECT SUM(revenue) AS base_revenue
FROM campaign_daily_metrics;