Repartition vs Coalesce Semantics | Dataford Interview Questions

Medium

Coding

Asked at 1 company (1 Infrastructure)

Problem

Context

In distributed data processing, changing the number of partitions affects parallelism, shuffle cost, and skew. Spark exposes two APIs—repartition() and coalesce()—that both change partitioning but with different guarantees and costs.

Core Question

Explain the difference between repartition() and coalesce() in Spark.

Address these points:

What each operation does to the number of partitions and the distribution of records.
Whether it triggers a full shuffle, and what that implies for network and disk I/O.
When you would prefer one over the other (e.g., increasing partitions, decreasing partitions, mitigating skew, preparing for joins/writes).

Scope Guidance

Assume the interviewer expects practical engineering depth: mention shuffle boundaries, narrow vs wide dependencies, and typical performance implications. You do not need to write Spark code, but you should be able to reason about execution plans and trade-offs.

Key Concepts

Shuffle vs Narrow Dependency

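A shuffle is a wide dependency: each child partition may need records from many parent partitions, so records are written out by the upstream tasks and fetched across the network to satisfy the new partitioning, which also marks a stage boundary. In a narrow dependency, each parent partition feeds at most one child partition, so the child can be computed in the same stage from a fixed subset of parent partitions without a shuffle.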
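The distinction can be sketched in plain Python. This is a toy model, not Spark code: it assumes the common definition that a dependency is narrow when each parent partition is consumed by at most one child partition, and the function name and the mapping format are hypothetical.

```python
from collections import Counter

def classify_dependency(child_to_parents):
    """Return 'narrow' if every parent partition feeds at most one child
    partition, otherwise 'wide' (which implies a shuffle boundary)."""
    fan_out = Counter(
        parent
        for parents in child_to_parents.values()
        for parent in parents
    )
    return "narrow" if all(count <= 1 for count in fan_out.values()) else "wide"

# Merging 4 partitions into 2: each child reads whole parent partitions.
coalesce_4_to_2 = {0: [0, 1], 1: [2, 3]}
# Redistributing 2 partitions into 2: every child may need records from every parent.
repartition_2_to_2 = {0: [0, 1], 1: [0, 1]}

print(classify_dependency(coalesce_4_to_2))     # narrow
print(classify_dependency(repartition_2_to_2))  # wide
```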

repartition(): full redistribution

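repartition(n) sets the partition count to n, up or down, by performing a full shuffle that spreads records roughly evenly across the n partitions. It is commonly used to increase parallelism or to rebalance skewed partitions, at the cost of shuffle writes to local disk and network I/O. When partitioning columns are supplied (for example repartition(n, "key")), records are hash-partitioned by those columns instead, so a hot key can still leave partitions skewed.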

df = df.repartition(200)  # forces shuffle to 200 partitions
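To see why a full redistribution evens out skew, here is a simplified plain-Python model. It is an illustration under stated assumptions, not Spark's implementation: the helper name is hypothetical, and it deals records out in strict round-robin order, ignoring details such as the starting offset Spark picks per partition.

```python
def repartition_round_robin(partitions, n):
    """Toy model of a full shuffle: deal every record out to n new
    partitions in turn, regardless of which partition it came from."""
    new_partitions = [[] for _ in range(n)]
    i = 0
    for part in partitions:
        for record in part:
            new_partitions[i % n].append(record)
            i += 1
    return new_partitions

skewed = [list(range(10)), [10], [11]]          # partition sizes 10, 1, 1
balanced = repartition_round_robin(skewed, 4)

print([len(p) for p in skewed])    # [10, 1, 1]
print([len(p) for p in balanced])  # [3, 3, 3, 3]
```

Every record potentially changes partition here, which is the reason the real operation has to pay for a shuffle.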

coalesce(): partition collapsing

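coalesce(n) reduces the number of partitions by merging multiple parent partitions into fewer child partitions without a full shuffle; requesting more partitions than currently exist leaves the count unchanged. This avoids the shuffle's disk and network cost, but records are not rebalanced, so partition sizes can be uneven, and an aggressive target such as coalesce(1) can also reduce the parallelism of the upstream computation in the same stage.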

df = df.coalesce(20)  # usually no shuffle; merges partitions
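The effect on partition sizes can be sketched with a toy model that works on sizes only. The assumptions are loud ones: the helper name is hypothetical, and it merges contiguous parent partitions, whereas Spark's real grouping also takes data locality into account.

```python
def coalesce_sizes(sizes, n):
    """Toy model: merge contiguous parent partitions into at most n children.
    Parent partitions are never split, so children inherit any imbalance."""
    if n >= len(sizes):
        return list(sizes)  # without a shuffle, the partition count cannot grow
    base, extra = divmod(len(sizes), n)
    merged, start = [], 0
    for child in range(n):
        width = base + (1 if child < extra else 0)  # parents merged into this child
        merged.append(sum(sizes[start:start + width]))
        start += width
    return merged

parent_sizes = [100, 1, 1, 1, 1, 1]
print(coalesce_sizes(parent_sizes, 2))   # [102, 3]
print(coalesce_sizes(parent_sizes, 10))  # [100, 1, 1, 1, 1, 1]
```

The large parent partition stays intact inside one child, which is how coalescing skewed data can create a straggler task.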

Skew and Parallelism Trade-offs

Reducing partitions can lower task scheduling overhead and the number of output files, but may create large partitions that become stragglers. Increasing partitions can improve parallelism but may increase overhead and shuffle volume.
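One way to reason about this trade-off is to derive a partition count from a target partition size. The sketch below is an assumption-laden rule of thumb rather than anything Spark prescribes: the 128 MiB default and the function name are illustrative choices not taken from the text above.

```python
import math

def target_partitions(total_bytes, target_partition_bytes=128 * 1024**2, min_partitions=1):
    """Estimate a partition count so that each partition holds roughly
    target_partition_bytes of data (assumed default: 128 MiB)."""
    return max(min_partitions, math.ceil(total_bytes / target_partition_bytes))

print(target_partitions(50 * 1024**3))  # 50 GiB at 128 MiB each -> 400
print(target_partitions(10 * 1024**2))  # 10 MiB -> 1
```

Too few partitions relative to this estimate risks stragglers; far more than this adds scheduling overhead and small output files.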

Coalesce with Shuffle Option

The RDD API allows coalescing with a shuffle, coalesce(n, shuffle=True), which redistributes data for better balance and can also increase the partition count; RDD repartition(n) is defined as exactly this call. The DataFrame coalesce(n) takes no shuffle flag. The key distinction is that repartition always shuffles, while coalesce avoids a shuffle by default.
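The guidance in this section can be condensed into a small decision helper. This is a hedged summary of the trade-offs discussed here, not an official rule: the function name and its parameters are hypothetical, and real decisions should also weigh data volume and the downstream join or write.

```python
def choose_operation(current, target, skewed=False):
    """Suggest an operation for moving from `current` to `target` partitions."""
    if target > current:
        return "repartition"  # coalesce without a shuffle cannot add partitions
    if skewed:
        return "repartition"  # only a shuffle rebalances uneven partitions
    if target < current:
        return "coalesce"     # merge partitions and skip the shuffle
    return "none"             # already at the target count and balanced

print(choose_operation(50, 200))                # repartition
print(choose_operation(200, 20))                # coalesce
print(choose_operation(200, 20, skewed=True))   # repartition
```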
