
In distributed data processing, changing the number of partitions affects parallelism, shuffle cost, and skew. Spark exposes two APIs—repartition() and coalesce()—that both change partitioning but with different guarantees and costs.
Explain the difference between repartition() and coalesce() in Spark.
Address these points:
Assume the interviewer expects practical engineering depth: mention shuffle boundaries, narrow vs wide dependencies, and typical performance implications. You do not need to write Spark code, but you should be able to reason about execution plans and trade-offs.
A shuffle is a wide dependency where records move across the network to satisfy a new partitioning. A narrow dependency can be computed without moving data between executors, typically by reading a subset of parent partitions.
repartition(n) changes the partition count by performing a shuffle to evenly distribute records across n partitions. It is commonly used to increase parallelism or to rebalance skewed partitions at the cost of network I/O.
df = df.repartition(200) # forces shuffle to 200 partitions
coalesce(n) typically reduces the number of partitions by collapsing multiple parent partitions into fewer child partitions without a full shuffle. This avoids network cost but can produce uneven partition sizes and reduce parallelism if overused.
df = df.coalesce(20) # usually no shuffle; merges partitions
Reducing partitions can lower task scheduling overhead and the number of output files, but may create large partitions that become stragglers. Increasing partitions can improve parallelism but may increase overhead and shuffle volume.
Some Spark APIs allow coalescing with shuffle enabled (e.g., coalesce(n, shuffle=true) in certain contexts), which makes it behave closer to repartition by redistributing data for better balance. The key distinction is that repartition always shuffles, while coalesce is designed to avoid it by default.