
In distributed systems interviews, Spark dependency types are used to test whether you understand execution planning, shuffles, and performance trade-offs.
Explain the difference between narrow and wide dependencies in Spark.
Your answer should cover a clear definition of both terms, a direct comparison between them, and their effect on execution behavior: stage splitting, data movement, and recovery after failure. The interviewer expects a systems-oriented explanation rather than Spark API memorization. Brief examples using common transformations like map, filter, reduceByKey, or join are enough.
A narrow dependency means each partition of the parent RDD is used by at most one partition of the child RDD, so each child partition depends on a small, fixed set of parent partitions, often exactly one. Because a child partition can be computed from its own parents without redistributing records across the cluster, narrow dependencies are usually cheaper and pipeline well within a single stage.
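# rdd1 is assumed to be an existing RDD of integers
rdd2 = rdd1.map(lambda x: x * 2)           # narrow: each output partition reads one input partition
rdd3 = rdd2.filter(lambda x: x % 4 == 0)   # narrow: records are dropped in place, nothing moves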
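To make the partition-level picture concrete, here is a minimal sketch in plain Python rather than Spark. It assumes a toy model in which an "RDD" is just a list of partitions; the helper names map_partitions_locally and filter_partitions_locally are hypothetical and exist only for this illustration.

```python
# Toy model only: an "RDD" is a list of partitions, each partition a list of records.

def map_partitions_locally(partitions, fn):
    """Apply fn to every record; each output partition reads exactly one input partition."""
    return [[fn(x) for x in part] for part in partitions]

def filter_partitions_locally(partitions, predicate):
    """Drop records in place; no record ever moves to a different partition."""
    return [[x for x in part if predicate(x)] for part in partitions]

rdd1 = [[1, 2, 3], [4, 5], [6, 7, 8]]
rdd2 = map_partitions_locally(rdd1, lambda x: x * 2)
rdd3 = filter_partitions_locally(rdd2, lambda x: x % 4 == 0)

print(rdd2)  # [[2, 4, 6], [8, 10], [12, 14, 16]]
print(rdd3)  # [[4], [8], [12, 16]]
```

The partition count and the partition each record lives in never change, which is why these two steps could run back to back inside one task per partition.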
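A wide dependency (also called a shuffle dependency) means each parent partition may contribute records to many child partitions, so a single child partition can depend on every parent partition. Spark has to redistribute data across executors so that records with the same key end up in the same partition, and that redistribution is the shuffle.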
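# rdd is assumed to be an existing RDD of integers
pairs = rdd.map(lambda x: (x % 10, x))            # narrow: builds (key, value) pairs in place
result = pairs.reduceByKey(lambda a, b: a + b)    # wide: shuffles so each key lands in one partition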
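The same toy model can sketch what the shuffle does. This is plain Python, not Spark's implementation: shuffle_by_key and reduce_by_key_locally are hypothetical helpers, and routing by key % n is an assumed stand-in for a hash partitioner. The sketch also records which parent partitions fed each output partition.

```python
# Toy model only: partitions are lists of (key, value) records with integer keys.

def shuffle_by_key(partitions, num_output_partitions):
    """Route each record to output partition key % n and track where it came from."""
    outputs = [[] for _ in range(num_output_partitions)]
    parents = [set() for _ in range(num_output_partitions)]
    for parent_index, part in enumerate(partitions):
        for key, value in part:
            target = key % num_output_partitions
            outputs[target].append((key, value))
            parents[target].add(parent_index)
    return outputs, parents

def reduce_by_key_locally(partition, fn):
    """Combine values per key inside one already-shuffled partition."""
    acc = {}
    for key, value in partition:
        acc[key] = fn(acc[key], value) if key in acc else value
    return sorted(acc.items())

pairs = [[(1, 10), (2, 20)], [(1, 30), (3, 40)], [(2, 50), (3, 60)]]
shuffled, parents = shuffle_by_key(pairs, 2)
result = [reduce_by_key_locally(p, lambda a, b: a + b) for p in shuffled]

print(parents)  # [{0, 2}, {0, 1, 2}]
print(result)   # [[(2, 70)], [(1, 40), (3, 100)]]
```

Output partition 1 needed records from all three parent partitions, which is exactly what makes the dependency wide.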
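Spark can pipeline consecutive narrow transformations into a single stage because each task processes its own partition end to end without waiting on other partitions. A wide dependency creates a shuffle boundary: the upstream stage has to finish writing its shuffle output before the downstream stage can read it, so Spark splits the job into stages at each shuffle.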
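The stage-splitting rule can be sketched as a small function. This is a simplified illustration, not Spark's scheduler: split_into_stages and the (name, dependency type) lineage format are assumptions made for the example, and the map-side shuffle write that really belongs to the upstream stage is ignored.

```python
NARROW, WIDE = "narrow", "wide"

def split_into_stages(lineage):
    """Group an ordered list of (name, dependency_type) steps into stages.

    A wide step opens a new stage because its input has to be shuffled first;
    narrow steps are appended to whatever stage is currently open.
    """
    stages = []
    for name, dependency in lineage:
        if not stages or dependency == WIDE:
            stages.append([])
        stages[-1].append(name)
    return stages

lineage = [
    ("map", NARROW),
    ("filter", NARROW),
    ("reduceByKey", WIDE),
    ("mapValues", NARROW),
]
stages = split_into_stages(lineage)
print(stages)  # [['map', 'filter'], ['reduceByKey', 'mapValues']]
```

In an interview, walking a lineage like this and pointing at each shuffle boundary is usually enough to show you understand how a job becomes stages.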
With narrow dependencies, Spark can recompute a lost partition from only the parent partitions it depends on. With wide dependencies, a lost partition may depend on every parent partition: if the shuffle files are still available Spark can re-read them, but if they were lost along with an executor it has to rerun the upstream tasks that produced them, which is typically much more expensive.
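A rough way to reason about recovery cost is to count the parent partitions a lost child partition needs. The sketch below is a toy model under two stated assumptions: the narrow case is a one-to-one dependency (as with map or filter), and the wide case is the worst case where no shuffle output survives. The function name partitions_to_recompute is hypothetical.

```python
def partitions_to_recompute(lost_child, dependency, num_parent_partitions):
    """Return the parent partition indices needed to rebuild one lost child partition."""
    if dependency == "narrow":
        # Assumed one-to-one: child i depends only on parent i.
        return {lost_child}
    if dependency == "wide":
        # Assumed worst case: every parent may hold records for this child.
        return set(range(num_parent_partitions))
    raise ValueError(f"unknown dependency type: {dependency}")

narrow_cost = partitions_to_recompute(2, "narrow", 4)
wide_cost = partitions_to_recompute(2, "wide", 4)
print(narrow_cost)  # {2}
print(wide_cost)    # {0, 1, 2, 3}
```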
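Narrow dependencies are generally faster because they avoid the serialization, shuffle writes to disk, network transfer, and coordination that a shuffle involves. Wide dependencies are often the main source of latency and resource pressure in Spark jobs, so minimizing unnecessary shuffles is a key optimization skill. For example, a join of two RDDs that already share the same partitioner does not need to shuffle either side.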
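When a shuffle cannot be avoided, its volume often can be reduced. reduceByKey combines values per key inside each partition before the shuffle, so fewer records cross the network. The sketch below is a toy count in plain Python, not a measurement of Spark; shuffled_record_count is a hypothetical helper, and it assumes one record per distinct key per partition is sent when combining first.

```python
def shuffled_record_count(partitions, combine_before_shuffle):
    """Count records that would cross the shuffle boundary in this toy model."""
    total = 0
    for part in partitions:
        if combine_before_shuffle:
            # After a local combine, each partition sends one record per distinct key.
            total += len({key for key, _ in part})
        else:
            # Without a local combine, every record is sent as-is.
            total += len(part)
    return total

pairs = [[(1, 10), (1, 20), (2, 30)], [(1, 40), (1, 50), (1, 60)]]
without_combine = shuffled_record_count(pairs, combine_before_shuffle=False)
with_combine = shuffled_record_count(pairs, combine_before_shuffle=True)
print(without_combine, with_combine)  # 6 3
```

The saving grows with the number of repeated keys per partition, which is why skewed or low-cardinality keys benefit most from combining before the shuffle.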