Spark Narrow vs Wide Dependencies

Medium

Coding

Problem

Context

In distributed systems interviews, Spark dependency types are used to test whether you understand execution planning, shuffles, and performance trade-offs.

Question

Explain the difference between narrow and wide dependencies in Spark.

Your answer should cover:

What each dependency type means at the partition level
Which operations typically create each type of dependency
Why wide dependencies usually trigger shuffles and are more expensive
How dependency type affects fault tolerance, stage boundaries, and performance tuning

Scope Guidance

The interviewer expects a systems-oriented explanation rather than Spark API memorization. You should define both terms clearly, compare them directly, and connect them to execution behavior such as stage splitting, data movement, and recovery after failure. Brief examples using common transformations like map, filter, reduceByKey, or join are enough.

Key Concepts

Narrow Dependency

A narrow dependency means each parent partition is used by at most one child partition, so each child partition depends on a small, fixed set of parent partitions, often exactly one. Because each output partition can be computed locally without redistributing records across the cluster, narrow dependencies are usually cheaper and pipeline well within the same stage.

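from pyspark import SparkContext
rdd1 = SparkContext.getOrCreate().parallelize(range(10))  # sample input RDD
rdd2 = rdd1.map(lambda x: x * 2)          # narrow: each output partition reads one input partition
rdd3 = rdd2.filter(lambda x: x % 2 == 0)  # narrow: runs in the same stage as the map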
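
To make the partition-level picture concrete, here is a minimal pure-Python sketch. It is a toy model, not Spark code: a dataset is represented as a list of partitions, and the helpers map_partitions and filter_partitions are hypothetical names introduced only for illustration. The point is that output partition i is built from input partition i alone.

```python
# Toy model (assumption): a dataset is a list of partitions,
# and each partition is a plain list of records.

def map_partitions(partitions, fn):
    """Apply fn to every record; output partition i reads only input partition i."""
    return [[fn(x) for x in part] for part in partitions]

def filter_partitions(partitions, pred):
    """Keep matching records; output partition i reads only input partition i."""
    return [[x for x in part if pred(x)] for part in partitions]

parent = [[1, 2, 3], [4, 5], [6]]  # three partitions
doubled = map_partitions(parent, lambda x: x * 2)
kept = filter_partitions(doubled, lambda x: x % 4 == 0)

print(doubled)  # [[2, 4, 6], [8, 10], [12]]
print(kept)     # [[4], [8], [12]]
```

The number of partitions never changes and no record leaves its partition, which is why a chain like this can run as one task per partition.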

Wide Dependency

A wide dependency means a child partition may depend on data from many, possibly all, parent partitions; equivalently, a single parent partition can feed several child partitions. This usually requires redistributing data across executors so matching keys or grouped records end up together, which introduces a shuffle.

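from pyspark import SparkContext
rdd = SparkContext.getOrCreate().parallelize(range(100))  # sample input RDD
pairs = rdd.map(lambda x: (x % 10, x))            # narrow: key each record by its last digit
result = pairs.reduceByKey(lambda a, b: a + b)    # wide: records with the same key are shuffled together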
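
The sketch below models what the shuffle does at the partition level, again in pure Python rather than Spark. The function reduce_by_key_toy is a hypothetical helper, and key % num_out stands in for a hash partitioner under the assumption that keys are non-negative integers. It also records which parent partitions contributed to each output partition.

```python
from functools import reduce

def reduce_by_key_toy(partitions, fn, num_out):
    """Route each (key, value) to an output bucket, then reduce values per key."""
    buckets = [{} for _ in range(num_out)]
    sources = [set() for _ in range(num_out)]  # parent partitions read by each bucket
    for parent_idx, part in enumerate(partitions):
        for key, value in part:
            target = key % num_out  # stand-in for a hash partitioner (integer keys assumed)
            buckets[target].setdefault(key, []).append(value)
            sources[target].add(parent_idx)
    reduced = [{k: reduce(fn, vs) for k, vs in b.items()} for b in buckets]
    return reduced, sources

data = [[1, 2, 3, 4], [11, 12, 13, 14], [21, 22, 23, 24]]
keyed = [[(x % 10, x) for x in part] for part in data]
reduced, sources = reduce_by_key_toy(keyed, lambda a, b: a + b, num_out=2)

print(reduced)                       # [{2: 36, 4: 42}, {1: 33, 3: 39}]
print([sorted(s) for s in sources])  # [[0, 1, 2], [0, 1, 2]]
```

Every output partition reads from all three parent partitions, which is the defining property of a wide dependency.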

Shuffle and Stage Boundaries

Spark can pipeline narrow transformations into a single stage because partitions can be processed locally. A wide dependency creates a shuffle boundary, so Spark splits the job into multiple stages separated by data exchange.
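
As a rough illustration of stage splitting, the following sketch cuts a linear list of operation names into stages at each wide operation. It is a simplification under stated assumptions: WIDE_OPS and split_into_stages are hypothetical names, the lineage is treated as a straight chain (a real join has two parents), and join is always treated as wide even though Spark can avoid the shuffle when both inputs are already partitioned the same way.

```python
# Assumption: only these example operations are treated as wide.
WIDE_OPS = {"reduceByKey", "join"}

def split_into_stages(ops):
    """Start a new stage at every wide operation in a linear chain of ops."""
    stages = [[]]
    for op in ops:
        if op in WIDE_OPS and stages[-1]:
            stages.append([])  # shuffle boundary: close the current stage
        # The wide op is listed in the stage that reads the shuffled data;
        # in Spark, the map-side shuffle write happens at the end of the previous stage.
        stages[-1].append(op)
    return stages

pipeline = ["map", "filter", "reduceByKey", "map", "join", "filter"]
stages = split_into_stages(pipeline)
print(stages)
# [['map', 'filter'], ['reduceByKey', 'map'], ['join', 'filter']]
```

Narrow operations that follow a shuffle are pipelined into the same stage as the shuffle read, so two wide operations yield three stages here.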

Fault Recovery

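With narrow dependencies, Spark can recompute a lost partition from the small, fixed set of upstream partitions it depends on, so recovery touches only a thin slice of the lineage. With wide dependencies, a lost partition may need input from many parent partitions: Spark re-fetches the shuffle output if it is still available, and otherwise reruns the upstream tasks that produced the missing shuffle data, which is typically more expensive.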
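
The recovery cost difference can be sketched by walking a lineage backwards from one lost partition and counting how many upstream partitions must be recomputed. This is a worst-case toy model: it assumes one-to-one narrow dependencies, assumes nothing is cached or checkpointed, and assumes shuffle outputs are no longer available. The function name partitions_to_recompute is hypothetical.

```python
def partitions_to_recompute(chain, lost_child, num_partitions):
    """Count upstream partitions needed to rebuild one lost partition.

    chain lists dependency types ("narrow" or "wide") from the source RDD
    to the RDD that lost a partition.
    """
    needed = {lost_child}
    total = 0
    for dep in reversed(chain):
        if dep == "wide":
            # Any parent partition may hold records for the needed partitions.
            needed = set(range(num_partitions))
        elif dep != "narrow":
            raise ValueError(f"unknown dependency type: {dep}")
        # Narrow (one-to-one): the same partition indices are needed upstream.
        total += len(needed)
    return total

narrow_cost = partitions_to_recompute(["narrow", "narrow"], lost_child=3, num_partitions=100)
wide_cost = partitions_to_recompute(["narrow", "wide"], lost_child=3, num_partitions=100)
print(narrow_cost)  # 2
print(wide_cost)    # 200
```

In practice the wide case is often cheaper than this worst case, because shuffle files that survive can be read again instead of being regenerated.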

Performance Implications

Narrow dependencies are generally faster because they avoid the costs of a shuffle: serialization, disk writes and spills, network transfer, and shuffle coordination. Wide dependencies are often the main source of latency and resource pressure in Spark jobs, so minimizing unnecessary shuffles, and the volume of data each remaining shuffle moves, is a key optimization skill.
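
One common tuning idea is to shrink the data before it crosses the shuffle. In Spark, reduceByKey combines values within each partition before shuffling, whereas groupByKey ships every record. The sketch below is a toy count of shuffled records for a sum-by-key job, not a measurement of Spark itself; shuffled_records is a hypothetical helper.

```python
def shuffled_records(partitions, combine_before_shuffle):
    """Count records that would cross the shuffle in a toy sum-by-key job."""
    count = 0
    for part in partitions:
        if combine_before_shuffle:
            # Local pre-aggregation: one record per distinct key per partition.
            count += len({key for key, _ in part})
        else:
            # No pre-aggregation: every record is sent.
            count += len(part)
    return count

keyed_parts = [
    [(x % 10, x) for x in range(start, start + 1000)]
    for start in (0, 1000, 2000, 3000)
]
without_combine = shuffled_records(keyed_parts, combine_before_shuffle=False)
with_combine = shuffled_records(keyed_parts, combine_before_shuffle=True)
print(without_combine)  # 4000
print(with_combine)     # 40
```

The dependency is still wide in both cases; the saving comes from moving far fewer records across it.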
