Apache Spark runs distributed computations on clusters where executors and machines can fail. A strong answer explains how Spark recovers results without requiring full data replication.
Explain how Spark handles fault tolerance in batch and streaming workloads.
Address the following:
Assume the interviewer expects a systems-level explanation (driver, executors, tasks, stages) and the trade-offs (recompute vs. storage/replication). Mention at least one failure scenario (executor loss) and walk through recovery steps at a high level.
Spark tracks transformations as a directed acyclic graph (DAG) of operations. If a partition is lost, Spark can re-run the portion of the DAG needed to rebuild only the missing partitions, rather than restarting the whole job.
rdd2 = rdd1.map(f).filter(g) # lineage: rdd1 -> map -> filter
The scheduler splits the DAG into stages separated by shuffle boundaries. On executor/task failure, Spark retries failed tasks (and sometimes entire stages) and reschedules them on healthy executors using the same deterministic computation.
Shuffle introduces materialized intermediate data (map outputs) that may be stored on local disks of executors. If those outputs are lost due to executor loss, Spark can recompute the upstream map tasks to regenerate shuffle blocks and then rerun downstream reduce tasks as needed.
Persisting caches computed partitions for performance but is not a durable recovery mechanism because cached blocks can be lost with an executor. Checkpointing truncates lineage by writing a dataset to reliable storage, improving fault recovery and preventing expensive recomputation for very long lineages or iterative algorithms.
rdd.checkpoint() # materialize to reliable storage to cut lineage
In Structured Streaming, Spark uses checkpoint logs (offsets, state store metadata) to recover progress after failure. Exactly-once is achieved for supported sinks via idempotent/transactional writes plus replay from offsets; otherwise, semantics may degrade to at-least-once.