Apache Spark runs distributed computations on clusters where executors and machines can fail. A strong answer explains how Spark recovers results without requiring full data replication.
Explain how Spark handles fault tolerance in batch and streaming workloads.
Address the following:
Assume the interviewer expects a systems-level explanation (driver, executors, tasks, stages) and the trade-offs (recompute vs. storage/replication). Mention at least one failure scenario (executor loss) and walk through recovery steps at a high level.