You're reviewing a Spark pipeline that joins a very large transaction dataset with a smaller reference dataset. The job is slow and unstable because a few join keys dominate the data, causing some tasks to run much longer than the rest.
How do you optimize a Spark job that is experiencing severe data skew when joining a massive transaction table with a smaller merchant metadata table?
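One technique often suggested for this kind of skewed join is key salting: add a random salt to the join key on the large side, and replicate each row of the small side once per salt value so every salted key still finds its match. The sketch below only simulates the idea in plain Python. It is not Spark code, and the names `NUM_PARTITIONS`, `SALT_BUCKETS`, `partition_of`, and the sample data are all assumptions made for illustration. It compares the largest partition before and after salting and checks that the join result is unchanged.

```python
import random
import zlib
from collections import Counter

NUM_PARTITIONS = 8   # assumed number of shuffle partitions
SALT_BUCKETS = 16    # assumed number of salt values per key


def partition_of(merchant_id, salt=0):
    """Toy stand-in for a hash partitioner on the (merchant_id, salt) key."""
    return (zlib.crc32(merchant_id.encode("utf-8")) + salt) % NUM_PARTITIONS


# Hypothetical data: one hot merchant owns 9,000 of the 10,000 transactions.
transactions = [("m_hot", 1.0)] * 9000
transactions += [(f"m_{i}", 1.0) for i in range(100) for _ in range(10)]
merchants = {"m_hot": "Hot Merchant"}
merchants.update({f"m_{i}": f"Merchant {i}" for i in range(100)})

# Unsalted: every row for a given key lands in the same partition.
plain_sizes = Counter(partition_of(mid) for mid, _ in transactions)

# Salted: attach a random salt to each transaction row ...
rng = random.Random(0)
salted_tx = [
    (mid, rng.randrange(SALT_BUCKETS), amount) for mid, amount in transactions
]
# ... and replicate each merchant row once per salt value.
salted_merchants = {
    (mid, salt): name
    for mid, name in merchants.items()
    for salt in range(SALT_BUCKETS)
}
salted_sizes = Counter(partition_of(mid, salt) for mid, salt, _ in salted_tx)

# Join on the salted key; the salt is dropped from the output.
joined = [
    (mid, amount, salted_merchants[(mid, salt)])
    for mid, salt, amount in salted_tx
]

max_plain = max(plain_sizes.values())
max_salted = max(salted_sizes.values())
print("largest partition without salting:", max_plain)
print("largest partition with salting:", max_salted)
```

The trade-off is that the small side grows by a factor of `SALT_BUCKETS`. Because the merchant table in the scenario is described as small, broadcasting it so the transaction table is never shuffled on the join key may be the simpler first thing to try, with salting kept for cases where the smaller side does not fit in executor memory.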