Context
You’re on the Risk & Fraud Data Platform team at a large fintech payments company that processes card-not-present transactions for ~40K merchants. The business runs real-time fraud scoring, but the model training and regulatory reporting depend on a nightly batch ETL that produces curated feature tables and audit aggregates.
The pipeline is orchestrated in Apache Airflow 2.x and runs on Spark 3.4 (Kubernetes). It ingests raw events from Kafka (landed to object storage) and joins them with multiple slowly changing dimensions (merchant profiles, device reputation, chargeback outcomes) before writing Parquet to a lakehouse and publishing downstream tables used by analysts and modelers.
Over the last month, the job has started failing 2–4 times per week with OutOfMemoryError during the shuffle phase, usually between 02:00–04:00 UTC. When it fails, the team misses a 06:00 UTC SLA for fraud feature freshness, which has led to degraded model performance and a measurable increase in false positives (declines) for high-value customers. You’re asked to lead the debugging and remediation.
Current Architecture (relevant pieces)
- Ingestion: Kafka topics → hourly files in object storage (JSON), partitioned by event_date and event_hour
- Processing: Airflow triggers a Spark batch job (a skeletal sketch follows this list) that:
  - reads the last 24h of events,
  - filters/cleans,
  - joins to dimensions,
  - aggregates to merchant-day and user-day feature tables,
  - writes Parquet partitioned by event_date
- Serving: Feature tables queried by Trino and loaded to Snowflake for BI
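For illustration, a minimal PySpark sketch of the Processing step's shape, assuming illustrative object-storage paths and the column names from the schema below; the real job differs in detail (for example, SCD2 effective-date handling on merchant_dim is omitted here):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("fraud-feature-batch").getOrCreate()

# Read the last 24h of hourly JSON landings (path and date are illustrative).
raw = spark.read.json("s3://lake/raw/transactions/event_date=2024-11-29/")

# Filter/clean and derive the partition column.
clean = (raw
    .filter(F.col("amount_usd").isNotNull())
    .withColumn("event_date", F.to_date("event_ts")))

# Dimension joins (SCD2 effective-date logic on merchant_dim omitted).
merchant_dim = spark.read.parquet("s3://lake/dims/merchant_dim/")        # ~2M rows
device_rep   = spark.read.parquet("s3://lake/dims/device_reputation/")   # ~500M rows
enriched = (clean
    .join(merchant_dim, "merchant_id", "left")
    .join(device_rep, "device_id", "left"))

# Merchant-day feature aggregate (the user-day table is analogous).
merchant_day = (enriched
    .groupBy("merchant_id", "event_date")
    .agg(F.count("*").alias("txn_count"),
         F.sum("amount_usd").alias("gross_usd"),
         F.avg("risk_score").alias("avg_risk_score")))

merchant_day.write.mode("overwrite").partitionBy("event_date") \
    .parquet("s3://lake/features/merchant_day/")
```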
Scale Requirements
- Input volume: ~6 TB/day raw JSON (avg), up to 12 TB/day during peak shopping periods
- Event count: ~25–50 billion rows/day
- Peak partition skew: top 0.1% of merchants generate ~15% of events
- SLA: job must complete in < 3 hours (02:00–05:00 UTC window)
- Cluster: 200 executors typical (each 8 vCPU, 32 GB RAM), autoscaling allowed but cost-capped
- Cost constraint: incremental compute increase must be < 20% month-over-month
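A rough back-of-the-envelope on these numbers helps frame the memory question; all figures below are assumptions for illustration, not measurements:

```python
# Illustrative sizing arithmetic, not measured values.
executors, ram_gb = 200, 32
total_exec_ram_tb = executors * ram_gb / 1024        # ~6.25 TB across the cluster

raw_tb_per_day = 6                                    # ~12 TB at peak
# A day of raw JSON is roughly the size of all executor RAM combined, so the
# job cannot rely on everything fitting in memory; shuffles must spill cleanly.

shuffle_tb = 3                # assumed post-filter/projection shuffle volume
target_partition_mb = 128     # common shuffle-partition size target
shuffle_partitions = shuffle_tb * 1024 * 1024 // target_partition_mb
print(total_exec_ram_tb, shuffle_partitions)          # ~6.25 TB, ~24,576 partitions
```

The takeaway is only that a statically chosen shuffle partition count tuned for an average day will be badly off on a 12 TB peak day, which is why runtime-adaptive partition handling comes up later.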
Data Characteristics
Core fact table (transactions)
| field | type | notes |
|---|---|---|
| event_ts | timestamp | event time, can be late up to 48h |
| event_date | date | derived from event_ts |
| merchant_id | string | highly skewed |
| user_id | string | high cardinality |
| device_id | string | high cardinality |
| amount_usd | decimal(18,2) | |
| payment_method | string | small domain |
| ip | string | |
| risk_score | double | produced upstream |
Dimensions
merchant_dim (~2M rows, SCD2, updated daily)
device_reputation (~500M rows, updated hourly)
chargeback_outcomes (~200M rows, late arriving up to 30 days)
Known quality issues
- Late-arriving chargebacks cause backfills
- Occasional duplicate events (same event_id); a dedup sketch follows this list
- Schema evolution in raw JSON (new fields)
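Two of these issues lend themselves to standard PySpark mitigations. A hedged sketch, in which the schema, path, and session are assumptions:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# Schema evolution: read with an explicit, pinned schema so newly added raw JSON
# fields are ignored until the pipeline is ready for them (schema is illustrative).
expected_schema = StructType([
    StructField("event_id", StringType()),
    StructField("event_ts", TimestampType()),
    StructField("merchant_id", StringType()),
    # ...remaining fields from the fact-table schema above
])
raw = spark.read.schema(expected_schema).json("s3://lake/raw/transactions/")  # path assumed

# Duplicates: keep exactly one row per event_id, chosen deterministically
# (latest event_ts, stable tiebreaker), so reruns produce identical output.
w = Window.partitionBy("event_id").orderBy(F.col("event_ts").desc(), F.col("merchant_id"))
deduped = (raw
    .withColumn("rn", F.row_number().over(w))
    .filter(F.col("rn") == 1)
    .drop("rn"))
```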
The Failure Symptom
Airflow logs show the Spark application failing with errors like:
- java.lang.OutOfMemoryError: Java heap space
- org.apache.spark.shuffle.FetchFailedException followed by executor loss
- Spikes in shuffleBytesWritten and long GC pauses
The failure correlates with days when a small number of merchants have extreme traffic spikes.
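One cheap way to confirm that hypothesis, before touching the production job, is a key-distribution check on a suspect day read directly from the landed JSON (path, date, and top-N cutoff below are illustrative), paired with the Spark UI's per-task shuffle read sizes for the failing stage:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("merchant-skew-check").getOrCreate()

# How concentrated is one day's traffic in its hottest merchants?
events = spark.read.json("s3://lake/raw/transactions/event_date=2024-11-29/")
total = events.count()

top = (events.groupBy("merchant_id").count()
             .orderBy(F.col("count").desc())
             .limit(50))

for row in top.collect():
    share = 100.0 * row["count"] / total
    print(f'{row["merchant_id"]}: {row["count"]:,} events ({share:.2f}% of the day)')
```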
Your Task (what you must answer in the interview)
Provide a structured plan to debug and fix the shuffle OOM, and propose a production-ready remediation that prevents recurrence.
1) Debugging approach (be specific)
Explain how you would:
- Identify the exact stage(s) and transformation(s) causing the shuffle blow-up (e.g., wide joins, aggregations, repartitions).
- Use Spark UI / event logs to confirm whether the root cause is:
  - skewed keys,
  - too many shuffle partitions / too few,
  - oversized records,
  - spill-to-disk misconfiguration,
  - executor memory vs. overhead imbalance,
  - broadcast join failures,
  - or upstream data anomalies.
- Reproduce the issue safely (sampling strategy, replaying a problematic partition/day) without burning the full cluster.
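For the last point, one hedged approach is to replay just the problematic event_date on a small, capped session, keeping the hot merchants intact (so the skew survives) while sampling the long tail; every path, number, and threshold below is illustrative:

```python
from pyspark.sql import SparkSession, functions as F

# Small, capped session for reproduction; values are illustrative, not tuned.
spark = (SparkSession.builder
    .appName("shuffle-oom-repro")
    .config("spark.dynamicAllocation.maxExecutors", "20")
    .config("spark.sql.shuffle.partitions", "2000")
    .getOrCreate())

events = spark.read.json("s3://lake/raw/transactions/event_date=2024-11-29/")

# Keep all traffic for the hottest merchants so the skew is preserved,
# and take a small fraction of everything else.
hot = [r["merchant_id"] for r in
       (events.groupBy("merchant_id").count()
              .orderBy(F.col("count").desc())
              .limit(100)
              .collect())]

repro_input = (events.filter(F.col("merchant_id").isin(hot))
    .unionByName(events.filter(~F.col("merchant_id").isin(hot))
                       .sample(fraction=0.05, seed=42)))
# ...then run the same join/aggregate code path against repro_input.
```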
2) Remediation design
Propose changes across code, configuration, and data layout. Your solution should include:
- Concrete Spark changes (e.g., adaptive query execution, skew join handling, salting strategy, map-side aggregation, avoiding unnecessary repartitions, broadcast hints with thresholds); a hedged configuration sketch follows this list.
- Resource tuning (executor sizing, memoryOverhead, shuffle service, local disk sizing, Kubernetes specifics).
- Data layout improvements (partitioning strategy, bucketing, file sizing, compaction) to reduce shuffle and improve join locality.
- Handling late-arriving and backfill scenarios without reintroducing OOM risk.
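To make the first bullet concrete, a hedged sketch of the usual Spark 3.4 levers; the thresholds and salt width are illustrative starting points, and events / merchant_dim stand in for the real DataFrames built earlier in the job:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()
# events / merchant_dim: the cleaned fact DataFrame and merchant dimension (assumed).

# Adaptive query execution: let Spark split skewed shuffle partitions and
# coalesce tiny ones at runtime (thresholds illustrative).
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionFactor", "5")
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "256m")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "128m")

# Small dimension (~2M rows): broadcast it so the join never shuffles the fact table.
enriched = events.join(broadcast(merchant_dim), "merchant_id", "left")

# If AQE alone is not enough for the merchant-day aggregation, salt the hot key:
# aggregate per (merchant, date, salt) first so no single task owns a hot
# merchant's whole day, then roll the partials up. The final result is unchanged.
N = 32  # illustrative salt width
partial = (enriched
    .withColumn("salt", (F.rand(seed=7) * N).cast("int"))
    .groupBy("merchant_id", "event_date", "salt")
    .agg(F.count("*").alias("cnt"), F.sum("amount_usd").alias("amt")))
merchant_day = (partial
    .groupBy("merchant_id", "event_date")
    .agg(F.sum("cnt").alias("txn_count"), F.sum("amt").alias("gross_usd")))
```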
3) Operationalization
- Airflow changes (retries, backoff, circuit breakers, automatic rollback to a safe mode); a sketch of the retry/SLA guardrails follows this list.
- Monitoring/alerting: what metrics you’d add, where, and what thresholds would page on-call.
- A validation plan to ensure correctness (no silent data loss/duplication) and performance regression testing.
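For the Airflow bullet, a hedged sketch of the guardrails; operator choice, ids, paths, and thresholds are illustrative, and a circuit breaker could be a preceding short-circuit task that checks input volume before launching Spark:

```python
import pendulum
from datetime import timedelta
from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

# Bounded retries with backoff, a hard timeout inside the 02:00-05:00 UTC window,
# and an SLA that pages before the 06:00 UTC freshness deadline. Values illustrative.
with DAG(
    dag_id="fraud_feature_batch",
    schedule_interval="0 2 * * *",
    start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
    catchup=False,
    default_args={
        "retries": 2,
        "retry_delay": timedelta(minutes=10),
        "retry_exponential_backoff": True,
        "execution_timeout": timedelta(hours=3),
    },
) as dag:
    build_features = SparkSubmitOperator(
        task_id="build_feature_tables",
        application="/opt/jobs/build_features.py",   # path assumed
        conf={"spark.sql.adaptive.enabled": "true"},
        sla=timedelta(hours=3, minutes=30),           # alert well before the 06:00 UTC SLA
    )
```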
Constraints
- Must remain on Spark 3.4 and the existing lakehouse storage (Parquet on object storage).
- You may introduce one additional managed component (e.g., a metrics store, a shuffle service, or a table format) if justified.
- Compliance: audit tables must be reproducible; transformations must be deterministic and idempotent (a write-mode sketch follows this list).
- Team skillset: strong Spark + Airflow; moderate Kubernetes; limited Scala (mostly PySpark).
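On the compliance constraint, one common way to keep reruns and Airflow retries idempotent with plain Parquet (no new component) is dynamic partition overwrite, so a retried run replaces exactly the event_date partitions it recomputes instead of appending duplicates; a table format such as Delta or Iceberg would be the natural candidate for the one allowed extra component if atomic commits are also required. A minimal sketch, with names and paths assumed:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Overwrite only the partitions present in this run's output; unrelated
# event_date partitions already in the table are left untouched.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

(merchant_day.write            # merchant_day: the aggregated feature DataFrame (assumed)
    .mode("overwrite")
    .partitionBy("event_date")
    .parquet("s3://lake/features/merchant_day/"))
```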
What we’re evaluating
- Your ability to reason from symptoms to root cause using Spark internals
- Practical mitigation strategies for shuffle-heavy workloads at scale
- Production readiness: observability, failure modes, and cost-aware tuning
- Data correctness under retries, late data, and backfills