Context
You’re on the Risk & Fraud Data Platform team at a large fintech payments company that processes card-not-present transactions for ~40K merchants. The business runs real-time fraud scoring, but the model training and regulatory reporting depend on a nightly batch ETL that produces curated feature tables and audit aggregates.
The pipeline is orchestrated in Apache Airflow 2.x and runs on Spark 3.4 (Kubernetes). It ingests raw events from Kafka (landed to object storage) and joins them with multiple slowly changing dimensions (merchant profiles, device reputation, chargeback outcomes) before writing Parquet to a lakehouse and publishing downstream tables used by analysts and modelers.
Over the last month, the job has started failing 2–4 times per week with OutOfMemoryError during the shuffle phase, usually between 02:00–04:00 UTC. When it fails, the team misses a 06:00 UTC SLA for fraud feature freshness, which has led to degraded model performance and a measurable increase in false positives (declines) for high-value customers. You’re asked to lead the debugging and remediation.
Current Architecture (relevant pieces)
- Ingestion: Kafka topics → hourly files in object storage (JSON), partitioned by event_date and event_hour
- Processing: Airflow triggers a Spark batch job (a skeletal sketch follows this list) that:
  - reads the last 24h of events,
  - filters/cleans,
  - joins to dimensions,
  - aggregates to merchant-day and user-day feature tables,
  - writes Parquet partitioned by event_date
- Serving: Feature tables queried by Trino and loaded to Snowflake for BI
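For illustration, a minimal PySpark sketch of the Processing step's shape, assuming illustrative object-storage paths and the column names from the schema below; the real job differs in detail (for example, SCD2 effective-date handling on merchant_dim is omitted here):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("fraud-feature-batch").getOrCreate()

# Read the last 24h of hourly JSON landings (path and date are illustrative).
raw = spark.read.json("s3://lake/raw/transactions/event_date=2024-11-29/")

# Filter/clean and derive the partition column.
clean = (raw
    .filter(F.col("amount_usd").isNotNull())
    .withColumn("event_date", F.to_date("event_ts")))

# Dimension joins (SCD2 effective-date logic on merchant_dim omitted).
merchant_dim = spark.read.parquet("s3://lake/dims/merchant_dim/")        # ~2M rows
device_rep   = spark.read.parquet("s3://lake/dims/device_reputation/")   # ~500M rows
enriched = (clean
    .join(merchant_dim, "merchant_id", "left")
    .join(device_rep, "device_id", "left"))

# Merchant-day feature aggregate (the user-day table is analogous).
merchant_day = (enriched
    .groupBy("merchant_id", "event_date")
    .agg(F.count("*").alias("txn_count"),
         F.sum("amount_usd").alias("gross_usd"),
         F.avg("risk_score").alias("avg_risk_score")))

merchant_day.write.mode("overwrite").partitionBy("event_date") \
    .parquet("s3://lake/features/merchant_day/")
```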
Scale Requirements
- Input volume: ~6 TB/day raw JSON (avg), up to 12 TB/day during peak shopping periods
- Event count: ~25–50 billion rows/day
- Peak partition skew: top 0.1% of merchants generate ~15% of events
- SLA: job must complete in < 3 hours (02:00–05:00 UTC window)
- Cluster: 200 executors typical (each 8 vCPU, 32 GB RAM), autoscaling allowed but cost-capped
- Cost constraint: incremental compute increase must be < 20% month-over-month
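A rough back-of-the-envelope on these numbers helps frame the memory question; all figures below are assumptions for illustration, not measurements:

```python
# Illustrative sizing arithmetic, not measured values.
executors, ram_gb = 200, 32
total_exec_ram_tb = executors * ram_gb / 1024        # ~6.25 TB across the cluster

raw_tb_per_day = 6                                    # ~12 TB at peak
# A day of raw JSON is roughly the size of all executor RAM combined, so the
# job cannot rely on everything fitting in memory; shuffles must spill cleanly.

shuffle_tb = 3                # assumed post-filter/projection shuffle volume
target_partition_mb = 128     # common shuffle-partition size target
shuffle_partitions = shuffle_tb * 1024 * 1024 // target_partition_mb
print(total_exec_ram_tb, shuffle_partitions)          # ~6.25 TB, ~24,576 partitions
```

The takeaway is only that a statically chosen shuffle partition count tuned for an average day will be badly off on a 12 TB peak day, which is why runtime-adaptive partition handling comes up later.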
Data Characteristics
Core fact table (transactions)
| field | type | notes |
|---|---|---|
| event_ts | timestamp | event time, can be late up to 48h |
| event_date | date | derived from event_ts |
| merchant_id | string | highly skewed |
| user_id | string | high cardinality |
| device_id | string | high cardinality |
| amount_usd | decimal(18,2) | |
| payment_method | string | small domain |
| ip | string | |
| risk_score | double | produced upstream |
Dimensions
merchant_dim (~2M rows, SCD2, updated daily)
device_reputation (~500M rows, updated hourly)
chargeback_outcomes (~200M rows, late arriving up to 30 days)
Known quality issues
- Late-arriving chargebacks cause backfills
- Occasional duplicate events (same event_id); a dedup sketch follows this list
- Schema evolution in raw JSON (new fields)
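Two of these issues lend themselves to standard PySpark mitigations. A hedged sketch, in which the schema, path, and session are assumptions:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# Schema evolution: read with an explicit, pinned schema so newly added raw JSON
# fields are ignored until the pipeline is ready for them (schema is illustrative).
expected_schema = StructType([
    StructField("event_id", StringType()),
    StructField("event_ts", TimestampType()),
    StructField("merchant_id", StringType()),
    # ...remaining fields from the fact-table schema above
])
raw = spark.read.schema(expected_schema).json("s3://lake/raw/transactions/")  # path assumed

# Duplicates: keep exactly one row per event_id, chosen deterministically
# (latest event_ts, stable tiebreaker), so reruns produce identical output.
w = Window.partitionBy("event_id").orderBy(F.col("event_ts").desc(), F.col("merchant_id"))
deduped = (raw
    .withColumn("rn", F.row_number().over(w))
    .filter(F.col("rn") == 1)
    .drop("rn"))
```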
The Failure Symptom
Airflow logs show the Spark application failing with errors like:
- java.lang.OutOfMemoryError: Java heap space
- org.apache.spark.shuffle.FetchFailedException followed by executor loss
- Spikes in shuffleBytesWritten and long GC pauses
The failure correlates with days when a small number of merchants have extreme traffic spikes.
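One cheap way to confirm that hypothesis, before touching the production job, is a key-distribution check on a suspect day read directly from the landed JSON (path, date, and top-N cutoff below are illustrative), paired with the Spark UI's per-task shuffle read sizes for the failing stage:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("merchant-skew-check").getOrCreate()

# How concentrated is one day's traffic in its hottest merchants?
events = spark.read.json("s3://lake/raw/transactions/event_date=2024-11-29/")
total = events.count()

top = (events.groupBy("merchant_id").count()
             .orderBy(F.col("count").desc())
             .limit(50))

for row in top.collect():
    share = 100.0 * row["count"] / total
    print(f'{row["merchant_id"]}: {row["count"]:,} events ({share:.2f}% of the day)')
```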
Your Task (what you must answer in the interview)
Provide a structured plan to debug and fix the shuffle OOM, and propose a production-ready remediation that prevents recurrence.
1) Debugging approach (be specific)
Explain how you would:
- Identify the exact stage(s) and transformation(s) causing the shuffle blow-up (e.g., wide joins, aggregations, repartitions).
- Use Spark UI / event logs to confirm whether the root cause is:
  - skewed keys,
  - too many shuffle partitions / too few,
  - oversized records,
  - spill-to-disk misconfiguration,
  - executor memory vs. overhead imbalance,
  - broadcast join failures,
  - or upstream data anomalies.
- Reproduce the issue safely (sampling strategy, replaying a problematic partition/day) without burning the full cluster.
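For the last point, one hedged approach is to replay just the problematic event_date on a small, capped session, keeping the hot merchants intact (so the skew survives) while sampling the long tail; every path, number, and threshold below is illustrative:

```python
from pyspark.sql import SparkSession, functions as F

# Small, capped session for reproduction; values are illustrative, not tuned.
spark = (SparkSession.builder
    .appName("shuffle-oom-repro")
    .config("spark.dynamicAllocation.maxExecutors", "20")
    .config("spark.sql.shuffle.partitions", "2000")
    .getOrCreate())

events = spark.read.json("s3://lake/raw/transactions/event_date=2024-11-29/")

# Keep all traffic for the hottest merchants so the skew is preserved,
# and take a small fraction of everything else.
hot = [r["merchant_id"] for r in
       (events.groupBy("merchant_id").count()
              .orderBy(F.col("count").desc())
              .limit(100)
              .collect())]

repro_input = (events.filter(F.col("merchant_id").isin(hot))
    .unionByName(events.filter(~F.col("merchant_id").isin(hot))
                       .sample(fraction=0.05, seed=42)))
# ...then run the same join/aggregate code path against repro_input.
```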
2) Remediation design
Propose changes across code, configuration, and data layout. Your solution should include:
- Concrete Spark changes (e.g., adaptive query execution, skew join handling, salting strategy, map-side aggregation, avoiding unnecessary repartitions, broadcast hints with thresholds); a hedged configuration sketch follows this list.
- Resource tuning (executor sizing, memoryOverhead, shuffle service, local disk sizing, Kubernetes specifics).
- Data layout improvements (partitioning strategy, bucketing, file sizing, compaction) to reduce shuffle and improve join locality.
- Handling late-arriving and backfill scenarios without reintroducing OOM risk.
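To make the first bullet concrete, a hedged sketch of the usual Spark 3.4 levers; the thresholds and salt width are illustrative starting points, and events / merchant_dim stand in for the real DataFrames built earlier in the job:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()
# events / merchant_dim: the cleaned fact DataFrame and merchant dimension (assumed).

# Adaptive query execution: let Spark split skewed shuffle partitions and
# coalesce tiny ones at runtime (thresholds illustrative).
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionFactor", "5")
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "256m")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "128m")

# Small dimension (~2M rows): broadcast it so the join never shuffles the fact table.
enriched = events.join(broadcast(merchant_dim), "merchant_id", "left")

# If AQE alone is not enough for the merchant-day aggregation, salt the hot key:
# aggregate per (merchant, date, salt) first so no single task owns a hot
# merchant's whole day, then roll the partials up. The final result is unchanged.
N = 32  # illustrative salt width
partial = (enriched
    .withColumn("salt", (F.rand(seed=7) * N).cast("int"))
    .groupBy("merchant_id", "event_date", "salt")
    .agg(F.count("*").alias("cnt"), F.sum("amount_usd").alias("amt")))
merchant_day = (partial
    .groupBy("merchant_id", "event_date")
    .agg(F.sum("cnt").alias("txn_count"), F.sum("amt").alias("gross_usd")))
```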
3) Operationalization
- Airflow changes (retries, backoff, circuit breakers, automatic rollback to a safe mode); a sketch of the retry/SLA guardrails follows this list.
- Monitoring/alerting: what metrics you’d add, where, and what thresholds would page on-call.
- A validation plan to ensure correctness (no silent data loss/duplication) and performance regression testing.
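For the Airflow bullet, a hedged sketch of the guardrails; operator choice, ids, paths, and thresholds are illustrative, and a circuit breaker could be a preceding short-circuit task that checks input volume before launching Spark:

```python
import pendulum
from datetime import timedelta
from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

# Bounded retries with backoff, a hard timeout inside the 02:00-05:00 UTC window,
# and an SLA that pages before the 06:00 UTC freshness deadline. Values illustrative.
with DAG(
    dag_id="fraud_feature_batch",
    schedule_interval="0 2 * * *",
    start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
    catchup=False,
    default_args={
        "retries": 2,
        "retry_delay": timedelta(minutes=10),
        "retry_exponential_backoff": True,
        "execution_timeout": timedelta(hours=3),
    },
) as dag:
    build_features = SparkSubmitOperator(
        task_id="build_feature_tables",
        application="/opt/jobs/build_features.py",   # path assumed
        conf={"spark.sql.adaptive.enabled": "true"},
        sla=timedelta(hours=3, minutes=30),           # alert well before the 06:00 UTC SLA
    )
```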
Constraints
- Must remain on Spark 3.4 and the existing lakehouse storage (Parquet on object storage).
- You may introduce one additional managed component (e.g., a metrics store, a shuffle service, or a table format) if justified.
- Compliance: audit tables must be reproducible; transformations must be deterministic and idempotent (a write-mode sketch follows this list).
- Team skillset: strong Spark + Airflow; moderate Kubernetes; limited Scala (mostly PySpark).
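On the compliance constraint, one common way to keep reruns and Airflow retries idempotent with plain Parquet (no new component) is dynamic partition overwrite, so a retried run replaces exactly the event_date partitions it recomputes instead of appending duplicates; a table format such as Delta or Iceberg would be the natural candidate for the one allowed extra component if atomic commits are also required. A minimal sketch, with names and paths assumed:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Overwrite only the partitions present in this run's output; unrelated
# event_date partitions already in the table are left untouched.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

(merchant_day.write            # merchant_day: the aggregated feature DataFrame (assumed)
    .mode("overwrite")
    .partitionBy("event_date")
    .parquet("s3://lake/features/merchant_day/"))
```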
What we’re evaluating
- Your ability to reason from symptoms to root cause using Spark internals
- Practical mitigation strategies for shuffle-heavy workloads at scale
- Production readiness: observability, failure modes, and cost-aware tuning
- Data correctness under retries, late data, and backfills