You’re interviewing for a Senior Data Engineer role on the Risk & Reconciliation platform at PayWave, a fintech that processes card payments and instant bank transfers across North America and Europe. PayWave has 18M monthly active customers and 220K merchants. The company’s revenue recognition and chargeback workflows depend on a daily “ledger truth” dataset that reconciles authorization, capture, refund, dispute, and settlement events.
Today, PayWave runs a large daily Spark batch job that ingests raw events from an S3-based data lake, performs deduplication and enrichment, and loads curated tables into Snowflake for Finance and Risk analysts. Over the last quarter, the job runtime has regressed from 2.5 hours to 7–9 hours, repeatedly missing the 06:00 UTC reporting cutoff. When the cutoff is missed, the Finance team delays settlement reporting and the Risk team loses the ability to detect anomalous merchant behavior before morning processing—creating direct regulatory and fraud exposure.
The on-call notes show frequent executor OOMs, long GC pauses, and stages with extreme skew (single tasks running 30–60 minutes). The pipeline also struggles with late-arriving settlement files (up to 48 hours late) and duplicate events from upstream retries. You are asked to propose a concrete optimization plan and the production-ready changes you’d make.
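Since Spark 3.3 enables adaptive query execution (AQE) by default, one early check is whether its skew-join handling is actually splitting the oversized shuffle partitions produced by the hottest merchants. The sketch below shows the relevant settings only as an illustration; the application name and threshold values are assumptions, not PayWave's tuned configuration.

```python
# Minimal sketch: making Spark 3.3 AQE skew handling explicit.
# AQE is on by default in Spark 3.x; the factor/threshold values below are
# illustrative starting points, not production-tuned numbers.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("ledger-truth-daily")  # hypothetical job name
    .config("spark.sql.adaptive.enabled", "true")
    .config("spark.sql.adaptive.skewJoin.enabled", "true")
    # A shuffle partition is treated as skewed if it is 5x the median size
    # and larger than 256 MB; AQE then splits it into smaller tasks.
    .config("spark.sql.adaptive.skewJoin.skewedPartitionFactor", "5")
    .config("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "256m")
    # Target size for post-shuffle partitions when AQE coalesces or splits them.
    .config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "128m")
    .getOrCreate()
)
```

The table below summarizes the current stack and its pain points.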
| Layer | Technology | Details | Pain Points |
|---|---|---|---|
| Storage (raw) | Amazon S3 | JSON + some Parquet, partitioned by dt | Small files, mixed formats, inconsistent partitioning |
| Processing | Spark 3.3 on EMR | Single monolithic job, dynamic allocation enabled | Skewed shuffles, OOM, unpredictable runtime |
| Orchestration | Airflow 2.x | One DAG task triggers entire job | Hard to retry partially; poor observability |
| Warehouse | Snowflake | COPY INTO from S3 stage | Loads are slow; micro-partitions poorly clustered |
| Transformations | Some SQL in Spark | Business logic embedded in Spark code | Hard to test; brittle schema evolution |
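The small-files and mixed-format issues in the raw layer compound the compute problems, because every run pays the cost of listing and parsing huge numbers of tiny JSON objects. As an illustration only, a compaction pass over one event type might look like the sketch below; the bucket paths, ingest-day layout, and file-count target are assumptions rather than PayWave's actual layout.

```python
# Illustrative sketch: compact one ingest day of raw JSON auth events into
# larger Parquet files partitioned by event-time date (dt).
# Paths and the repartition target are placeholders, not real values.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("raw-compaction").getOrCreate()

# Read one arrival day's worth of small JSON files (hypothetical layout).
raw = spark.read.json("s3://paywave-raw/auth_events/2024-01-15/")

(
    raw
    # Derive the partition column consistently from event time.
    .withColumn("dt", F.to_date("event_ts"))
    # Aim for a handful of large files per partition instead of thousands of
    # tiny objects; 16 is an illustrative target, tuned to daily volume.
    .repartition(16, "dt")
    .write
    # Note: a real job would scope the overwrite, e.g. with
    # spark.sql.sources.partitionOverwriteMode=dynamic.
    .mode("overwrite")
    .partitionBy("dt")
    .parquet("s3://paywave-curated/auth_events/")  # hypothetical output path
)
```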
Source event types:

- auth_events (authorizations)
- capture_events (captures)
- refund_events
- dispute_events
- settlement_files (batch files from banks)

Key fields:

- event_id (string, globally unique but duplicates exist due to retries)
- txn_id (string, joins across event types)
- merchant_id (string, highly skewed: top 50 merchants generate ~35% of volume)
- event_ts (timestamp, event time)
- ingest_ts (timestamp, arrival time)
- amount, currency, country, payment_method

Known issues:

- event_id repeated with identical payload
- event_ts in the past, ingest_ts much later
- merchant_id and txn_id cause large shuffles

Target output tables:

- ledger_transactions (one row per txn_id with latest state)
- ledger_entries (double-entry accounting lines)
- merchant_daily_rollups (aggregates by merchant/day/currency)

Design an optimization plan and propose specific implementation changes.
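To make the deduplication and latest-state requirements concrete, here is a minimal PySpark sketch of two core steps: dropping retried event_id duplicates and collapsing events to one latest-state row per txn_id for ledger_transactions. The input path, the assumption that all event types share a unioned schema, and the tie-breaking rules (earliest ingest_ts per event_id, most recent event_ts per txn_id) are illustrative choices, not the actual business logic.

```python
# Illustrative sketch of two core steps of the ledger-truth build:
# 1) drop duplicate event_id rows produced by upstream retries, and
# 2) keep one latest-state row per txn_id for ledger_transactions.
# The input path and tie-breaking rules are assumptions for illustration.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("ledger-truth-sketch").getOrCreate()

# Assumes the per-type event tables have already been unioned to one schema.
events = spark.read.parquet("s3://paywave-curated/all_events/")  # hypothetical path

# 1) Deduplicate retries: keep exactly one row per event_id
#    (earliest arrival wins, since retried payloads are identical).
by_event = Window.partitionBy("event_id").orderBy(F.col("ingest_ts").asc())
deduped = (
    events
    .withColumn("rn", F.row_number().over(by_event))
    .filter(F.col("rn") == 1)
    .drop("rn")
)

# 2) Latest state per transaction: the most recent event_ts wins per txn_id.
by_txn = Window.partitionBy("txn_id").orderBy(F.col("event_ts").desc())
ledger_transactions = (
    deduped
    .withColumn("rn", F.row_number().over(by_txn))
    .filter(F.col("rn") == 1)
    .drop("rn")
)
```

Both window operations shuffle on keys the scenario flags as hot, so in practice they would be paired with the skew-handling settings sketched earlier.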