You’re interviewing for a Senior Data Engineer role on the Risk & Reconciliation platform at PayWave, a fintech that processes card payments and instant bank transfers across North America and Europe. PayWave has 18M monthly active customers and 220K merchants. The company’s revenue recognition and chargeback workflows depend on a daily “ledger truth” dataset that reconciles authorization, capture, refund, dispute, and settlement events.
Today, PayWave runs a large daily Spark batch job that ingests raw events from an S3-based data lake, performs deduplication and enrichment, and loads curated tables into Snowflake for Finance and Risk analysts. Over the last quarter, the job runtime has regressed from 2.5 hours to 7–9 hours, repeatedly missing the 06:00 UTC reporting cutoff. When the cutoff is missed, the Finance team delays settlement reporting and the Risk team loses the ability to detect anomalous merchant behavior before morning processing—creating direct regulatory and fraud exposure.
The on-call notes show frequent executor OOMs, long GC pauses, and stages with extreme skew (single tasks running 30–60 minutes). The pipeline also struggles with late-arriving settlement files (up to 48 hours late) and duplicate events from upstream retries. You are asked to propose a concrete optimization plan and the production-ready changes you’d make.
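Since Spark 3.3 enables adaptive query execution (AQE) by default, one early check is whether its skew-join handling is actually splitting the oversized shuffle partitions produced by the hottest merchants. The sketch below shows the relevant settings only as an illustration; the application name and threshold values are assumptions, not PayWave's tuned configuration.

```python
# Minimal sketch: making Spark 3.3 AQE skew handling explicit.
# AQE is on by default in Spark 3.x; the factor/threshold values below are
# illustrative starting points, not production-tuned numbers.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("ledger-truth-daily")  # hypothetical job name
    .config("spark.sql.adaptive.enabled", "true")
    .config("spark.sql.adaptive.skewJoin.enabled", "true")
    # A shuffle partition is treated as skewed if it is 5x the median size
    # and larger than 256 MB; AQE then splits it into smaller tasks.
    .config("spark.sql.adaptive.skewJoin.skewedPartitionFactor", "5")
    .config("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "256m")
    # Target size for post-shuffle partitions when AQE coalesces or splits them.
    .config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "128m")
    .getOrCreate()
)
```

The table below summarizes the current stack and its pain points.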
| Layer | Technology | Details | Pain Points |
|---|---|---|---|
| Storage (raw) | Amazon S3 | JSON + some Parquet, partitioned by dt | Small files, mixed formats, inconsistent partitioning |
| Processing | Spark 3.3 on EMR | Single monolithic job, dynamic allocation enabled | Skewed shuffles, OOM, unpredictable runtime |
| Orchestration | Airflow 2.x | One DAG task triggers entire job | Hard to retry partially; poor observability |
| Warehouse | Snowflake | COPY INTO from S3 stage | Loads are slow; micro-partitions poorly clustered |
| Transformations | Some SQL in Spark | Business logic embedded in Spark code | Hard to test; brittle schema evolution |
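The small-files and mixed-format issues in the raw layer compound the compute problems, because every run pays the cost of listing and parsing huge numbers of tiny JSON objects. As an illustration only, a compaction pass over one event type might look like the sketch below; the bucket paths, ingest-day layout, and file-count target are assumptions rather than PayWave's actual layout.

```python
# Illustrative sketch: compact one ingest day of raw JSON auth events into
# larger Parquet files partitioned by event-time date (dt).
# Paths and the repartition target are placeholders, not real values.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("raw-compaction").getOrCreate()

# Read one arrival day's worth of small JSON files (hypothetical layout).
raw = spark.read.json("s3://paywave-raw/auth_events/2024-01-15/")

(
    raw
    # Derive the partition column consistently from event time.
    .withColumn("dt", F.to_date("event_ts"))
    # Aim for a handful of large files per partition instead of thousands of
    # tiny objects; 16 is an illustrative target, tuned to daily volume.
    .repartition(16, "dt")
    .write
    # Note: a real job would scope the overwrite, e.g. with
    # spark.sql.sources.partitionOverwriteMode=dynamic.
    .mode("overwrite")
    .partitionBy("dt")
    .parquet("s3://paywave-curated/auth_events/")  # hypothetical output path
)
```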
Source event types:

- auth_events (authorizations)
- capture_events (captures)
- refund_events
- dispute_events
- settlement_files (batch files from banks)

Key fields:

- event_id (string, globally unique but duplicates exist due to retries)
- txn_id (string, joins across event types)
- merchant_id (string, highly skewed: top 50 merchants generate ~35% of volume)
- event_ts (timestamp, event time)
- ingest_ts (timestamp, arrival time)
- amount, currency, country, payment_method

Known issues:

- event_id repeated with identical payload
- event_ts in the past, ingest_ts much later
- merchant_id and txn_id cause large shuffles

Target output tables:

- ledger_transactions (one row per txn_id with latest state)
- ledger_entries (double-entry accounting lines)
- merchant_daily_rollups (aggregates by merchant/day/currency)

Design an optimization plan and propose specific implementation changes.
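To make the deduplication and latest-state requirements concrete, here is a minimal PySpark sketch of two core steps: dropping retried event_id duplicates and collapsing events to one latest-state row per txn_id for ledger_transactions. The input path, the assumption that all event types share a unioned schema, and the tie-breaking rules (earliest ingest_ts per event_id, most recent event_ts per txn_id) are illustrative choices, not the actual business logic.

```python
# Illustrative sketch of two core steps of the ledger-truth build:
# 1) drop duplicate event_id rows produced by upstream retries, and
# 2) keep one latest-state row per txn_id for ledger_transactions.
# The input path and tie-breaking rules are assumptions for illustration.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("ledger-truth-sketch").getOrCreate()

# Assumes the per-type event tables have already been unioned to one schema.
events = spark.read.parquet("s3://paywave-curated/all_events/")  # hypothetical path

# 1) Deduplicate retries: keep exactly one row per event_id
#    (earliest arrival wins, since retried payloads are identical).
by_event = Window.partitionBy("event_id").orderBy(F.col("ingest_ts").asc())
deduped = (
    events
    .withColumn("rn", F.row_number().over(by_event))
    .filter(F.col("rn") == 1)
    .drop("rn")
)

# 2) Latest state per transaction: the most recent event_ts wins per txn_id.
by_txn = Window.partitionBy("txn_id").orderBy(F.col("event_ts").desc())
ledger_transactions = (
    deduped
    .withColumn("rn", F.row_number().over(by_txn))
    .filter(F.col("rn") == 1)
    .drop("rn")
)
```

Both window operations shuffle on keys the scenario flags as hot, so in practice they would be paired with the skew-handling settings sketched earlier.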