Context
Meta’s Ads Insights platform runs a daily batch pipeline to transform raw ad impression, click, and conversion logs into reporting tables consumed by internal analytics surfaces. The current pipeline is orchestrated in Apache Airflow, executes Spark SQL and Presto/Hive jobs on a Hadoop data lake, and writes curated datasets to Hive tables used by downstream dashboards.
A critical DAG that previously finished in 70 minutes now takes 3.5-4 hours, causing missed SLAs for morning reporting. You are asked to design a systematic debugging and remediation approach for the slow pipeline, not just optimize a single query.
Scale Requirements
- Input volume: 22 TB/day of compressed logs across impressions, clicks, and conversions
- Daily rows: ~45 billion records
- Peak partition size: 1.8 TB for a single ds partition
- SLA: Curated tables available by 06:00 PT
- Current runtime: 210-240 minutes; target runtime < 90 minutes
- Cluster: 120-worker Spark/YARN cluster, 64 vCPU and 256 GB RAM per worker
- Retention: 180 days raw, 2 years aggregated
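As a rough sanity check on these figures, the sketch below works out the sustained read throughput the target runtime implies. It assumes decimal terabytes and a single pass over the compressed input, neither of which the brief states, so the result is a lower bound rather than a sizing estimate. The function name required_throughput is illustrative.

```python
# Back-of-the-envelope throughput implied by the stated scale figures.
# Assumptions (not from the brief): decimal TB, one pass over the input.
TB = 10**12

def required_throughput(input_bytes, minutes, workers):
    """Return (cluster bytes/s, per-worker bytes/s) needed to read
    input_bytes once within the given number of minutes."""
    seconds = minutes * 60
    cluster_bps = input_bytes / seconds
    return cluster_bps, cluster_bps / workers

if __name__ == "__main__":
    input_bytes = 22 * TB      # 22 TB/day of compressed logs
    rows = 45e9                # ~45 billion records/day
    cluster_bps, worker_bps = required_throughput(input_bytes, 90, 120)
    print(f"cluster:    {cluster_bps / 1e9:.2f} GB/s")    # about 4.07 GB/s
    print(f"per worker: {worker_bps / 1e6:.1f} MB/s")     # about 34 MB/s
    print(f"bytes/row:  {input_bytes / rows:.0f}")        # about 489 compressed
```

A per-worker rate in the tens of MB/s is small for a 64-vCPU node, which suggests that raw scan bandwidth alone is unlikely to explain the regression and that shuffle, skew, file layout, or scheduling are more plausible places to look.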
Requirements
- Propose an end-to-end debugging plan to isolate whether the slowdown is caused by orchestration, ingestion, skewed joins, small files, resource contention, or downstream writes.
- Define what metrics, logs, and lineage you would inspect first in Airflow, Spark UI, Hive Metastore, and storage.
- Design a repeatable method to compare current vs historical runs and identify regression points.
- Recommend concrete fixes for at least query planning, partitioning, file layout, and cluster resource allocation.
- Include data quality protections so performance fixes do not silently change row counts or business metrics.
- Explain how you would safely roll out improvements and validate SLA recovery.
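One way to make the current-versus-historical comparison repeatable is to diff per-task durations between a known-good run and a slow run, then rank tasks by absolute slowdown. The sketch below assumes the durations have already been exported (for example from Airflow task-instance records) into plain dictionaries keyed by task ID. The task names, the 60-second noise threshold, and the helper names are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TaskRegression:
    task_id: str
    baseline_s: float
    current_s: float

    @property
    def delta_s(self) -> float:
        return self.current_s - self.baseline_s

    @property
    def ratio(self) -> float:
        # Guard against zero-length baseline tasks.
        return self.current_s / self.baseline_s if self.baseline_s > 0 else float("inf")

def rank_regressions(baseline, current, min_delta_s=60.0):
    """Rank tasks present in both runs by absolute slowdown, largest first.
    Tasks that slowed by less than min_delta_s are treated as noise."""
    shared = sorted(baseline.keys() & current.keys())
    rows = [TaskRegression(t, baseline[t], current[t]) for t in shared]
    rows = [r for r in rows if r.delta_s >= min_delta_s]
    return sorted(rows, key=lambda r: r.delta_s, reverse=True)

def new_tasks(baseline, current):
    """Tasks that exist only in the current run (possible DAG changes)."""
    return sorted(current.keys() - baseline.keys())

if __name__ == "__main__":
    # Hypothetical durations in seconds, for illustration only.
    baseline = {"load_impressions": 600, "join_clicks": 900, "write_curated": 300}
    current = {"load_impressions": 640, "join_clicks": 7200,
               "write_curated": 1500, "compact_files": 200}
    for r in rank_regressions(baseline, current):
        print(f"{r.task_id}: +{r.delta_s:.0f}s ({r.ratio:.1f}x)")
    print("new tasks:", new_tasks(baseline, current))
```

Ranking by absolute delta rather than ratio keeps attention on the tasks that account for most of the lost wall-clock time. The same diff can be repeated one level down on Spark stage durations once the regressed task is known.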
Constraints
- Must continue using Airflow, Spark, Hive, and Presto in the near term
- No full platform migration during the incident window
- Changes must preserve existing table schemas and downstream compatibility
- Cost increase should stay under 15% of current monthly compute spend
- Backfills for the last 7 days may be required after remediation
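The schema and cost constraints lend themselves to a simple pre-rollout gate. The sketch below is one possible shape for such a check, assuming schemas are available as ordered lists of (column name, type) pairs and that monthly compute spend can be projected as a single number. Both helper names and the input format are assumptions, not part of the brief.

```python
from typing import Sequence, Tuple

Column = Tuple[str, str]  # (column name, type string), in table order

def schema_preserved(before: Sequence[Column], after: Sequence[Column]) -> bool:
    """True only if column names, types, and order are all identical.
    Comparison is case-insensitive, since Hive treats identifiers that way."""
    def norm(cols):
        return [(name.lower(), dtype.lower()) for name, dtype in cols]
    return norm(before) == norm(after)

def cost_within_budget(current_monthly: float, projected_monthly: float,
                       max_increase: float = 0.15) -> bool:
    """True if the projected relative increase stays under max_increase."""
    if current_monthly <= 0:
        raise ValueError("current_monthly must be positive")
    return (projected_monthly - current_monthly) / current_monthly < max_increase

if __name__ == "__main__":
    # Hypothetical schema and spend figures, for illustration only.
    before = [("ad_id", "bigint"), ("impressions", "bigint"), ("ds", "string")]
    after = [("ad_id", "BIGINT"), ("impressions", "bigint"), ("ds", "string")]
    print(schema_preserved(before, after))       # True: only case differs
    print(cost_within_budget(100.0, 110.0))      # True: 10% increase
    print(cost_within_budget(100.0, 120.0))      # False: 20% increase
```

Treating column order as significant is deliberately strict, on the assumption that some downstream consumers may read columns positionally.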