Context
At Meta, the Ads Insights team stores large fact tables in a Hive-compatible data lake and processes them with Apache Spark. A daily reporting pipeline for impression, click, and conversion events has become slow and expensive because the tables are laid out inconsistently across date, region, and advertiser dimensions.
You are asked to redesign the storage layout and downstream ETL strategy, and clearly explain the difference between partitioning and bucketing in Hive or Spark. Your design should show when each technique is appropriate, how they interact with query patterns, and how to avoid small-file and skew issues.
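For orientation: partitioning determines the directory layout that engines prune at planning time, while bucketing fixes a file-level hash distribution that lets co-bucketed joins skip a shuffle. A minimal sketch with Spark's DataFrameWriter follows; the table names, bucket count, and column choices are illustrative, not the expected design:

```python
# Minimal sketch: partitioning vs. bucketing via Spark's DataFrameWriter.
# Assumes a Hive-enabled SparkSession; all names are illustrative.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("layout-sketch")
    .enableHiveSupport()
    .getOrCreate()
)

events = spark.table("raw.ad_events")  # hypothetical raw table

# Partitioning: one directory per (event_date, region) value, so filters
# on those columns prune whole directories before any file is opened.
(events.write
    .mode("overwrite")
    .partitionBy("event_date", "region")
    .format("parquet")
    .saveAsTable("curated.fact_ad_events_by_date_region"))

# Bucketing: rows are hashed on advertiser_id into a fixed number of
# buckets per partition; a join against a dimension table bucketed the
# same way on the same key can avoid a full shuffle.
(events.write
    .mode("overwrite")
    .partitionBy("event_date")
    .bucketBy(256, "advertiser_id")
    .sortBy("advertiser_id")
    .format("parquet")
    .saveAsTable("curated.fact_ad_events_bucketed"))
```

One caveat worth designing around: Spark's bucketing metadata is not the same as Hive's, so Hive or Presto/Trino readers may see plain files rather than exploitable buckets; the proposal should state which engine's bucketing it relies on.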
Scale Requirements
- Input volume: 12B events/day
- Raw size: ~25 TB/day of compressed Parquet
- Peak batch window: 3 hours for daily backfill + compaction
- Query SLA: common analyst queries under 30 seconds
- Retention: 180 days hot, 2 years cold
- Join pattern: the daily fact table joins with 200M-row advertiser and campaign dimension tables
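Taken together, these volumes fix a file-count budget before any design choice is made. A back-of-envelope sketch (the 512 MB target file size and the five-region split are assumptions for illustration):

```python
# Back-of-envelope file-count budget from the stated volumes.
# TARGET_FILE_MB and REGIONS are assumed values, not givens.
RAW_TB_PER_DAY = 25
TARGET_FILE_MB = 512      # assumed post-compaction target
REGIONS = 5               # hypothetical region count

raw_mb = RAW_TB_PER_DAY * 1024 * 1024             # 26,214,400 MB/day
files_per_day = raw_mb // TARGET_FILE_MB          # 51,200 files/day
files_per_date_region = files_per_day // REGIONS  # 10,240 per partition

print(f"{files_per_day:,} files/day; {files_per_date_region:,} per (date, region)")
# Over the 180-day hot tier this is ~9.2M files, which is why the
# requirements below call out compaction explicitly.
```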
Requirements
- Design a batch ETL pipeline in Spark that ingests raw event logs, validates schema, and writes curated Hive tables.
- Explain the difference between partitioning and bucketing in terms of physical layout, pruning, shuffle reduction, and operational trade-offs.
- Propose a table design for a large fact table and at least one dimension table, including which columns to partition and/or bucket on.
- Show how your design supports common filters on event_date and region, and joins on advertiser_id.
- Include compaction, backfill, and idempotent rerun behavior (a minimal write sketch follows this list).
- Define monitoring for data quality, partition health, and job performance.
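As a reference point for the compaction and rerun bullets above, here is a minimal sketch of an idempotent daily write using Spark's dynamic partition overwrite; the table name, path, abridged validation, and repartition factor are all illustrative:

```python
# Sketch: idempotent daily run. With dynamic partition overwrite, a
# rerun for a date replaces exactly that date's partitions and nothing
# else. All names and tuning values are illustrative.
from pyspark.sql import SparkSession, functions as F

spark = (
    SparkSession.builder
    .appName("daily-ads-etl")
    .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
    .enableHiveSupport()
    .getOrCreate()
)

def run_day(event_date: str) -> None:
    raw = spark.read.parquet(f"/lake/raw/ad_events/dt={event_date}")  # hypothetical path

    curated = (raw
        .where(F.col("advertiser_id").isNotNull())   # abridged schema/quality gate
        .withColumn("event_date", F.lit(event_date)))

    # Bound the number of output files per run so reruns and late-data
    # rewrites never multiply small files; 200 is an assumed tuning value.
    (curated
        .repartition(200)
        .write
        .mode("overwrite")   # dynamic mode: only the touched partitions are replaced
        # insertInto is positional: columns must match the target table,
        # with partition columns last.
        .insertInto("curated.fact_ad_events"))
```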
Constraints
- Storage format must remain Hive/Spark compatible and queryable from Presto/Trino-style engines used internally at Meta.
- Budget pressure requires minimizing unnecessary shuffle and file counts.
- Late-arriving data up to 48 hours old must be supported (see the sketch after this list).
- Schema evolution must be backward compatible for downstream consumers.
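The 48-hour constraint composes naturally with an idempotent daily write: each run simply recomputes a trailing window of dates. A sketch, reusing the hypothetical run_day helper from the Requirements section:

```python
# Sketch: late-data handling under the 48-hour rule. Each run rewrites
# the run date plus the two prior dates; correctness relies on run_day
# (the hypothetical helper sketched earlier) being idempotent.
from datetime import date, timedelta

def dates_to_process(run_date: date, late_window_days: int = 2) -> list[str]:
    """Run date plus the trailing late-arrival window, newest first."""
    return [
        (run_date - timedelta(days=d)).isoformat()
        for d in range(late_window_days + 1)
    ]

# e.g. the 2024-06-10 run recomputes 2024-06-10, 2024-06-09, 2024-06-08
for dt in dates_to_process(date(2024, 6, 10)):
    run_day(dt)  # idempotent daily write sketched above
```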