Context
At Meta, the Ads Insights team stores large fact tables in a Hive-compatible data lake and processes them with Apache Spark. A daily reporting pipeline for impression, click, and conversion events has become slow and expensive because the tables are laid out inconsistently across date, region, and advertiser dimensions.
You are asked to redesign the storage layout and downstream ETL strategy, and clearly explain the difference between partitioning and bucketing in Hive or Spark. Your design should show when each technique is appropriate, how they interact with query patterns, and how to avoid small-file and skew issues.
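For orientation: partitioning determines the directory layout that engines prune at planning time, while bucketing fixes a file-level hash distribution that lets co-bucketed joins skip a shuffle. A minimal sketch with Spark's DataFrameWriter follows; the table names, bucket count, and column choices are illustrative, not the expected design:

```python
# Minimal sketch: partitioning vs. bucketing via Spark's DataFrameWriter.
# Assumes a Hive-enabled SparkSession; all names are illustrative.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("layout-sketch")
    .enableHiveSupport()
    .getOrCreate()
)

events = spark.table("raw.ad_events")  # hypothetical raw table

# Partitioning: one directory per (event_date, region) value, so filters
# on those columns prune whole directories before any file is opened.
(events.write
    .mode("overwrite")
    .partitionBy("event_date", "region")
    .format("parquet")
    .saveAsTable("curated.fact_ad_events_by_date_region"))

# Bucketing: rows are hashed on advertiser_id into a fixed number of
# buckets per partition; a join against a dimension table bucketed the
# same way on the same key can avoid a full shuffle.
(events.write
    .mode("overwrite")
    .partitionBy("event_date")
    .bucketBy(256, "advertiser_id")
    .sortBy("advertiser_id")
    .format("parquet")
    .saveAsTable("curated.fact_ad_events_bucketed"))
```

One caveat worth designing around: Spark's bucketing metadata is not the same as Hive's, so Hive or Presto/Trino readers may see plain files rather than exploitable buckets; the proposal should state which engine's bucketing it relies on.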
Scale Requirements
- Input volume: 12B events/day
- Raw size: ~25 TB/day of compressed Parquet
- Peak batch window: 3 hours for daily backfill + compaction
- Query SLA: common analyst queries under 30 seconds
- Retention: 180 days hot, 2 years cold
- Join pattern: the daily fact table joins with 200M-row advertiser and campaign dimension tables
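Taken together, these volumes fix a file-count budget before any design choice is made. A back-of-envelope sketch (the 512 MB target file size and the five-region split are assumptions for illustration):

```python
# Back-of-envelope file-count budget from the stated volumes.
# TARGET_FILE_MB and REGIONS are assumed values, not givens.
RAW_TB_PER_DAY = 25
TARGET_FILE_MB = 512      # assumed post-compaction target
REGIONS = 5               # hypothetical region count

raw_mb = RAW_TB_PER_DAY * 1024 * 1024             # 26,214,400 MB/day
files_per_day = raw_mb // TARGET_FILE_MB          # 51,200 files/day
files_per_date_region = files_per_day // REGIONS  # 10,240 per partition

print(f"{files_per_day:,} files/day; {files_per_date_region:,} per (date, region)")
# Over the 180-day hot tier this is ~9.2M files, which is why the
# requirements below call out compaction explicitly.
```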
Requirements
- Design a batch ETL pipeline in Spark that ingests raw event logs, validates schema, and writes curated Hive tables.
- Explain the difference between partitioning and bucketing in terms of physical layout, pruning, shuffle reduction, and operational trade-offs.
- Propose a table design for a large fact table and at least one dimension table, including which columns to partition and/or bucket on.
- Show how your design supports common filters on event_date and region, and joins on advertiser_id.
- Include compaction, backfill, and idempotent rerun behavior (a minimal write sketch follows this list).
- Define monitoring for data quality, partition health, and job performance.
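As a reference point for the compaction and rerun bullets above, here is a minimal sketch of an idempotent daily write using Spark's dynamic partition overwrite; the table name, path, abridged validation, and repartition factor are all illustrative:

```python
# Sketch: idempotent daily run. With dynamic partition overwrite, a
# rerun for a date replaces exactly that date's partitions and nothing
# else. All names and tuning values are illustrative.
from pyspark.sql import SparkSession, functions as F

spark = (
    SparkSession.builder
    .appName("daily-ads-etl")
    .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
    .enableHiveSupport()
    .getOrCreate()
)

def run_day(event_date: str) -> None:
    raw = spark.read.parquet(f"/lake/raw/ad_events/dt={event_date}")  # hypothetical path

    curated = (raw
        .where(F.col("advertiser_id").isNotNull())   # abridged schema/quality gate
        .withColumn("event_date", F.lit(event_date)))

    # Bound the number of output files per run so reruns and late-data
    # rewrites never multiply small files; 200 is an assumed tuning value.
    (curated
        .repartition(200)
        .write
        .mode("overwrite")   # dynamic mode: only the touched partitions are replaced
        # insertInto is positional: columns must match the target table,
        # with partition columns last.
        .insertInto("curated.fact_ad_events"))
```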
Constraints
- Storage format must remain Hive/Spark compatible and queryable from Presto/Trino-style engines used internally at Meta.
- Budget pressure requires minimizing unnecessary shuffle and file counts.
- Late-arriving data up to 48 hours old must be supported (see the sketch after this list).
- Schema evolution must be backward compatible for downstream consumers.
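The 48-hour constraint composes naturally with an idempotent daily write: each run simply recomputes a trailing window of dates. A sketch, reusing the hypothetical run_day helper from the Requirements section:

```python
# Sketch: late-data handling under the 48-hour rule. Each run rewrites
# the run date plus the two prior dates; correctness relies on run_day
# (the hypothetical helper sketched earlier) being idempotent.
from datetime import date, timedelta

def dates_to_process(run_date: date, late_window_days: int = 2) -> list[str]:
    """Run date plus the trailing late-arrival window, newest first."""
    return [
        (run_date - timedelta(days=d)).isoformat()
        for d in range(late_window_days + 1)
    ]

# e.g. the 2024-06-10 run recomputes 2024-06-10, 2024-06-09, 2024-06-08
for dt in dates_to_process(date(2024, 6, 10)):
    run_day(dt)  # idempotent daily write sketched above
```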