Context
Meta’s Ads Insights platform ingests campaign performance events from multiple producers, including Ads Delivery, Billing, and conversion reporting systems. Today, downstream Hive and Presto consumers often discover breaking schema changes only after scheduled Airflow backfills or near-real-time Flink jobs fail, causing dashboard delays and inconsistent metrics.
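The class of break described above can often be caught before publication with a field-level backward-compatibility check run by the producer. The sketch below is a hypothetical simplification (rule set and field names are assumptions, not Meta's actual tooling):

```python
# Illustrative backward-compatibility check a producer could run before
# shipping a schema change. "Backward compatible" here means existing
# readers keep working: no removed fields, no retyped fields, no new
# required fields without defaults.

def is_backward_compatible(old_schema: dict, new_schema: dict) -> list[str]:
    """Return a list of violations; an empty list means the change is safe."""
    violations = []
    for name, old_field in old_schema.items():
        new_field = new_schema.get(name)
        if new_field is None:
            violations.append(f"removed field: {name}")
        elif new_field["type"] != old_field["type"]:
            violations.append(
                f"retyped field: {name} "
                f"({old_field['type']} -> {new_field['type']})")
    for name, new_field in new_schema.items():
        if name not in old_schema and new_field.get("required", False):
            violations.append(f"new required field without default: {name}")
    return violations

old = {"campaign_id": {"type": "long"}, "spend": {"type": "double"}}
new = {"campaign_id": {"type": "long"}, "spend": {"type": "string"},
       "region": {"type": "string", "required": True}}
print(is_backward_compatible(old, new))
```

Retyping `spend` and adding a required `region` both trip the check, so this change would be blocked before any Flink or Hive consumer sees it.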
You are asked to design a data contract framework for shared pipelines so producer teams can evolve schemas safely while consumer teams retain predictable SLAs. The goal is to define what a data contract is, who owns it, and how contract enforcement should work across batch and streaming systems.
Scale Requirements
- Producers: 40+ upstream datasets across Ads and Measurement
- Volume: 3B events/day, peak 250K events/sec on streaming topics
- Consumers: 150+ downstream tables, ML features, and reporting jobs
- Latency: streaming validation must complete within 2 minutes of event arrival; batch contract checks must finish before each hourly publication
- Retention: 180 days raw, 3 years curated aggregates
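For sizing intuition, the listed volumes imply roughly a 7x peak-to-average burst ratio that any streaming validation layer must absorb (assuming a uniform 86,400-second day and "3B" read as 3e9 events):

```python
# Back-of-envelope rates derived from the scale requirements above.
events_per_day = 3_000_000_000
avg_rate = events_per_day / 86_400     # average events/sec over a day
peak_rate = 250_000                    # stated streaming peak
burst_ratio = peak_rate / avg_rate     # how much headroom validation needs
print(round(avg_rate), round(burst_ratio, 1))  # -> 34722 7.2
```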
Requirements
- Define the contents of a data contract for Meta pipelines: schema, semantics, freshness SLA, quality thresholds, ownership, and change policy.
- Specify ownership boundaries between producer teams, central data platform, and downstream consumers.
- Design enforcement for both streaming ingestion (Kafka/Flink) and batch publication (Hive/Spark).
- Support backward-compatible schema evolution, versioning, and deprecation windows.
- Prevent bad data from reaching curated datasets using automated validation, quarantine, and rollback paths.
- Expose contract status, violations, and lineage in a way that on-call engineers and analysts can act on quickly.
- Describe how orchestration should block downstream publishes when contract checks fail, while still allowing controlled overrides.
Constraints
- Prefer Meta-adjacent infrastructure: Kafka, Apache Flink, Spark, Hive, Presto, Airflow-like orchestration, and internal metadata services.
- Some producers are low-maturity teams and cannot manually coordinate every schema change.
- Contract checks must add less than 5% compute overhead to existing pipelines.
- PII fields require explicit classification and must remain compatible with data-deletion workflows.
Your answer should explain the contract model, ownership model, enforcement architecture, and operational playbook.