Context
Meta is launching a new cross-region business performance reporting process for Finance, Operations, and regional strategy teams. Today, APAC, EMEA, and NAMER each run separate batch jobs and apply spreadsheet-based adjustments, which leads to inconsistent metric definitions, slow backfills, and frequent reconciliation issues.
You need to design a unified pipeline using Meta-style internal data platforms that can ingest regional source data, standardize business logic, and publish trusted reporting tables that scale across teams without duplicating logic.
Scale Requirements
- Regions: 12 global regions, 150+ country/business-unit combinations
- Source systems: 40+ upstream datasets (Ads delivery, billing, CRM, workforce, regional finance adjustments)
- Volume: ~8 TB/day raw input, 25B rows/day appended across all sources
- Freshness: hourly for operational reporting; daily certified close tables by 6 AM local time in each region
- Concurrency: 500+ internal dashboard users, 80+ downstream scheduled reports
- Retention: 3 years detailed data, 7 years monthly rollups for audit support
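As a rough sanity check on the figures above, the following sketch derives average row size, sustained row rate, per-hour batch size, and raw detailed-retention footprint. It assumes decimal units (1 TB = 10^12 bytes), an even spread of load across the day, and no compression or replication; none of these assumptions are stated in the requirements.

```python
# Back-of-envelope sizing from the stated scale figures.
# Assumptions (not from the requirements): decimal units, uniform load,
# no compression, no replication, 365-day years.
RAW_BYTES_PER_DAY = 8 * 10**12       # ~8 TB/day raw input
ROWS_PER_DAY = 25 * 10**9            # 25B rows/day across all sources
DETAILED_RETENTION_DAYS = 3 * 365    # 3 years of detailed data

avg_row_bytes = RAW_BYTES_PER_DAY / ROWS_PER_DAY          # bytes per row
rows_per_second = ROWS_PER_DAY / 86_400                   # sustained average
rows_per_hourly_run = ROWS_PER_DAY / 24                   # one hourly increment
gb_per_hourly_run = RAW_BYTES_PER_DAY / 24 / 10**9        # GB per hourly increment
detailed_retention_pb = RAW_BYTES_PER_DAY * DETAILED_RETENTION_DAYS / 10**15

print(f"avg row size:        {avg_row_bytes:.0f} bytes")
print(f"sustained rate:      {rows_per_second:,.0f} rows/s")
print(f"hourly increment:    {rows_per_hourly_run:,.0f} rows, {gb_per_hourly_run:.0f} GB")
print(f"3-year raw footprint: {detailed_retention_pb:.2f} PB")
```

Under these assumptions the average row is about 320 bytes, each hourly increment is roughly a billion rows, and three years of detailed data comes to about 8.76 PB before compression.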
Requirements
- Build a reusable ingestion and transformation framework so new regions can onboard with configuration, not custom code.
- Standardize metric definitions, calendar alignment, currency conversion, and regional hierarchy mapping.
- Support both hourly incremental processing and large historical backfills without double counting.
- Publish curated reporting tables for internal Meta dashboards and analyst self-serve queries.
- Implement strong data quality controls for schema drift, late-arriving files, duplicate loads, and reconciliation to source totals.
- Design orchestration with dependency management across regional cutoffs and shared reference data.
- Define monitoring, alerting, and recovery procedures for failed regional runs.
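One common way to meet the "no double counting" requirement is to make every write replace a whole partition rather than append to it, which is the behavior of an `INSERT OVERWRITE` into a Hive partition. The sketch below models that idea in memory. `PartitionKey`, `ReportingTable`, and the `load_id` values are hypothetical names used only for illustration, and the region/date/hour partitioning scheme is an assumption rather than something the requirements specify.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class PartitionKey:
    """Hypothetical partition identity: one region, one date, one hour."""
    region: str
    ds: str      # partition date, e.g. "2024-01-15"
    hour: int


@dataclass
class ReportingTable:
    """In-memory stand-in for a partitioned reporting table."""
    partitions: dict = field(default_factory=dict)

    def overwrite_partition(self, key, rows, load_id):
        # Replace the partition instead of appending, so a retry or a
        # backfill of the same hour can never add its rows twice.
        self.partitions[key] = {"load_id": load_id, "rows": list(rows)}

    def total(self, metric):
        return sum(
            row[metric]
            for part in self.partitions.values()
            for row in part["rows"]
        )


table = ReportingTable()
key = PartitionKey(region="APAC", ds="2024-01-15", hour=9)

# Hourly incremental run, followed by an accidental retry of the same hour.
hourly_rows = [{"revenue": 100}, {"revenue": 50}]
table.overwrite_partition(key, hourly_rows, load_id="hourly-001")
table.overwrite_partition(key, hourly_rows, load_id="hourly-001-retry")
total_after_retry = table.total("revenue")

# A later backfill picks up one late-arriving row for the same hour.
backfill_rows = hourly_rows + [{"revenue": 25}]
table.overwrite_partition(key, backfill_rows, load_id="backfill-001")
total_after_backfill = table.total("revenue")

print(total_after_retry, total_after_backfill)
```

Because hourly runs and backfills go through the same overwrite path, the retry leaves the total unchanged and the backfill replaces the partition with the corrected data. Recording the `load_id` alongside each partition also gives a simple hook for duplicate-load detection.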
Constraints
- Existing storage is in a Hive/Presto-compatible Meta data lake with scheduled workflows already managed centrally.
- Regional teams can supply mapping files and business rules, but central data engineering owns pipeline code.
- Financial reporting outputs must be auditable, reproducible, and support row-level lineage to source loads.
- Budget favors shared batch infrastructure over standing real-time systems; avoid region-specific bespoke pipelines.
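The auditability constraint can be illustrated with a small sketch: derive a deterministic load identifier from the source file contents, stamp it on every row, and reconcile the loaded rows against a control total supplied by the source. All names here (`load_id_for`, `stamp_rows`, `reconcile`, `source_load_id`, `source_row_num`, and the sample source name) are hypothetical, and the use of a content hash as the load identifier is one possible design, not something the constraints prescribe.

```python
import hashlib
from decimal import Decimal


def load_id_for(source_name, file_bytes):
    """Deterministic load id: same source and same bytes give the same id."""
    digest = hashlib.sha256(file_bytes).hexdigest()[:16]
    return f"{source_name}:{digest}"


def stamp_rows(rows, load_id):
    """Attach lineage columns so each output row traces back to a source load."""
    return [
        {**row, "source_load_id": load_id, "source_row_num": i}
        for i, row in enumerate(rows)
    ]


def reconcile(rows, control_total, metric="amount"):
    """Compare the loaded sum to the source control total using exact decimals."""
    loaded = sum((Decimal(r[metric]) for r in rows), Decimal("0"))
    return loaded == Decimal(control_total)


# Illustrative data only.
raw_file = b"amount\n10.50\n4.50\n"
rows = [{"amount": "10.50"}, {"amount": "4.50"}]

load_id = load_id_for("emea_billing", raw_file)
stamped = stamp_rows(rows, load_id)
is_reconciled = reconcile(stamped, control_total="15.00")

print(load_id, is_reconciled)
```

Hashing the file contents means a re-delivered identical file yields the same identifier, which supports reproducibility, while `Decimal` avoids the rounding drift that floating-point sums can introduce into financial reconciliation.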