Context
Meta's internal operations teams need a scalable pipeline for system reporting, audit readiness, security administration, recurring data loads, and warehouse reporting across products such as Meta Business Suite and internal admin surfaces. Today, reporting is split across ad hoc SQL jobs, manual exports, and point-to-point scripts, which leads to inconsistent metrics, weak lineage, and poor controls for audit and access reviews.
Design a centralized data pipeline that ingests operational logs, access-control changes, admin actions, and reference data from Meta internal systems into a governed warehouse for reporting and compliance use cases.
Scale Requirements
- Sources: 40+ internal systems, including access logs, admin events, HR/user directory, and entitlement systems
- Volume: 12 TB/day raw data, 2.5B records/day, peak ingest 80K records/sec
- Latency: under 15 minutes end-to-end for security and audit dashboards; warehouse reporting tables ready by 6 AM daily
- Retention: 7 years for audit logs; 13 months of hot, queryable storage
- Availability: 99.9% of daily pipeline runs complete within their SLA
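As a rough sanity check on the figures above, the arithmetic can be sketched as follows. The decimal terabyte, the 395-day approximation of 13 months, and the reading of the availability target as one SLA check per day are assumptions made for illustration; none of them is stated in the requirements.

```python
SECONDS_PER_DAY = 86_400
TB = 10**12  # assumption: decimal terabytes (the source does not say TB vs TiB)


def capacity_summary(
    raw_tb_per_day=12,
    records_per_day=2_500_000_000,
    peak_records_per_sec=80_000,
    hot_days=395,        # assumption: 13 months approximated as 395 days
    sla_target=0.999,    # assumption: one SLA check per pipeline per day
):
    """Derive back-of-envelope sizing numbers from the stated scale figures."""
    raw_bytes_per_day = raw_tb_per_day * TB
    avg_records_per_sec = records_per_day / SECONDS_PER_DAY
    avg_record_bytes = raw_bytes_per_day / records_per_day
    return {
        "avg_records_per_sec": avg_records_per_sec,
        "avg_record_bytes": avg_record_bytes,
        # How much headroom the ingest tier needs over the daily average.
        "peak_to_avg_ratio": peak_records_per_sec / avg_records_per_sec,
        "peak_bytes_per_sec": peak_records_per_sec * avg_record_bytes,
        # Raw volume only: ignores compression, replication, and curated copies.
        "hot_raw_tb": raw_tb_per_day * hot_days,
        "allowed_sla_misses_per_year": 365 * (1 - sla_target),
    }


if __name__ == "__main__":
    for name, value in capacity_summary().items():
        print(f"{name}: {value:,.2f}")
```

Under these assumptions the average rate is about 28,935 records/sec at roughly 4,800 bytes per record, so the 80K records/sec peak is about 2.8x the average (about 384 MB/s). Hot storage holds roughly 4,740 TB of raw data before compression or replication, and a 99.9% target leaves room for well under one missed daily deadline per pipeline per year.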
Requirements
- Build ingestion for both batch extracts and near-real-time event streams from internal Meta systems.
- Standardize schemas for audit events, user/admin actions, entitlement changes, and reporting dimensions.
- Support idempotent reprocessing, late-arriving data, and historical backfills without double counting.
- Produce curated warehouse tables for system reporting, audit evidence, security administration reviews, and executive KPI reporting.
- Implement data quality checks for completeness, referential integrity, schema drift, and duplicate events.
- Enforce role-based access control, column-level protection for sensitive fields, and full lineage for auditability.
- Define orchestration, monitoring, and incident response for missed loads and broken dependencies.
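One way to read the idempotency, late-arrival, and duplicate-event requirements above is as a keyed merge into the validated layer. The sketch below is a minimal in-memory illustration, not a prescribed design: the `AuditEvent` fields, the `(source_system, event_id)` key, and the "latest `ingest_time` wins" rule are all hypothetical choices, and it assumes every source supplies a stable event ID and that timestamps are already normalized to UTC.

```python
from dataclasses import dataclass
from datetime import date, datetime
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class AuditEvent:
    """Hypothetical standardized audit event; field names are illustrative."""
    source_system: str
    event_id: str          # assumed stable and unique within a source system
    actor_id: str
    action: str
    event_time: datetime   # when the action happened (assumed UTC)
    ingest_time: datetime  # when the pipeline received this version (assumed UTC)


Key = Tuple[str, str]


def merge_events(table: Dict[Key, AuditEvent],
                 batch: Iterable[AuditEvent]) -> Dict[str, int]:
    """Upsert a batch keyed on (source_system, event_id).

    Replaying the same batch changes nothing, so reprocessing and
    backfills cannot double count.
    """
    stats = {"inserted": 0, "replaced": 0, "duplicates": 0}
    for event in batch:
        key = (event.source_system, event.event_id)
        current = table.get(key)
        if current is None:
            table[key] = event
            stats["inserted"] += 1
        elif event.ingest_time > current.ingest_time:
            # A newer version of the same event replaces the older one.
            table[key] = event
            stats["replaced"] += 1
        else:
            # Same or older version: count it for the duplicate-rate check.
            stats["duplicates"] += 1
    return stats


def daily_counts(table: Dict[Key, AuditEvent]) -> Dict[date, int]:
    """Count events per event_time day, so late arrivals land on the right day."""
    counts: Dict[date, int] = {}
    for event in table.values():
        day = event.event_time.date()
        counts[day] = counts.get(day, 0) + 1
    return counts
```

Because counts are grouped by `event_time` rather than `ingest_time`, an event that arrives two days late still lands in the day it occurred. The `duplicates` counter can also feed a data quality check on duplicate events. In a real warehouse the same idea would typically be expressed as a merge or partition-overwrite on the key rather than a Python dictionary.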
Constraints
- Prefer Meta-native/internal platforms where possible, but you may reference industry-standard equivalents for clarity.
- The solution must separate raw, validated, and curated layers.
- Compliance requirements include SOX-style auditability, least-privilege access, and reproducible historical reporting.
- Team size is limited: 5 data engineers and 1 analytics engineer, so operational simplicity matters.
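The least-privilege constraint, together with the column-level protection requirement, could be sketched as a deny-by-default projection applied when curated rows are served. This is only an illustration under stated assumptions: the role names, the sensitive column names, and the `project_for_role` helper are hypothetical, and a production system would enforce this in the warehouse's own access layer rather than in application code.

```python
from typing import Dict, FrozenSet, Mapping

REDACTED = "[REDACTED]"

# Hypothetical sensitive columns in a curated audit table.
SENSITIVE_COLUMNS: FrozenSet[str] = frozenset({"actor_email", "ip_address"})

# Hypothetical role grants: which sensitive columns each role may see in clear.
ROLE_GRANTS: Dict[str, FrozenSet[str]] = {
    "security_admin": frozenset({"actor_email", "ip_address"}),
    "auditor": frozenset({"actor_email"}),
    "analyst": frozenset(),
}


def project_for_role(row: Mapping[str, object], role: str) -> Dict[str, object]:
    """Return a copy of the row with ungranted sensitive columns redacted.

    Unknown roles get no grants, so the default is to hide sensitive data.
    """
    allowed = ROLE_GRANTS.get(role, frozenset())
    visible: Dict[str, object] = {}
    for column, value in row.items():
        if column in SENSITIVE_COLUMNS and column not in allowed:
            visible[column] = REDACTED
        else:
            visible[column] = value
    return visible
```

Keeping the grants in one small, reviewable mapping also suits the team-size constraint: an access review reduces to auditing a single table of roles and columns.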