Context
MSCI’s analytics and index teams depend on daily and intraday security master, market data, and ESG inputs to power downstream datasets consumed in MSCI Data Lake and client-facing analytics products. The current pipeline is a mix of scheduled batch jobs and ad hoc reload scripts, which has led to occasional duplicate loads, weak lineage visibility, and inconsistent security controls across environments.
You are responsible for redesigning this pipeline so it is both reliable and secure while preserving delivery SLAs for downstream consumers.
Scale Requirements
- Sources: 25 upstream providers and internal MSCI systems
- Volume: 8 TB/day raw data, ~120 million records/day
- Cadence: hourly intraday loads plus one end-of-day batch
- Latency: intraday datasets available in <15 minutes from source arrival
- Retention: 7 years for curated data, 90 days for raw landing files
- Availability target: 99.9% of scheduled pipeline runs succeed each month
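To make these figures concrete, the rough sizing arithmetic below may help. It is a back-of-the-envelope sketch, not a capacity plan. It assumes decimal terabytes, a 30-day month, and that every source runs 24 hourly loads plus one end-of-day batch, each counted as one scheduled run. None of those assumptions is stated above. The results are daily averages; real arrival is likely bursty, so peak load will be higher.

```python
# Figures taken from the scale requirements above.
RAW_BYTES_PER_DAY = 8 * 10**12   # 8 TB/day (assumption: decimal terabytes)
RECORDS_PER_DAY = 120_000_000
SOURCES = 25
RAW_RETENTION_DAYS = 90

# Assumptions that are NOT in the requirements.
RUNS_PER_SOURCE_PER_DAY = 24 + 1  # 24 hourly loads plus one end-of-day batch
DAYS_PER_MONTH = 30
SECONDS_PER_DAY = 86_400

# Average record size and sustained ingest rate if data arrived evenly.
avg_record_bytes = RAW_BYTES_PER_DAY / RECORDS_PER_DAY
avg_throughput_mb_s = RAW_BYTES_PER_DAY / SECONDS_PER_DAY / 10**6

# Failure budget implied by the 99.9% target (0.1% of runs may fail).
scheduled_runs_per_month = SOURCES * RUNS_PER_SOURCE_PER_DAY * DAYS_PER_MONTH
allowed_failed_runs = scheduled_runs_per_month // 1000

# Steady-state raw landing footprint before compression.
raw_landing_tb = RAW_BYTES_PER_DAY * RAW_RETENTION_DAYS // 10**12

print(f"average record size: {avg_record_bytes / 1000:.1f} KB")      # 66.7 KB
print(f"average ingest rate: {avg_throughput_mb_s:.1f} MB/s")         # 92.6 MB/s
print(f"scheduled runs/month: {scheduled_runs_per_month}")            # 18750
print(f"failed runs allowed/month: {allowed_failed_runs}")            # 18
print(f"raw landing footprint: {raw_landing_tb} TB")                  # 720 TB
```

Under these assumptions the failure budget is about 18 runs per month across all sources, which leaves little room for reruns that fail for the same root cause.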
Requirements
- Design an ingestion and transformation pipeline for batch and micro-batch feeds into MSCI Data Lake and curated warehouse layers.
- Ensure idempotent processing so reruns, retries, and backfills do not create duplicates or corrupt downstream tables.
- Implement security controls for data in transit and at rest, including role-based access, secrets management, and auditability.
- Add data quality checks for schema drift, null spikes, duplicate business keys, and reconciliation against source control totals.
- Orchestrate dependencies across ingestion, validation, transformation, and publish steps with clear failure isolation.
- Support replay/backfill for a single source, date range, or partition without impacting unrelated pipelines.
- Provide monitoring, alerting, and operational runbooks for on-call engineers.
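One possible way to meet the idempotency and data quality requirements together is to replace each (source, business date) partition atomically, after validating the batch. The sketch below illustrates that pattern only; it uses an in-memory SQLite database as a stand-in for the curated layer. The table name `curated_prices`, the columns, the function names, and the sample identifiers are all hypothetical.

```python
import sqlite3
from typing import Iterable, Tuple

# (security_id, price): a hypothetical business key and value.
Record = Tuple[str, float]


def init_schema(conn: sqlite3.Connection) -> None:
    """Create the hypothetical curated table, keyed by partition and business key."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS curated_prices (
               source TEXT NOT NULL,
               business_date TEXT NOT NULL,
               security_id TEXT NOT NULL,
               price REAL NOT NULL,
               PRIMARY KEY (source, business_date, security_id)
           )"""
    )


def load_partition(
    conn: sqlite3.Connection,
    source: str,
    business_date: str,
    records: Iterable[Record],
    control_count: int,
) -> int:
    """Replace one (source, business_date) partition; safe to rerun."""
    rows = list(records)

    # Quality gates run before any write, so a bad batch changes nothing.
    keys = [security_id for security_id, _ in rows]
    if len(keys) != len(set(keys)):
        raise ValueError("duplicate business keys in batch")
    if len(rows) != control_count:
        raise ValueError(
            f"control total mismatch: got {len(rows)}, expected {control_count}"
        )

    # One transaction: commits on success, rolls back if anything raises.
    with conn:
        conn.execute(
            "DELETE FROM curated_prices WHERE source = ? AND business_date = ?",
            (source, business_date),
        )
        conn.executemany(
            "INSERT INTO curated_prices VALUES (?, ?, ?, ?)",
            [(source, business_date, sid, price) for sid, price in rows],
        )
    return len(rows)


# Usage with made-up data: loading the same partition twice leaves two rows.
conn = sqlite3.connect(":memory:")
init_schema(conn)
batch = [("SEC-001", 10.0), ("SEC-002", 20.0)]
load_partition(conn, "vendor_a", "2024-01-31", batch, control_count=2)
load_partition(conn, "vendor_a", "2024-01-31", batch, control_count=2)  # rerun
print(conn.execute("SELECT COUNT(*) FROM curated_prices").fetchone()[0])  # 2
```

Because the delete is scoped to one source and one business date, a replay or backfill of that partition does not touch unrelated sources or dates. A warehouse implementation would more likely use a merge or partition overwrite, but the contract is the same: a rerun converges to the same final state.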
Constraints
- Primary cloud footprint is AWS.
- Downstream consumers already query curated datasets from Snowflake and MSCI internal analytics surfaces.
- Some feeds contain material non-public or licensed data and must be segregated by entitlement.
- Team size is 5 engineers; solution should minimize bespoke operational overhead.
- Budget allows managed services where they materially reduce reliability risk or security risk.
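The entitlement constraint above could be modelled as a deny-by-default check plus a storage layout that keeps each classification under its own prefix. The sketch below is one illustration under assumed names: the three classification levels, the `Feed` type, the prefix scheme, and both functions are hypothetical, and a real design would enforce this in IAM and warehouse grants, not in application code alone.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet


class Classification(Enum):
    """Hypothetical sensitivity levels; the real taxonomy may differ."""
    PUBLIC = "public"
    LICENSED = "licensed"
    MNPI = "mnpi"  # material non-public information


@dataclass(frozen=True)
class Feed:
    name: str
    classification: Classification


def landing_prefix(feed: Feed) -> str:
    """Keep each classification under its own prefix so access policies
    can be attached per prefix (assumed layout)."""
    return f"{feed.classification.value}/{feed.name}/"


def can_read(entitlements: FrozenSet[Classification], feed: Feed) -> bool:
    """Deny by default: access requires an explicit entitlement for the
    feed's classification."""
    return feed.classification in entitlements


# Usage with made-up feeds and a reader entitled to public and licensed data.
reader = frozenset({Classification.PUBLIC, Classification.LICENSED})
prices = Feed("vendor_a_prices", Classification.LICENSED)
restricted = Feed("internal_restricted", Classification.MNPI)

print(landing_prefix(prices))        # licensed/vendor_a_prices/
print(can_read(reader, prices))      # True
print(can_read(reader, restricted))  # False
```

Entitlements are not treated as hierarchical here: holding `LICENSED` does not imply `MNPI`, and each must be granted explicitly.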