Context
Northstar Manufacturing runs SAP ECC for finance, procurement, and order management, but analytics teams now use Snowflake for reporting and forecasting. The current integration relies on nightly CSV exports and custom Python scripts, causing stale data, duplicate loads, and frequent failures when SAP tables change or large backfills are required.
You are asked to design a production-grade pipeline that extracts data from SAP and related legacy ERP interfaces, lands it in a cloud data platform, and publishes analytics-ready models with strong data quality guarantees.
Scale Requirements
- Source systems: SAP ECC + 3 legacy ERP side systems
- Tables/interfaces: ~450 SAP tables, of which 120 are high-priority
- Daily volume: 2.5 TB/day of raw extracts (~8 billion rows/day)
- Change rate: 150M changed rows/day across finance, inventory, and sales
- Latency target: < 15 minutes for priority entities, < 4 hours for long-tail tables
- Backfill requirement: Reprocess 24 months of history within 72 hours
- Retention: Raw zone 1 year, curated warehouse 7 years
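To size the design, it can help to turn the figures above into sustained rates. The following is a back-of-envelope sketch, not a capacity plan. It assumes decimal units (1 TB = 10^12 bytes), a uniform load across 24 hours, and 24 months taken as roughly 730 days. None of these assumptions come from the brief, and real SAP traffic will be burstier than the averages suggest.

```python
# Back-of-envelope rates derived from the scale requirements.
# Assumptions (not in the brief): decimal units, uniform load over the day,
# and 24 months approximated as 730 days.
SECONDS_PER_DAY = 24 * 60 * 60

raw_bytes_per_day = 2.5e12       # 2.5 TB of raw extracts per day
rows_per_day = 8e9               # ~8 billion rows per day
changed_rows_per_day = 150e6     # 150M changed rows per day

avg_mb_per_s = raw_bytes_per_day / SECONDS_PER_DAY / 1e6
avg_rows_per_s = rows_per_day / SECONDS_PER_DAY
avg_changed_rows_per_s = changed_rows_per_day / SECONDS_PER_DAY
avg_bytes_per_row = raw_bytes_per_day / rows_per_day
changed_fraction = changed_rows_per_day / rows_per_day

# Backfill: 24 months of history must be reprocessed within 72 hours,
# so replay has to run this many times faster than real time.
backfill_hours_of_history = 730 * 24
backfill_speedup = backfill_hours_of_history / 72

print(f"average ingest rate:   {avg_mb_per_s:.1f} MB/s")
print(f"average row rate:      {avg_rows_per_s:,.0f} rows/s")
print(f"average change rate:   {avg_changed_rows_per_s:,.0f} rows/s")
print(f"average row size:      {avg_bytes_per_row:.1f} bytes")
print(f"changed share of rows: {changed_fraction:.3%}")
print(f"backfill speedup:      {backfill_speedup:.1f}x real time")
```

Under these assumptions the averages are about 29 MB/s and 93,000 rows/s, with fewer than 2% of daily rows actually changed. This suggests incremental capture, rather than full extracts, is what makes the 15-minute latency target plausible. The backfill has to replay history roughly 243 times faster than it accumulated.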
Requirements
- Design ingestion for both full extracts and CDC/incremental loads from SAP tables and IDoc/BAPI-based interfaces.
- Handle common bottlenecks: source-side locking, limited SAP extractor windows, schema drift, duplicate records, and slow downstream warehouse merges.
- Build bronze/silver/gold layers with auditability, replay support, and idempotent processing.
- Support SCD handling for master data and near-real-time fact updates for orders, deliveries, and inventory movements.
- Define orchestration, dependency management, and backfill strategy across hundreds of entities.
- Implement data quality checks for completeness, reconciliation against SAP control totals, and freshness SLAs.
- Expose curated tables for BI and finance reporting without disrupting SAP production workloads.
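The idempotency and reconciliation requirements above can be illustrated with a small in-memory sketch. It is one possible approach, not a prescribed design: the `ChangeRecord` shape, the `sequence` field (assumed to be a monotonically increasing change number supplied by the extractor), and the control-total inputs are all hypothetical. In a real pipeline the same "latest sequence wins" rule would typically be expressed as a warehouse merge rather than a Python dictionary.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class ChangeRecord:
    key: str         # business key, e.g. a document number (hypothetical)
    sequence: int    # assumed monotonically increasing change number
    op: str          # "U" for upsert, "D" for delete
    amount: Decimal  # measure reconciled against a control total


def apply_changes(state: dict[str, ChangeRecord],
                  batch: list[ChangeRecord]) -> dict[str, ChangeRecord]:
    """Apply a batch so that replaying it leaves the state unchanged.

    Deletes are kept as tombstones so a late or replayed update with an
    older sequence cannot resurrect a deleted row.
    """
    new_state = dict(state)
    for rec in sorted(batch, key=lambda r: r.sequence):
        current = new_state.get(rec.key)
        if current is not None and current.sequence >= rec.sequence:
            continue  # duplicate or stale change: skip it
        new_state[rec.key] = rec
    return new_state


def reconcile(state: dict[str, ChangeRecord],
              control_count: int, control_total: Decimal) -> bool:
    """Compare live rows against source-side control totals."""
    live = [r for r in state.values() if r.op != "D"]
    total = sum((r.amount for r in live), Decimal("0"))
    return len(live) == control_count and total == control_total


batch = [
    ChangeRecord("4500001", 1, "U", Decimal("100.00")),
    ChangeRecord("4500001", 3, "U", Decimal("120.00")),
    ChangeRecord("4500001", 2, "U", Decimal("110.00")),  # late arrival
    ChangeRecord("4500002", 4, "U", Decimal("50.00")),
    ChangeRecord("4500002", 4, "U", Decimal("50.00")),   # duplicate delivery
    ChangeRecord("4500003", 5, "U", Decimal("75.00")),
    ChangeRecord("4500003", 6, "D", Decimal("0")),       # row deleted at source
]

once = apply_changes({}, batch)
twice = apply_changes(once, batch)  # replay of the same batch
print(once == twice)
print(reconcile(once, control_count=2, control_total=Decimal("170.00")))
```

Both checks print `True`: replaying the batch changes nothing, and the two surviving rows match the assumed control count and total. Keying deduplication on a source-provided sequence rather than on load time is a design choice that makes replays and backfills safe, at the cost of requiring the extractor to supply a reliable ordering.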
Constraints
- Existing cloud stack is AWS; Snowflake is the enterprise warehouse.
- SAP Basis team allows only limited extraction windows and strict API concurrency caps.
- Budget increase is capped at $35K/month.
- SOX compliance requires lineage, load audit logs, and reproducible historical snapshots.
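The API concurrency cap above is a hard constraint the extraction layer has to enforce itself. Below is a minimal sketch of one way to do that with an `asyncio.Semaphore`. The cap value, the `extract_table` coroutine, and the table names are illustrative assumptions; the real limit would come from the SAP Basis team, and the `asyncio.sleep` call stands in for an actual extraction request.

```python
import asyncio

MAX_CONCURRENT_EXTRACTS = 3  # hypothetical cap; the real value is set by SAP Basis


async def extract_table(table: str, limiter: asyncio.Semaphore, stats: dict) -> str:
    """Run one extraction while holding a slot from the shared limiter."""
    async with limiter:
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
        try:
            await asyncio.sleep(0.01)  # placeholder for the real extraction call
        finally:
            stats["active"] -= 1
    return table


async def run_extracts(tables: list[str],
                       cap: int = MAX_CONCURRENT_EXTRACTS) -> tuple[list[str], int]:
    """Extract all tables, never exceeding `cap` concurrent requests."""
    limiter = asyncio.Semaphore(cap)
    stats = {"active": 0, "peak": 0}
    done = await asyncio.gather(
        *(extract_table(t, limiter, stats) for t in tables)
    )
    return list(done), stats["peak"]


# Illustrative table names only; the brief does not list specific tables.
tables = ["VBAK", "VBAP", "LIKP", "LIPS", "MKPF", "MSEG", "BKPF", "BSEG"]
completed, peak = asyncio.run(run_extracts(tables))
print(f"extracted {len(completed)} tables, peak concurrency {peak}")
```

A single shared limiter keeps the cap global across all entities, which matters when hundreds of tables are scheduled independently. A fuller design would likely add priority ordering, so that the 15-minute entities are not queued behind long-tail tables, and would record each request in the load audit log.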