You are rebuilding a finance data pipeline that consolidates transactions, vendor master data, purchase orders, and general ledger balances from multiple ERP systems into a single reporting layer. The current process relies on separate nightly extracts and spreadsheet reconciliations, and finance leadership has escalated repeated mismatches between management reports and source ledgers. Recent audit findings highlighted duplicate records, inconsistent chart-of-accounts mappings, and missing lineage for manual corrections. You need a pipeline that preserves data integrity across systems while supporting reliable month-end close and daily reporting.
| Component | Status |
|---|---|
| Source Systems | SAP S/4HANA, SAP ECC, and a regional Oracle ERP instance |
| Ingestion | Nightly CSV exports over SFTP and limited JDBC pulls |
| Processing | Python ETL scripts on virtual machines |
| Storage | Raw files in object storage and finance marts in Snowflake |
| Orchestration | Apache Airflow 2.x |
| Data Quality | Manual reconciliations and ad hoc SQL checks |
Scale: ~120 source tables, 35M ledger and subledger rows/day, 8 years of historical backfill, daily refresh target by 06:00 UTC, and month-end peak volume 2.5x normal load.
How would you design this pipeline so that data pulled from multiple ERP systems remains complete, consistent, and auditable from ingestion through reporting, while still handling schema differences, reprocessing, and month-end spikes?