Context
Northstar Manufacturing runs a 20-year-old on-prem ERP on Oracle 11g and several flat-file exports from warehouse management and finance systems. Today, analysts rely on nightly CSV drops and manual SQL scripts, causing 12-24 hour delays, duplicate loads, and inconsistent customer/order reporting in the cloud data platform.
You need to design a production-grade pipeline that integrates these legacy systems with modern cloud applications on AWS and Snowflake, while preserving source-of-truth accuracy and supporting incremental migration away from batch file transfers.
Scale Requirements
- Sources: Oracle ERP, SFTP CSV exports, and a SOAP-based order status API
- Volume: 250M ERP rows, 8M change events/day, 1.5 TB/day raw ingest
- Batch cadence: core finance tables every 15 minutes; full reconciliation nightly
- Latency target: critical order/inventory data queryable in Snowflake within 10 minutes
- Retention: 7 years for finance data, 180 days for raw landing files
- Availability: 99.9% pipeline success for scheduled runs
Requirements
- Build ingestion for Oracle CDC, SFTP-delivered CSV files, and periodic SOAP API pulls.
- Support both ETL/ELT patterns: raw landing, standardized staging, and curated business tables.
- Ensure idempotent loads, late-arriving data handling, and replay/backfill for 30 days.
- Implement schema drift detection, row-count reconciliation, and duplicate detection before publishing curated tables.
- Orchestrate dependencies across ingestion, validation, transformation, and downstream data product refreshes.
- Expose analytics-ready models for orders, invoices, inventory snapshots, and customer master data.
- Provide monitoring, alerting, and operational recovery procedures for partial failures.
Constraints
- AWS is the mandated cloud; Snowflake is the enterprise warehouse.
- Legacy Oracle cannot tolerate more than 5% additional load during business hours.
- Budget cap: $35K/month incremental infrastructure spend.
- SOX compliance requires auditable lineage, load history, and controlled access to finance data.
- The team has Airflow and dbt experience, but limited expertise with custom CDC tooling.