Context
FinLedger, a B2B payments platform, ingests transaction, refund, and settlement data from PostgreSQL, partner SFTP drops, and Kafka event streams into Snowflake for finance and operations reporting. The current nightly ETL pipeline frequently misses SLA during month-end close, and duplicate or partially loaded records have caused reconciliation issues.
You need to redesign the pipeline to handle larger datasets while preserving strict data integrity across batch and near-real-time loads.
Scale Requirements
- Sources: 3 PostgreSQL OLTP databases, 2 Kafka topics, 1 daily SFTP file feed
- Volume: 2.5B records/month, ~6 TB raw compressed data/month
- Peak batch load: 180M rows in a 4-hour nightly window
- Streaming rate: 25K events/sec peak for payment status updates
- Latency target: <15 minutes for streaming data, <4 hours for nightly batch completion
- Retention: 7 years for curated finance tables, 180 days for raw landing data
Requirements
- Design an ingestion and transformation pipeline that supports both nightly batch ETL and incremental streaming updates.
- Ensure data integrity through schema validation, deduplication, idempotent reprocessing, and referential consistency between transactions, refunds, and settlements.
- Support backfills for up to 90 days without corrupting downstream aggregates or creating duplicates.
- Load analytics-ready tables into Snowflake with audit columns, lineage metadata, and reproducible transformations.
- Orchestrate dependencies so downstream finance models run only after source completeness and quality checks pass.
- Provide monitoring for freshness, row-count drift, duplicate rates, failed loads, and reconciliation mismatches.
- Include a recovery strategy for partial batch failures, late-arriving files, and schema changes.
Constraints
- Existing stack is AWS-based and should be reused where practical.
- Finance data is SOX-auditable; every load must be traceable by batch_id and source file/offset.
- Budget allows moderate cluster autoscaling, but not a full platform rewrite.
- Source systems cannot tolerate heavy read load during business hours.