Context
FinLedger, a B2B payments platform, ingests transaction, refund, and settlement data from PostgreSQL OLTP databases, third-party payment processor APIs, and S3-delivered CSV files from banking partners. The current pipeline uses nightly Python cron jobs to load data into Snowflake, but analysts frequently find duplicate records, missing settlements, and inconsistent balances across downstream tables.
You need to redesign the pipeline so that data integrity is enforced end-to-end, while continuing to support daily financial reporting and near-real-time operational reconciliation.
Scale Requirements
- Sources: 3 PostgreSQL databases, 2 external REST APIs, 1 S3 partner file feed
- Volume: 120M transaction-related records/day, ~1.5 TB raw/day
- Peak ingestion: 8K records/sec from CDC streams, 200 partner files/day
- Latency: operational reconciliation tables must reflect source events within 10 minutes; finance reporting tables within 2 hours
- Retention: raw immutable data for 7 years; curated warehouse tables indefinitely
Requirements
- Design an ingestion and transformation pipeline that guarantees no silent data loss and minimizes duplicates.
- Support idempotent reprocessing for failed jobs, backfills, and late-arriving partner files (a keyed-merge sketch follows this list).
- Implement data integrity controls, including schema validation, record-level deduplication, referential integrity checks, and reconciliation between source totals and warehouse totals (see the reconciliation sketch after this list).
- Track lineage from source extract to curated Snowflake tables for auditability (a minimal lineage stamp is sketched below).
- Orchestrate batch and micro-batch jobs with clear dependency management and retry behavior (the Airflow sketch after this list shows one shape).
- Expose data quality status to finance and operations teams before downstream tables are marked ready.
- Define monitoring, alerting, and failure recovery for ingestion, transformation, and load stages.
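One way to satisfy the idempotency and deduplication requirements together is to derive a deterministic key for every record and load through a keyed merge instead of an append, so replayed batches converge on the same rows. This is a minimal illustration, not a prescribed design; the column names (`source_system`, `source_id`, `event_type`) and the `staging`/`curated` table names are assumptions.

```python
import hashlib

def record_key(source_system: str, source_id: str, event_type: str) -> str:
    """Deterministic key: the same source record always hashes to the
    same value, so reruns and backfills overwrite rather than duplicate."""
    raw = f"{source_system}|{source_id}|{event_type}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()

# Hypothetical Snowflake MERGE: rerunning a failed or late batch updates
# matched rows and inserts only genuinely new ones.
MERGE_SQL = """
MERGE INTO curated.transactions t
USING staging.transactions_batch s
    ON t.record_key = s.record_key
WHEN MATCHED THEN UPDATE SET
    amount = s.amount,
    status = s.status,
    updated_at = s.extracted_at
WHEN NOT MATCHED THEN INSERT
    (record_key, source_system, source_id, event_type,
     amount, status, updated_at)
    VALUES (s.record_key, s.source_system, s.source_id, s.event_type,
            s.amount, s.status, s.extracted_at);
"""
```

Because the key is a pure function of source identity, the same MERGE can serve normal loads, retries, and backfills without special-case code paths.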
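For the source-to-warehouse reconciliation control, one minimal approach is to compare row counts and summed amounts per (source, business date) and refuse to publish on any discrepancy. The tuple shapes and zero tolerance below are illustrative assumptions.

```python
from decimal import Decimal

def reconcile(source_totals: dict, warehouse_totals: dict,
              amount_tolerance: Decimal = Decimal("0.00")) -> list[str]:
    """Compare per-(source, business_date) counts and amount sums.
    Returns human-readable discrepancies; an empty list means clean."""
    problems = []
    for key, (src_count, src_amount) in source_totals.items():
        wh_count, wh_amount = warehouse_totals.get(key, (0, Decimal("0")))
        if src_count != wh_count:
            problems.append(f"{key}: count {src_count} != {wh_count}")
        if abs(src_amount - wh_amount) > amount_tolerance:
            problems.append(f"{key}: amount {src_amount} != {wh_amount}")
    # Keys present only in the warehouse are also a red flag.
    for key in warehouse_totals.keys() - source_totals.keys():
        problems.append(f"{key}: in warehouse but not in source")
    return problems
```

Run as its own task that alerts on any non-empty result, this turns silent loss into a visible failure: a dropped partner file surfaces as a count mismatch rather than a quietly smaller balance.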
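Lineage can start simply: stamp every raw record with where it came from and which run produced it, and carry those columns through every transformation into the curated tables. The field names below are one plausible shape, not a mandated schema.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class LineageStamp:
    """Columns carried on every row from raw landing to curated tables."""
    batch_id: str        # orchestrator run that extracted the record
    source_system: str   # e.g. "pg_payments", "processor_api", "s3_partner"
    source_object: str   # table name, API endpoint, or S3 object key
    extracted_at: datetime

def new_stamp(batch_id: str, source_system: str,
              source_object: str) -> LineageStamp:
    return LineageStamp(batch_id, source_system, source_object,
                        datetime.now(timezone.utc))
```

With these columns in place, an auditor can trace any curated row back to the exact extract that produced it.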
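Given the team's Airflow strength, the dependency, retry, and publish-gating requirements map naturally onto a DAG with a short-circuit quality gate: downstream tables are only marked ready if the gate passes. A schematic sketch, assuming Airflow 2.4+; task bodies and names are placeholders.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator, ShortCircuitOperator

def _load_batch(**_):      # placeholder extract/load step
    ...

def _run_transforms(**_):  # placeholder transformation step
    ...

def _quality_gate(**_):
    """Return True only when schema, dedup, and reconciliation checks pass;
    ShortCircuitOperator skips publish when this returns False."""
    return True  # placeholder: wire to real check results

def _publish(**_):         # marks curated tables ready for consumers
    ...

with DAG(
    dag_id="finledger_curated_load",
    start_date=datetime(2024, 1, 1),
    schedule="@hourly",  # micro-batch cadence; nightly DAGs would differ
    catchup=False,
    default_args={
        "retries": 3,
        "retry_delay": timedelta(minutes=5),
        "retry_exponential_backoff": True,
    },
):
    load = PythonOperator(task_id="load_batch", python_callable=_load_batch)
    transform = PythonOperator(task_id="run_transforms",
                               python_callable=_run_transforms)
    gate = ShortCircuitOperator(task_id="quality_gate",
                                python_callable=_quality_gate)
    publish = PythonOperator(task_id="publish_tables",
                             python_callable=_publish)

    load >> transform >> gate >> publish
```

The gate doubles as the data quality status signal: a skipped publish task is an unambiguous "not ready" for finance and operations consumers.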
Constraints
- AWS is the required cloud environment.
- Incremental budget is capped at $35K/month.
- PCI-related payment data must be encrypted in transit and at rest.
- Finance requires reproducible month-end reruns using historical raw snapshots (see the sketch after this list).
- The team has strong SQL/Airflow skills, moderate Spark experience, and no appetite for a fully custom platform.
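The reproducibility constraint is typically met by treating the raw S3 layer as immutable, date-partitioned snapshots and keying every rerun off a logical date rather than "latest". A sketch under that assumption; the bucket layout and prefix scheme are hypothetical.

```python
import boto3

def raw_snapshot_keys(bucket: str, source: str, ingest_date: str) -> list[str]:
    """List the immutable raw objects for one source and one logical date,
    e.g. ingest_date='2024-03-31' for a March month-end rerun."""
    prefix = f"raw/source={source}/ingest_date={ingest_date}/"
    paginator = boto3.client("s3").get_paginator("list_objects_v2")
    keys: list[str] = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return keys
```

A month-end rerun then re-executes the same transformations against exactly those keys, so results do not drift as new data lands.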