Context
FinEdge, a payments analytics company, receives customer transaction data from two upstream systems: a PostgreSQL operational database replicated hourly and a partner SFTP feed delivered every 15 minutes as CSV. Analysts currently see conflicting values for transaction status, amount, and settlement timestamp in Snowflake, and there is no formal reconciliation layer. You need to design a pipeline that detects discrepancies, applies deterministic resolution rules, and produces a trusted canonical transactions table.
Scale Requirements
- Sources: PostgreSQL CDC (replicated hourly) + SFTP batch files (CSV, delivered every 15 minutes)
- Volume: 25M transactions/day, 300 GB raw/day
- Peak ingestion: 8K records/sec during settlement windows
- Latency target: canonical table updated within 10 minutes of source arrival
- Retention: raw data 180 days, reconciled history 7 years
- Accuracy target: >99.95% correctly reconciled records, full audit trail for 100% of conflicts
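As a rough sanity check on these figures, the sketch below derives the average ingest rate, the average record size, the peak-to-average ratio, the raw-zone footprint, and the daily error budget implied by the accuracy target. It assumes decimal gigabytes and an even spread of transactions across the day. Neither assumption is stated in the requirements, and the variable names are illustrative.

```python
# Back-of-envelope sizing from the stated scale figures.
SECONDS_PER_DAY = 86_400

transactions_per_day = 25_000_000
raw_bytes_per_day = 300 * 10**9        # assumption: decimal GB
peak_records_per_sec = 8_000
raw_retention_days = 180
accuracy_target = 0.9995

# Average rate if traffic were spread evenly over 24 hours.
avg_records_per_sec = transactions_per_day / SECONDS_PER_DAY

# Average raw payload size per transaction.
avg_bytes_per_record = raw_bytes_per_day / transactions_per_day

# How far the settlement-window peak exceeds the daily average.
peak_to_avg_ratio = peak_records_per_sec / avg_records_per_sec

# Raw zone size at steady state, before compression.
raw_retained_tb = raw_bytes_per_day * raw_retention_days / 10**12

# Upper bound on misreconciled records per day at the accuracy target.
max_bad_records_per_day = round(transactions_per_day * (1 - accuracy_target))

print(f"average rate:     {avg_records_per_sec:,.0f} records/sec")
print(f"average record:   {avg_bytes_per_record:,.0f} bytes")
print(f"peak / average:   {peak_to_avg_ratio:.1f}x")
print(f"raw zone (180 d): {raw_retained_tb:,.0f} TB")
print(f"error budget:     {max_bad_records_per_day:,} records/day")
```

The peak is well over an order of magnitude above the average rate, which suggests sizing for bursts with elastic or serverless compute rather than a cluster provisioned for peak around the clock.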
Requirements
- Ingest both sources into a raw zone without overwriting original payloads.
- Standardize schemas, data types, and time zones before comparison.
- Detect conflicts on key fields such as transaction_id, status, amount, and settled_at.
- Implement resolution rules, for example source priority, latest event timestamp, and field-level trust scores.
- Persist both the canonical record and a reconciliation audit table showing source values, chosen value, rule applied, and processing timestamp.
- Ensure the pipeline is idempotent so reruns and backfills do not create duplicate canonical records.
- Orchestrate batch and near-real-time jobs with dependency management, retries, and SLA monitoring.
- Expose data quality metrics and unresolved conflicts for analyst review.
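The detection, resolution, and audit requirements above can be sketched as a single pure function. This is a minimal illustration, not a prescribed design. It assumes both records are already standardized and matched on transaction_id. The names `SourceRecord`, `reconcile`, `COMPARED_FIELDS`, `LATEST_WINS_FIELDS`, and `SOURCE_PRIORITY` are hypothetical, and so is the choice of which rule applies to which field. Field-level trust scores are left out for brevity.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from decimal import Decimal
from typing import Any

# Hypothetical rule configuration, not taken from the requirements.
COMPARED_FIELDS = ("status", "amount", "settled_at")
LATEST_WINS_FIELDS = {"status", "settled_at"}   # newest event wins
SOURCE_PRIORITY = ("postgres", "sftp")          # earlier entry wins


@dataclass
class SourceRecord:
    source: str                # "postgres" or "sftp"
    transaction_id: str
    event_ts: datetime         # assumed already normalized to UTC
    fields: dict[str, Any]     # assumed already standardized


def reconcile(a: SourceRecord, b: SourceRecord, processed_at: datetime):
    """Return (canonical_row, audit_rows) for two records of one transaction."""
    if a.transaction_id != b.transaction_id:
        raise ValueError("records must share a transaction_id")
    canonical: dict[str, Any] = {"transaction_id": a.transaction_id}
    audit: list[dict[str, Any]] = []
    for name in COMPARED_FIELDS:
        va, vb = a.fields.get(name), b.fields.get(name)
        if va == vb:
            canonical[name] = va          # no conflict, no audit row
            continue
        if name in LATEST_WINS_FIELDS and a.event_ts != b.event_ts:
            winner = a if a.event_ts > b.event_ts else b
            rule = "latest_event_timestamp"
        else:
            # Fallback and tie-breaker: fixed source ordering.
            winner = min((a, b), key=lambda r: SOURCE_PRIORITY.index(r.source))
            rule = "source_priority"
        canonical[name] = winner.fields.get(name)
        audit.append({
            "transaction_id": a.transaction_id,
            "field": name,
            "source_values": {a.source: va, b.source: vb},
            "chosen_value": canonical[name],
            "rule_applied": rule,
            "processed_at": processed_at,
        })
    return canonical, audit


# Example: the SFTP record is newer, so it wins on status;
# amount falls back to source priority, so PostgreSQL wins.
utc = timezone.utc
settled = datetime(2024, 1, 1, 11, 58, tzinfo=utc)
pg = SourceRecord("postgres", "tx-1", datetime(2024, 1, 1, 12, 0, tzinfo=utc),
                  {"status": "PENDING", "amount": Decimal("10.00"), "settled_at": settled})
sftp = SourceRecord("sftp", "tx-1", datetime(2024, 1, 1, 12, 5, tzinfo=utc),
                    {"status": "SETTLED", "amount": Decimal("10.01"), "settled_at": settled})
canonical, audit = reconcile(pg, sftp, processed_at=datetime(2024, 1, 1, 12, 6, tzinfo=utc))
```

Because the function reads no clock and holds no state, the same inputs always yield the same canonical row and audit rows. Combined with a keyed upsert on transaction_id in the warehouse, that is one way reruns and backfills could avoid creating duplicates.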
Constraints
- AWS-first environment with existing Airflow, dbt, and Snowflake deployments
- Small team: 3 data engineers, no dedicated platform engineer
- SOX compliance: every change to financial records must be explainable and reproducible
- Budget cap: prefer managed services over large always-on clusters
- Upstream systems cannot be modified; late files and duplicate extracts are common
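Since duplicate extracts are common and the upstream systems cannot be changed, one option is to fingerprint each delivered file by content and skip payloads that have already been loaded. The sketch below is a hedged illustration. `ExtractLedger` and `content_fingerprint` are hypothetical names, and the in-memory dictionary stands in for what would more plausibly be a durable ledger table.

```python
import hashlib


def content_fingerprint(payload: bytes) -> str:
    """SHA-256 of the raw bytes, so a re-sent file matches regardless of its name."""
    return hashlib.sha256(payload).hexdigest()


class ExtractLedger:
    """Tracks which payloads have been accepted (in-memory stand-in for a table)."""

    def __init__(self) -> None:
        self._seen: dict[str, str] = {}   # fingerprint -> first file name

    def register(self, file_name: str, payload: bytes) -> bool:
        """Return True if the payload is new, False if it duplicates an earlier one."""
        fingerprint = content_fingerprint(payload)
        if fingerprint in self._seen:
            return False
        self._seen[fingerprint] = file_name
        return True


ledger = ExtractLedger()
first = ledger.register("txns_1200.csv", b"id,amount\ntx-1,10.00\n")
resend = ledger.register("txns_1200_resend.csv", b"id,amount\ntx-1,10.00\n")
other = ledger.register("txns_1215.csv", b"id,amount\ntx-2,5.00\n")
```

This catches only byte-identical resends. Overlapping extracts with different contents, and late files, would still need to be absorbed by record-level deduplication on transaction_id downstream.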