Context
FinSight, a mid-market fintech company, receives daily settlement, chargeback, and fee files from 18 payment partners. Today, analysts manually download CSV and Excel files from SFTP servers and email attachments, clean them in Python notebooks, and load the results into Snowflake. The process takes 4-5 hours per day, causes frequent schema errors, and delays finance reporting.
You need to design an automated pipeline that replaces this manual workflow with a reliable, auditable ingestion and transformation system on AWS.
Scale Requirements
- Sources: 18 vendors, 120-180 files/day
- File sizes: 10 MB to 4 GB per file
- Daily volume: ~350 GB of raw files
- Retention: ~8 TB kept queryable over 24 months
- Latency target: Files queryable in Snowflake within 15 minutes of arrival
- Data quality SLA: >99.5% successful loads, duplicate rate <0.1%
- Backfill requirement: Reprocess any vendor/date range from the last 2 years (a raw-layout sketch that supports this follows the list)
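One assumption that makes the 2-year backfill tractable is keying the raw layer by vendor and arrival date from day one. The bucket layout and prefix convention below are illustrative, not part of the requirements:

```python
import boto3

s3 = boto3.client("s3")

# Assumed raw-layer convention:
#   s3://<bucket>/raw/vendor=<id>/dt=<YYYY-MM-DD>/<original filename>
def list_backfill_keys(bucket: str, vendor: str, dates: list[str]) -> list[str]:
    """Enumerate raw objects for a vendor/date range so they can be replayed."""
    keys = []
    paginator = s3.get_paginator("list_objects_v2")
    for dt in dates:
        prefix = f"raw/vendor={vendor}/dt={dt}/"
        for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
            keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return keys
```

Because raw files are immutable and addressable this way, a backfill is simply re-running the normal pipeline over an enumerated key list.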
Requirements
- Build an automated ingestion process for SFTP drops and email-delivered attachments into a centralized raw storage layer.
- Standardize heterogeneous CSV/XLSX formats into a canonical schema for settlements, chargebacks, and fees (a column-mapping sketch follows this list).
- Ensure idempotent processing so the same file is never loaded twice, even if a vendor re-sends it (see the checksum-ledger sketch below).
- Add data quality checks for schema drift, row-count anomalies, null spikes, and invalid currency or date fields (see the validation sketch below).
- Orchestrate ingestion, transformation, retries, and backfills with clear dependency management (a minimal DAG follows below).
- Load curated tables into Snowflake for finance and operations reporting (a COPY INTO example follows below).
- Provide operational monitoring, alerting, and auditability for every file processed.
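For schema standardization, one lightweight approach is a per-vendor column map applied before any transformation. Everything named below (the vendor key, the column names) is illustrative:

```python
import pandas as pd

# Per-vendor column maps; in practice these would live in versioned
# config (e.g., YAML in git) rather than in code.
VENDOR_COLUMN_MAPS = {
    "acme_pay": {
        "SettleID": "settlement_id",
        "Amt": "amount",
        "Ccy": "currency",
        "Date": "settled_at",
    },
}

def to_canonical(df: pd.DataFrame, vendor: str) -> pd.DataFrame:
    """Rename vendor-specific headers to the canonical settlement schema."""
    mapping = VENDOR_COLUMN_MAPS[vendor]
    out = df.rename(columns=mapping)
    missing = set(mapping.values()) - set(out.columns)
    if missing:
        raise ValueError(f"{vendor}: missing canonical columns {sorted(missing)}")
    return out[sorted(mapping.values())]  # fixed column order for downstream loads
```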
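For idempotent processing, a common pattern is a content-hash ledger with an atomic conditional write. A minimal sketch, assuming a DynamoDB table (here called `file_ledger`, a hypothetical name) keyed on the SHA-256 of the file contents:

```python
import hashlib
import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.resource("dynamodb")
ledger = dynamodb.Table("file_ledger")  # hypothetical table name

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash in chunks so multi-GB files never sit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def claim_file(digest: str, s3_key: str) -> bool:
    """Atomically claim a checksum; False means a duplicate delivery."""
    try:
        ledger.put_item(
            Item={"checksum": digest, "s3_key": s3_key},
            ConditionExpression="attribute_not_exists(checksum)",
        )
        return True  # first time this exact content has been seen
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # same bytes already processed, even under a new filename
        raise
```

Hashing the contents rather than the filename is what catches a vendor re-sending the same data under a new name, which keeps the duplicate rate under the 0.1% SLA.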
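For the data quality checks, a purpose-built tool (e.g., Great Expectations) may be the better fit at this scale, but a hand-rolled pandas pass shows the four classes of check the requirement names. Column names and thresholds below are illustrative assumptions:

```python
import pandas as pd

EXPECTED_COLUMNS = {"settlement_id", "vendor_id", "amount", "currency", "settled_at"}
VALID_CURRENCIES = {"USD", "EUR", "GBP"}  # illustrative subset

def check_file(df: pd.DataFrame, trailing_median_rows: float) -> list[str]:
    """Return human-readable failures; an empty list means the file passes."""
    failures = []
    # Schema drift: any added or missing column fails the file fast.
    drift = EXPECTED_COLUMNS.symmetric_difference(df.columns)
    if drift:
        failures.append(f"schema drift: {sorted(drift)}")
        return failures
    # Row-count anomaly: flag files far outside the vendor's recent norm.
    if trailing_median_rows and not 0.5 <= len(df) / trailing_median_rows <= 2.0:
        failures.append(f"row count {len(df)} vs median {trailing_median_rows:.0f}")
    # Null spike: amounts should essentially never be missing.
    null_rate = df["amount"].isna().mean()
    if null_rate > 0.001:
        failures.append(f"null rate on amount: {null_rate:.2%}")
    # Invalid currency and date fields.
    bad_ccy = ~df["currency"].isin(VALID_CURRENCIES)
    if bad_ccy.any():
        failures.append(f"{bad_ccy.sum()} rows with unknown currency")
    if pd.to_datetime(df["settled_at"], errors="coerce").isna().any():
        failures.append("unparseable settled_at values")
    return failures
```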
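For orchestration, managed Airflow (Amazon MWAA) is one option that keeps the platform burden low for a 3-engineer team. The DAG below is a sketch (Airflow 2.4+ syntax) with the step bodies stubbed out; the DAG name and step names are assumptions:

```python
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.python import PythonOperator

# Stub standing in for the real ingest/validate/transform/load logic.
def _todo(step: str):
    print(f"{step}: not implemented in this sketch")

with DAG(
    dag_id="finsight_settlements",  # hypothetical name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=True,  # lets `airflow dags backfill` replay any historical range
    default_args={"retries": 3, "retry_delay": timedelta(minutes=5)},
) as dag:
    ingest = PythonOperator(task_id="ingest", python_callable=_todo, op_args=["ingest"])
    validate = PythonOperator(task_id="validate", python_callable=_todo, op_args=["validate"])
    transform = PythonOperator(task_id="transform", python_callable=_todo, op_args=["transform"])
    load = PythonOperator(task_id="load_snowflake", python_callable=_todo, op_args=["load"])

    ingest >> validate >> transform >> load  # explicit dependency chain
```

Per-task retries handle transient vendor and network failures, while the dependency chain guarantees nothing reaches Snowflake before validation passes.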
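For the Snowflake load, Snowpipe auto-ingest is the likelier fit for the 15-minute latency target, but the same load expressed as an orchestrated task looks roughly like this. All connection parameters and object names are illustrative, and it assumes an external stage (`@raw_stage`) over the curated S3 prefix:

```python
import snowflake.connector

conn = snowflake.connector.connect(
    account="finsight-xy12345",
    user="PIPELINE_LOADER",
    password="...",  # in practice, fetched from AWS Secrets Manager
    warehouse="LOAD_WH",
    database="FINANCE",
    schema="CURATED",
)
conn.cursor().execute("""
    COPY INTO settlements
    FROM @raw_stage/settlements/              -- external stage over S3
    FILE_FORMAT = (TYPE = PARQUET)
    MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE   -- tolerant of column-order changes
""")
```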
Constraints
- AWS is the required cloud environment.
- The team has 3 data engineers and cannot maintain a large custom platform.
- Budget for incremental infrastructure is capped at $18K/month.
- Some files contain PII and must be encrypted at rest and in transit.
- Finance requires a full processing audit trail for every file: who sent it, when it arrived, its checksum, row counts, and load status (one possible record shape is sketched below).
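The audit-trail constraint maps naturally to one record per inbound file, written at arrival and updated at each stage. A sketch of the fields finance asks for, with all names assumed:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class FileAuditRecord:
    """One audit row per inbound file."""
    checksum: str    # SHA-256 of the raw bytes (doubles as the dedupe key)
    vendor: str      # who sent it: SFTP username or email sender address
    source: str      # "sftp" or "email"
    raw_s3_key: str  # immutable copy in the raw layer
    arrived_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )
    row_count: int | None = None   # filled in after parsing
    load_status: str = "received"  # received -> validated -> loaded | failed
    error: str | None = None

# dataclasses.asdict(record) yields a dict ready for a DynamoDB item
# or a row in an audit table.
```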
In your design, explain the end-to-end architecture, the orchestration strategy, schema management, the data quality framework, and the metrics you would use to measure the impact of automating this manual process.