Context
FinTechCorp, a financial services company, processes millions of transactions daily. Their ETL pipeline currently uses at-least-once delivery semantics, which produces duplicate entries in their Snowflake data warehouse and complicates downstream analytics. The VP of Data Engineering has mandated a redesign to achieve exactly-once processing semantics, improving data integrity and reliability.
Scale Requirements
- Throughput: 1M transactions per second during peak hours.
- Latency: Data should be available for querying in Snowflake within 2 minutes of ingestion.
- Storage: Approximately 20 TB of data daily, requiring efficient storage management and cost control.
Functional Requirements
- Implement a streaming ingestion layer that guarantees exactly-once delivery of transaction records.
- Utilize a message broker (e.g., Apache Kafka) with idempotent producers to prevent broker-side duplicates; since idempotence only covers retries within a single producer session, pair it with transactions and an idempotent warehouse load for end-to-end guarantees (see the producer sketch after this list).
- Design a transformation layer that applies business logic while maintaining data integrity (a transactional read-process-write sketch follows this list).
- Ensure data is loaded into Snowflake idempotently, with deduplication (e.g., a MERGE keyed on transaction id, sketched below) and error handling mechanisms.
- Establish monitoring and alerting for data quality issues, including duplicate detection and processing failures (see the duplicate-check sketch below).
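To make the ingestion requirements concrete, here is a minimal sketch of an idempotent, transactional producer using the confluent_kafka Python client. The broker addresses, the `transactions` topic, the `transaction_id` field, and the transactional id are illustrative assumptions, not details from the brief.

```python
import json
from confluent_kafka import Producer, KafkaException

# All connection details and names below are placeholder assumptions.
producer = Producer({
    "bootstrap.servers": "broker1:9092,broker2:9092",
    "enable.idempotence": True,          # broker drops duplicate retries per session
    "acks": "all",                       # required for idempotent delivery
    "transactional.id": "txn-ingest-1",  # stable id lets the broker fence zombie producers
})
producer.init_transactions()

def publish_batch(records: list[dict]) -> None:
    """Publish a batch atomically: every record commits, or none do."""
    producer.begin_transaction()
    try:
        for rec in records:
            # Keying by transaction_id pins each id to one partition,
            # which keeps downstream deduplication partition-local.
            producer.produce("transactions",
                             key=rec["transaction_id"],
                             value=json.dumps(rec))
        producer.commit_transaction()
    except KafkaException:
        producer.abort_transaction()  # read_committed consumers never see aborted data
        raise
```

Setting a `transactional.id` already implies idempotence; the explicit flag is kept here for readability.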
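For the transformation layer, the standard Kafka exactly-once pattern is a transactional read-process-write loop, where consumer offsets commit in the same transaction as the produced output. The topic names and `apply_business_logic` are hypothetical stand-ins; in practice the team might prefer Kafka Streams or Flink over hand-rolling this loop.

```python
from confluent_kafka import Consumer, Producer, KafkaException

consumer = Consumer({
    "bootstrap.servers": "broker1:9092",
    "group.id": "txn-transform",
    "isolation.level": "read_committed",  # ignore records from aborted transactions
    "enable.auto.commit": False,          # offsets are committed inside the transaction
})
consumer.subscribe(["transactions"])

producer = Producer({
    "bootstrap.servers": "broker1:9092",
    "transactional.id": "txn-transform-1",
})
producer.init_transactions()

def apply_business_logic(payload: bytes) -> bytes:
    return payload  # hypothetical transform; stands in for real enrichment rules

while True:
    msgs = consumer.consume(num_messages=500, timeout=1.0)
    if not msgs:
        continue
    producer.begin_transaction()
    try:
        for msg in msgs:
            if msg.error():
                raise KafkaException(msg.error())
            producer.produce("transactions_enriched",
                             key=msg.key(), value=apply_business_logic(msg.value()))
        # Commit input offsets atomically with the output records, so a crash
        # can never replay inputs whose outputs were already committed.
        producer.send_offsets_to_transaction(
            consumer.position(consumer.assignment()),
            consumer.consumer_group_metadata())
        producer.commit_transaction()
    except KafkaException:
        producer.abort_transaction()
```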
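On the load side, one way to satisfy the deduplication requirement is to land micro-batches in a staging table and MERGE them into the target keyed on the transaction id, which makes replays harmless. This sketch assumes the snowflake-connector-python package; the credentials, the `raw_stage` and `transactions` tables, and their columns are placeholders.

```python
import snowflake.connector

# Credentials and object names are placeholders for the real environment.
conn = snowflake.connector.connect(
    account="fintechcorp", user="etl_loader", password="...",
    warehouse="ETL_WH", database="ANALYTICS", schema="PUBLIC",
)

# MERGE keyed on transaction_id makes the load idempotent: rows already
# present in the target are skipped, so re-running a batch inserts nothing.
MERGE_SQL = """
MERGE INTO transactions AS t
USING (
    SELECT transaction_id, amount, currency, event_ts
    FROM raw_stage
    QUALIFY ROW_NUMBER() OVER (
        PARTITION BY transaction_id ORDER BY event_ts DESC) = 1
) AS s
ON t.transaction_id = s.transaction_id
WHEN NOT MATCHED THEN
    INSERT (transaction_id, amount, currency, event_ts)
    VALUES (s.transaction_id, s.amount, s.currency, s.event_ts)
"""

with conn.cursor() as cur:
    cur.execute(MERGE_SQL)
    # Clearing raw_stage (and routing rejects) would follow as a separate step.
```

The inner QUALIFY also collapses duplicates within a batch, keeping only the latest version of each transaction id before the merge.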
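Finally, the monitoring requirement can start with a scheduled probe that verifies the exactly-once invariant directly in the warehouse. This reuses the connection from the previous sketch; the alert hook is a placeholder for whatever pager or chat integration the team runs.

```python
def count_duplicate_ids(conn) -> int:
    """Count transaction_ids appearing more than once; should always be 0."""
    with conn.cursor() as cur:
        cur.execute("""
            SELECT COUNT(*) FROM (
                SELECT transaction_id
                FROM transactions
                GROUP BY transaction_id
                HAVING COUNT(*) > 1
            ) AS dupes
        """)
        (dup_count,) = cur.fetchone()
    return dup_count

dups = count_duplicate_ids(conn)
if dups > 0:
    # Placeholder alert: wire to PagerDuty/Slack/CloudWatch in practice.
    raise RuntimeError(f"exactly-once invariant violated: {dups} duplicate ids")
```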
Constraints
- Team: 5 data engineers with limited Kafka experience.
- Infrastructure: AWS-based, leveraging existing resources (EC2, S3, Snowflake).
- Budget: $30K/month for cloud services, including storage and compute.