Context
DataCorp, a leading CRM platform, is migrating its customer data from a legacy SQL Server database to a modern cloud data warehouse (Snowflake) to improve scalability and analytics capabilities. The current ETL processes are batch-oriented, and the migration must ensure that no data is lost or corrupted during the transition.
Scale Requirements
- Data Volume: 10 TB of customer records, including 5 million rows of transactional data.
- Batch Size: Migrate in batches of 100,000 rows to balance throughput and resource usage.
- Migration Window: Complete the full migration within 48 hours.
- Data Quality: At least 99.9% of migrated rows must match the source exactly (accuracy and consistency).
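The 100,000-row batching could be implemented with keyset pagination on an indexed key column, which stays fast on large tables where `OFFSET` does not. A minimal sketch follows; `extract_batches` and the table/column names are illustrative, and `sqlite3` stands in for the SQL Server connection (with `pyodbc` the pattern is the same, though SQL Server uses `TOP` or `OFFSET-FETCH` rather than `LIMIT`):

```python
import sqlite3

BATCH_SIZE = 100_000  # per the scale requirements

def extract_batches(conn, table, key_col, batch_size=BATCH_SIZE):
    """Yield batches of rows via keyset pagination.

    Keyset pagination (WHERE key > last_key ORDER BY key) avoids the
    rescanning cost of OFFSET-based paging on multi-terabyte tables.
    Assumes key_col is unique, indexed, and the first selected column.
    """
    last_key = None
    while True:
        if last_key is None:
            rows = conn.execute(
                f"SELECT * FROM {table} ORDER BY {key_col} LIMIT ?",
                (batch_size,),
            ).fetchall()
        else:
            rows = conn.execute(
                f"SELECT * FROM {table} WHERE {key_col} > ? "
                f"ORDER BY {key_col} LIMIT ?",
                (last_key, batch_size),
            ).fetchall()
        if not rows:
            return
        last_key = rows[-1][0]  # key_col assumed to be column 0
        yield rows
```

Each batch can then be written to a stage file and loaded into Snowflake; because batches never overlap, a failed batch can be retried in isolation without disturbing the 48-hour schedule.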
Requirements
- Design an ETL pipeline that extracts data from SQL Server, transforms it to match Snowflake schema, and loads it into Snowflake.
- Implement data validation checks during extraction and loading phases to ensure data integrity (e.g., checksums, row counts).
- Create a rollback mechanism to revert to the previous state in case of migration failure.
- Maintain a log of migrated records, including success and failure reasons.
- Ensure that the pipeline can handle incremental changes if the source data is updated during the migration.
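One way to realize the checksum and row-count checks above is an order-independent batch checksum: hash each row, then XOR the digests, so the result does not depend on the order in which SQL Server and Snowflake return rows. The function names here are illustrative, not part of any library:

```python
import hashlib

def row_fingerprint(row):
    """Stable per-row digest; row is a tuple of primitive values."""
    payload = "|".join("" if v is None else str(v) for v in row)
    return hashlib.sha256(payload.encode("utf-8")).digest()

def batch_checksum(rows):
    """Order-independent checksum: XOR of per-row SHA-256 digests."""
    acc = bytes(32)
    for row in rows:
        acc = bytes(a ^ b for a, b in zip(acc, row_fingerprint(row)))
    return acc.hex()

def validate_batch(source_rows, target_rows):
    """Row-count and checksum comparison for one migrated batch."""
    if len(source_rows) != len(target_rows):
        return False, "row count mismatch"
    if batch_checksum(source_rows) != batch_checksum(target_rows):
        return False, "checksum mismatch"
    return True, "ok"
```

One caveat of the XOR construction: a row duplicated an even number of times cancels itself out, which is why the row-count check runs first.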
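For the rollback and logging requirements, one common pattern is to load into a Snowflake staging table and swap it in only after all batches validate, so rollback reduces to dropping the staging table (Snowflake's Time Travel is another option for reverting a bad load). The migration log should live outside the target so it survives a rollback; a minimal sketch using `sqlite3` as that durable store (the class and schema are assumptions, not an existing API):

```python
import datetime
import sqlite3

class MigrationLog:
    """Append-only record of per-batch outcomes, kept outside the
    target warehouse so it survives a rollback of the load."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS batch_log ("
            " batch_id TEXT, status TEXT, reason TEXT, logged_at TEXT)"
        )

    def record(self, batch_id, status, reason=""):
        """status: 'ok' or 'failed'; reason explains failures."""
        now = datetime.datetime.now(datetime.timezone.utc).isoformat()
        self.db.execute(
            "INSERT INTO batch_log VALUES (?, ?, ?, ?)",
            (batch_id, status, reason, now),
        )
        self.db.commit()

    def failed_batches(self):
        """Batch ids to retry (or to justify a full rollback)."""
        return [r[0] for r in self.db.execute(
            "SELECT batch_id FROM batch_log WHERE status = 'failed'")]
```

After a failure, `failed_batches()` drives either targeted retries or the decision to drop the staging table and restart.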
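Incremental changes during the 48-hour window can be handled with a watermark-based delta pass: record a high-water mark before the bulk copy starts, then re-extract only rows modified after it. This sketch assumes the source table carries a reliable last-modified column (`ts_col`); `sqlite3` again stands in for SQL Server:

```python
import sqlite3

def changed_since(conn, table, ts_col, watermark):
    """Rows modified after the watermark, for a catch-up delta pass.

    Each pass should record its own start time as the next watermark
    *before* reading, so updates landing mid-pass are not skipped.
    """
    return conn.execute(
        f"SELECT * FROM {table} WHERE {ts_col} > ? ORDER BY {ts_col}",
        (watermark,),
    ).fetchall()
```

Repeating delta passes until the change volume is near zero, then taking a brief write freeze for the final pass, bounds the data loss risk without pausing the source for the whole migration.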
Constraints
- Infrastructure: Limited to existing SQL Server and Snowflake environments with no additional tools allowed.
- Budget: Must stay within a $5,000 budget for cloud resources during the migration.
- Compliance: Ensure adherence to GDPR by anonymizing sensitive data during migration.
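The GDPR constraint could be met during the transform phase with a keyed hash over sensitive fields. Strictly speaking this is pseudonymization, which GDPR still treats as personal data while the key exists; destroying the key after migration moves it toward true anonymization. A sketch (the `PEPPER` value is a placeholder, not a real secret, and would come from a secrets manager):

```python
import hashlib
import hmac

# Placeholder: in practice, fetch from a secrets manager and destroy
# after migration to make the mapping irreversible.
PEPPER = b"replace-with-secret-from-vault"

def pseudonymize(value: str) -> str:
    """Deterministic keyed hash (HMAC-SHA256) of a sensitive field.

    Equal inputs map to equal tokens, so joins across migrated tables
    (e.g. on email) still work, while the raw value never reaches
    Snowflake.
    """
    return hmac.new(PEPPER, value.encode("utf-8"),
                    hashlib.sha256).hexdigest()
```

An unkeyed hash would be vulnerable to dictionary attacks on low-entropy fields like email addresses, which is why the HMAC key matters here.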