You are building a new analytics pipeline for a SaaS platform that stores billing and account data in a transactional relational database, while product activity and user preferences live in a document store. Finance and product teams are reporting conflicting customer metrics because the current daily exports flatten each source differently and drop nested attributes. Leadership wants a Snowflake-based pipeline that preserves the strengths of both source models while producing consistent analytics tables. The immediate trigger is an executive escalation after monthly active customer counts differed by 8% across dashboards.
| Component | Status / Technology |
|---|---|
| Transactional source | PostgreSQL with normalized customer, subscription, and invoice tables |
| Operational event source | MongoDB with nested user profile and activity documents |
| Ingestion | Nightly Python scripts dumping CSV and JSON to Snowflake internal stages |
| Transformation | Ad hoc SQL scripts run manually in Snowflake worksheets |
| Warehouse | Snowflake raw and reporting schemas |
| Orchestration | Basic cron jobs on a VM |
Scale: ~250 GB/day new data, 120M PostgreSQL rows, 400M MongoDB documents, nightly batch window of 3 hours, dashboard freshness target of under 30 minutes after load completion.
How would you redesign this pipeline in Snowflake so relational and non-relational data can be ingested, modeled, and reconciled consistently for analytics? Explain how you would handle schema differences, incremental processing, and data quality so downstream teams get one trusted set of customer metrics.