You are building a new analytics pipeline for a marketplace platform that stores orders in a transactional relational database and product catalog metadata in a document database. Finance and operations reports currently disagree because structured order facts are loaded nightly while semi-structured catalog attributes are flattened inconsistently by different teams. Leadership wants a single Snowflake model that preserves transactional integrity from the relational source while still supporting flexible ingestion of evolving non-relational fields. The immediate pain point is a month-end reconciliation issue caused by missing and duplicated product attributes in downstream reporting.
| Component | Status |
|---|---|
| Order System | PostgreSQL 14, normalized OLTP schema |
| Catalog System | MongoDB 6.0, nested JSON documents |
| Ingestion | Nightly exports to cloud object storage |
| Warehouse | Snowflake raw and curated databases |
| Transformations | Snowflake Tasks + dbt incremental models |
| Data Quality | Manual SQL checks, no centralized alerts |
Scale: ~120M order rows, ~15M product documents, 8M daily order changes, 500K daily catalog updates, 2-hour freshness target, 2-year history retention.
How would you design the Snowflake-centric ELT pipeline so that relational order data and non-relational catalog data can be ingested, modeled, and reconciled consistently for analytics? Explain how your design would handle schema evolution, joins between structured and semi-structured data, and data quality validation at this scale.