Context
ShopSphere, a global marketplace, stores transactional order data in a single PostgreSQL instance that now struggles with write contention and long-running ETL extracts. The platform plans to shard the operational database by customer_id across multiple PostgreSQL clusters, and the data engineering team must redesign downstream pipelines so analytics and finance reporting continue to work without data loss or duplicate processing.
You are asked to design a shard-aware batch and near-real-time ingestion pipeline that explains how database sharding changes extraction, transformation, reconciliation, and monitoring. The solution should preserve a unified analytical model in Snowflake while handling shard growth over time.
Scale Requirements
- Operational traffic: 45K writes/sec peak, 8K reads/sec peak
- Initial shard count: 16 PostgreSQL shards, growing to 64 within 12 months
- Data volume: 2.5B orders/year, 12 TB raw change data/month
- Latency target: CDC to Snowflake in < 10 minutes; daily finance aggregates by 6:00 AM UTC
- Retention: 13 months hot in Snowflake, 7 years archived in S3
Requirements
- Design ingestion from sharded PostgreSQL so each shard can be extracted independently without full-table scans.
- Explain how sharding improves write scalability, parallel ETL throughput, and operational isolation.
- Build a unified
fact_orders model in Snowflake that hides shard boundaries from analysts.
- Support incremental loads, backfills, and shard rebalancing without double-counting records.
- Implement data quality checks for completeness, duplicate primary keys, and cross-shard consistency.
- Define orchestration, monitoring, and recovery for shard-specific failures.
- Show how new shards are onboarded with minimal pipeline changes.
Constraints
- AWS-first stack; existing tools are Airflow, Debezium, Kafka, dbt, Snowflake, and S3.
- PCI scope applies to payment-related columns; sensitive fields must be tokenized before landing in analytics.
- Budget allows moderate Kafka/Snowflake growth, but no custom platform rewrite.
- Finance reports must remain reproducible even after shard splits or rebalancing.