Context
BrightCart, a DTC retail company, uses Google Ads, Meta Ads, HubSpot, Shopify, and GA4 to measure campaign performance. Today, analysts manually export CSVs and run ad hoc SQL, causing reporting delays of up to 24 hours, inconsistent attribution logic, and frequent metric mismatches between finance and marketing.
Design a production-grade marketing analytics pipeline that centralizes paid media, web analytics, CRM, and order data into a warehouse for daily and near-real-time reporting. The goal is to support campaign, channel, and cohort analysis with reliable definitions for spend, clicks, sessions, leads, orders, CAC, and ROAS.
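Two of the listed metrics have simple arithmetic definitions worth pinning down early, since finance/marketing mismatches often start here. The sketch below assumes the conventional forms (CAC = spend / new customers acquired, ROAS = attributed revenue / spend); the exact definitions should be agreed with finance before modeling.

```python
def cac(spend: float, new_customers: int) -> float:
    """Customer acquisition cost: total ad spend per new customer acquired."""
    if new_customers == 0:
        return float("inf")  # no acquisitions yet; avoid division by zero
    return spend / new_customers

def roas(revenue: float, spend: float) -> float:
    """Return on ad spend: attributed revenue per dollar of spend."""
    if spend == 0:
        return float("inf")
    return revenue / spend

# Example: $12,000 spend driving 300 new customers and $48,000 revenue
campaign_cac = cac(12_000.0, 300)    # 40.0 dollars per customer
campaign_roas = roas(48_000.0, 12_000.0)  # 4.0x
```

The useful part of codifying these is not the division itself but forcing agreement on the inputs: which orders count as "new customers" and which revenue is "attributed" are exactly the definitions the warehouse models must standardize.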
Scale Requirements
- Sources: 8-12 SaaS APIs initially, expanding to 20 within 12 months
- Volume: ~50M ad/impression/click/session events per day, 2M CRM updates/day, 500K orders/day
- Latency: API-source data available in warehouse within 30 minutes; daily finance-grade aggregates by 7:00 AM UTC
- Storage: 5 TB raw history initially, growing to 40 TB in 2 years
- Freshness SLA: 99% of source extracts complete within 20 minutes of schedule
Requirements
- Ingest data from Google Ads, Meta Ads, GA4, HubSpot, and Shopify with incremental extraction using API cursors or updated timestamps.
- Land raw data in cloud object storage, preserve source payloads, and support replay/backfills for 180 days.
- Load raw data into Snowflake and build standardized fact/dimension models for campaigns, ad groups, creatives, sessions, leads, and orders.
- Implement identity mapping across utm_* parameters, click IDs, email, and customer IDs to support multi-touch marketing analysis.
- Add data quality checks for schema drift, duplicate loads, null spikes, spend anomalies, and reconciliation against source totals.
- Orchestrate dependencies, retries, and backfills, and expose dashboards used by marketing and finance.
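The incremental-extraction requirement above can be sketched with an updated-at watermark. `fetch_page` here is a hypothetical callable standing in for a vendor API call; real connectors would use each vendor's SDK and its own pagination scheme (page tokens for Google Ads, cursors for HubSpot, and so on).

```python
def extract_since(fetch_page, watermark: str):
    """Pull all records updated after `watermark`, page by page.

    fetch_page(since, page_token) is assumed to return
    (records, next_page_token), with next_page_token None on the last page.
    Returns (records, new_watermark) so the orchestrator can persist state.
    """
    out, token = [], None
    while True:
        records, token = fetch_page(watermark, token)
        out.extend(records)
        if token is None:
            break
    # Advance the watermark only after all pages landed successfully;
    # a failed run keeps the old watermark and safely replays.
    new_watermark = max((r["updated_at"] for r in out), default=watermark)
    return out, new_watermark

# Example with a two-page fake API response:
_pages = {None: ([{"id": 1, "updated_at": "2024-01-02"}], "p2"),
          "p2": ([{"id": 2, "updated_at": "2024-01-03"}], None)}
records, new_wm = extract_since(lambda since, tok: _pages[tok], "2024-01-01")
```

Persisting the watermark only after a fully successful run is what makes the 180-day replay/backfill requirement tractable: any failed extract can be rerun from its last checkpoint without duplicating committed data.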
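One possible shape for the identity-mapping requirement is a shared-identifier graph resolved with union-find: touchpoints carrying any overlapping identifier (click ID, email, customer ID) collapse into one identity cluster. Field names below are illustrative, not the actual warehouse schema, and real pipelines would add precedence rules (e.g. trust customer_id over email).

```python
class UnionFind:
    """Minimal union-find with path halving for identity clustering."""
    def __init__(self):
        self.parent = {}
    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x
    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

def resolve_identities(touchpoints):
    """Group touchpoints that share any non-null identifier into clusters."""
    uf = UnionFind()
    for tp in touchpoints:
        ids = [f"{k}:{v}" for k, v in tp.items() if v is not None]
        for other in ids[1:]:
            uf.union(ids[0], other)
    clusters = {}
    for tp in touchpoints:
        ids = [f"{k}:{v}" for k, v in tp.items() if v is not None]
        if not ids:
            continue  # anonymous touchpoint: nothing to link on
        clusters.setdefault(uf.find(ids[0]), []).append(tp)
    return list(clusters.values())

# Example: an ad click, a landing-page session, and a CRM record that
# chain together via click_id and email; a second click stays separate.
clusters = resolve_identities([
    {"click_id": "g123", "email": None, "customer_id": None},
    {"click_id": "g123", "email": "a@x.com", "customer_id": None},
    {"click_id": None, "email": "a@x.com", "customer_id": "C1"},
    {"click_id": "f999", "email": None, "customer_id": None},
])
```

The transitive linking is the point: the paid click never carries a customer ID directly, but it still attributes to customer C1 through the shared email, which is what multi-touch analysis needs.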
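Two of the listed quality checks can be sketched as simple functions: duplicate-load detection on a business key, and a null-spike test against a trailing baseline. The 5-point tolerance is illustrative; in practice these checks would live in dbt tests or a framework like Great Expectations rather than standalone code.

```python
from collections import Counter

def duplicate_rate(rows, key: str) -> float:
    """Fraction of rows whose business key appears more than once
    (a signal that an extract was loaded twice)."""
    if not rows:
        return 0.0
    counts = Counter(r[key] for r in rows)
    dupes = sum(c for c in counts.values() if c > 1)
    return dupes / len(rows)

def null_spike(today_null_rate: float, baseline_rates: list[float],
               tolerance: float = 0.05) -> bool:
    """Flag when today's null rate exceeds the trailing average by
    more than `tolerance` (often a symptom of upstream schema drift)."""
    baseline = sum(baseline_rates) / len(baseline_rates)
    return today_null_rate > baseline + tolerance

# Examples: one duplicated order id; a jump from ~2% to 20% nulls
dup = duplicate_rate([{"id": 1}, {"id": 1}, {"id": 2}], "id")
spiked = null_spike(0.20, [0.02, 0.03, 0.01])
steady = null_spike(0.03, [0.02, 0.03, 0.01])
```

Reconciliation against source totals (e.g. daily spend per account vs. the ad platform's own reporting UI) follows the same pattern but compares aggregates across systems rather than rows within one load.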
Constraints
- AWS is the required cloud; Snowflake is already the enterprise warehouse.
- Team size is 3 data engineers and 2 analytics engineers.
- Incremental infrastructure budget is capped at $18K/month.
- PII must be encrypted at rest and masked in analyst-facing schemas; deletion requests must propagate within 72 hours.
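For the masking constraint, a salted deterministic hash keeps a PII column (such as email) joinable across tables without exposing the raw value to analysts. In Snowflake this would typically be implemented as a masking policy on the analyst-facing schema; the Python sketch below just illustrates the transform, and the hard-coded salt is a placeholder for a value held in a secrets manager.

```python
import hashlib

def mask_email(email: str, salt: bytes = b"example-salt") -> str:
    """Deterministic, salted mask: equal emails map to equal tokens,
    so joins and counts still work, but the address is not recoverable
    without the salt. Lowercasing normalizes case variants first."""
    digest = hashlib.sha256(salt + email.lower().encode("utf-8")).hexdigest()
    return f"masked_{digest[:16]}"

token_a = mask_email("A@x.com")   # same token as "a@x.com"
token_b = mask_email("b@x.com")   # different token
```

Determinism also helps the 72-hour deletion requirement: given a deletion request for an email, the same hash locates every masked occurrence to purge, while raw values live only in the restricted, encrypted landing zone.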