Context
You’re interviewing for a Senior Data Engineer role on the Growth Analytics platform at ShopWave, a global e-commerce marketplace (fashion + electronics) with 35M MAUs, 3.5M DAUs, and ~$4B annual GMV. Marketing spend is ~$40M/month across Google Ads, Meta, TikTok, affiliate networks, and email, and executives use attribution to allocate budget weekly. A 1–2% attribution error can shift millions of dollars of spend.
Today, ShopWave runs a mostly batch pipeline: ad platform exports land in S3 daily, web/app events stream to Kafka, and purchases live in a Postgres OLTP database plus payments-provider webhooks. Analysts stitch these together in ad-hoc SQL, producing inconsistent results (different “source of truth” tables, mismatched lookback windows, and double-counted conversions). The VP of Growth wants a standardized marketing attribution data model that supports first-touch, last-touch, linear, time-decay, and data-driven (weights provided by a model) attribution—while handling late-arriving events, identity resolution, and campaign metadata drift.
Current Architecture (and pain)
| Domain | Source | Delivery | Common Issues |
|---|---|---|---|
| Ad interactions | Google/Meta/TikTok APIs + affiliate S2S postbacks | Daily files to S3 + occasional webhooks | Late reports (24–72h), duplicates, changing campaign names |
| Web/app clickstream | Web SDK + mobile SDK | Kafka (peak 250K events/sec) | Missing/rotating cookies, bot traffic, out-of-order events |
| Purchases | Postgres + payment processor | CDC + webhook | Refunds/chargebacks days later, order updates |
| Identity | Login events + CRM | Batch | Users browse anonymously then log in; multiple devices |
The business needs a consistent schema and pipeline so that dashboards (Looker) and budget optimization jobs use the same attribution logic.
Scale Requirements
- Clickstream: avg 80K events/sec, peak 250K events/sec; ~1.5KB/event JSON
- Ad events: 1–3B impressions/day, 30–80M clicks/day (varies by channel)
- Orders: 2–5M orders/day; peak 20K orders/min during promos
- Freshness:
  - Click/cookie-level touchpoints queryable in Snowflake within 10 minutes
  - Channel-level attribution aggregates within 30 minutes
- Late data:
  - Ad platforms restate metrics for up to 7 days
  - Refunds/chargebacks up to 30 days
- Retention: raw events 90 days; modeled attribution tables 2+ years
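Rough arithmetic on the clickstream figures above (a sketch, assuming the ~1.5KB/event average holds and ignoring compression):

$$
80{,}000\ \text{events/s} \times 86{,}400\ \text{s/day} \approx 6.9\text{B events/day}, \qquad 6.9\text{B} \times 1.5\ \text{KB} \approx 10\ \text{TB/day of raw JSON}
$$

At the 250K events/sec peak that is roughly 375 MB/s into Kafka, and the 90-day raw retention window is on the order of 1 PB uncompressed, which is worth keeping in mind alongside the cost constraints below.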
Data Characteristics (what you must model)
You need to support both event-level and aggregated reporting; a minimal schema sketch follows the list below.
- Touchpoints
  - impression, click, email_open, email_click, affiliate_click, push_open
  - fields: event_id, event_ts, channel, source, campaign_id, adset_id, creative_id, utm_*, gclid/fbclid, device_id, cookie_id, user_id (nullable), ip, user_agent
- Conversions
  - order_created, order_paid, refund, chargeback
  - fields: order_id, user_id, order_ts, revenue, currency, status, is_new_customer
- Identity graph
  - cookie ↔ device ↔ user mappings over time (many-to-one and one-to-many)
  - must be time-bounded (a cookie can be re-assigned on shared devices)
- Campaign metadata
  - campaign names and budgets change; IDs may be stable but naming isn’t
  - need slowly changing dimension behavior
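A minimal DDL sketch of how these entities could land in Snowflake; the table and column names (fct_touchpoint, fct_conversion, map_identity, dim_campaign) and types are illustrative assumptions, not a prescribed design:

```sql
-- Minimal sketch of the core entities (illustrative names, Snowflake dialect).

-- Event-level touchpoints from ads, clickstream, email, affiliates.
create table if not exists fct_touchpoint (
    touchpoint_sk   number identity,              -- surrogate key
    event_id        varchar      not null,        -- natural id from the source
    source_system   varchar      not null,        -- e.g. 'google_ads', 'web_sdk'
    event_type      varchar      not null,        -- impression / click / email_open / ...
    event_ts        timestamp_tz not null,
    channel         varchar,
    campaign_id     varchar,
    adset_id        varchar,
    creative_id     varchar,
    click_id        varchar,                      -- gclid / fbclid / affiliate click id
    cookie_id       varchar,
    device_id       varchar,
    user_id         varchar,                      -- nullable until login
    utm             variant,                      -- utm_* bundle kept semi-structured
    ingested_at     timestamp_tz default current_timestamp()
);

-- Order-level conversions; refunds/chargebacks arrive later as new status rows.
create table if not exists fct_conversion (
    conversion_sk   number identity,
    order_id        varchar      not null,
    user_id         varchar,
    order_ts        timestamp_tz not null,
    status          varchar      not null,        -- order_created / order_paid / refund / chargeback
    revenue         number(18,2),
    currency        varchar(3),
    is_new_customer boolean,
    updated_at      timestamp_tz                  -- drives re-attribution on refunds
);

-- Time-bounded identity edges: a cookie/device maps to a user only for an interval.
create table if not exists map_identity (
    identity_sk     number identity,
    cookie_id       varchar,
    device_id       varchar,
    user_id         varchar      not null,
    valid_from      timestamp_tz not null,
    valid_to        timestamp_tz                  -- null = still current
);

-- Type-2 campaign dimension: stable campaign_id, changing names/budgets over time.
create table if not exists dim_campaign (
    campaign_sk     number identity,
    campaign_id     varchar      not null,
    campaign_name   varchar,
    channel         varchar,
    daily_budget    number(18,2),
    valid_from      timestamp_tz not null,
    valid_to        timestamp_tz,
    is_current      boolean default true
);
```

In practice these would more likely be built as dbt incremental models over staging views than as hand-written DDL, with fct_touchpoint clustered on event date for partition pruning.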
Your Task
Design a Snowflake-centered ELT schema and pipeline that produces a canonical attribution dataset.
Functional Requirements
- Canonical schema that supports:
  - multi-touch attribution per order (first/last/linear/time-decay); see the weight sketch after this list
  - configurable lookback windows (e.g., 7/14/28 days)
  - cross-device attribution via identity resolution
- Incremental processing with correct handling of:
  - late-arriving touchpoints and restated ad reports
  - order updates (refunds/chargebacks) and re-attribution
- Deduplication & idempotency across sources (API pulls, webhooks, SDK retries)
- Data quality framework: schema validation, referential integrity, anomaly detection
- Serving layer for:
  - analyst queries (Looker)
  - downstream budget optimizer that reads daily channel/campaign ROI
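A sketch of the core attribution logic these requirements imply, assuming the illustrative tables from the DDL sketch above, a 28-day lookback, and a 7-day half-life for time-decay (all placeholders); deduplication of SDK retries and repeated API pulls is folded into the first CTE:

```sql
-- Sketch: per-order multi-touch weights (illustrative names, 28-day lookback,
-- 7-day = 168-hour half-life for time-decay). Core logic only, not a full dbt model.

with touchpoints as (
    -- Deduplicate SDK retries / repeated API pulls: one row per source + event_id.
    select *
    from fct_touchpoint
    qualify row_number() over (
        partition by source_system, event_id
        order by ingested_at desc
    ) = 1
),

resolved as (
    -- Resolve anonymous touchpoints to users via the time-bounded identity map.
    select t.*, coalesce(t.user_id, m.user_id) as resolved_user_id
    from touchpoints t
    left join map_identity m
      on  t.cookie_id = m.cookie_id
      and t.event_ts >= m.valid_from
      and t.event_ts <  coalesce(m.valid_to, '9999-12-31'::timestamp_tz)
),

journeys as (
    -- All touchpoints within the lookback window before each paid order.
    select
        c.order_id,
        r.touchpoint_sk,
        datediff('hour', r.event_ts, c.order_ts) as hours_before_order,
        count(*)     over (partition by c.order_id)                          as n_touches,
        row_number() over (partition by c.order_id order by r.event_ts)      as touch_asc,
        row_number() over (partition by c.order_id order by r.event_ts desc) as touch_desc
    from fct_conversion c
    join resolved r
      on  r.resolved_user_id = c.user_id
      and r.event_ts between dateadd('day', -28, c.order_ts) and c.order_ts
    where c.status = 'order_paid'
)

select
    order_id,
    touchpoint_sk,
    iff(touch_asc  = 1, 1.0, 0.0)              as w_first_touch,
    iff(touch_desc = 1, 1.0, 0.0)              as w_last_touch,
    1.0 / n_touches                            as w_linear,
    power(0.5, hours_before_order / 168.0)
      / sum(power(0.5, hours_before_order / 168.0))
            over (partition by order_id)       as w_time_decay
from journeys;
```

Data-driven weights would come from a separately versioned model-output table joined in at the same grain, and the lookback and half-life should be read from configuration rather than hard-coded (see the Constraints section).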
Non-Functional Requirements
- Cost control: keep incremental Snowflake compute under $20K/month for this domain
- Auditability: ability to explain “why was this order attributed to X?” (touchpoint trail)
- Privacy/compliance: GDPR/CCPA deletion within 72 hours across modeled tables
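For the auditability requirement, an event-level output table makes “why was this order attributed to X?” a simple join back to the touchpoint trail. A sketch, reusing the illustrative names above plus a hypothetical fct_attribution_touch output table:

```sql
-- Sketch: audit trail for a single order (illustrative/hypothetical table names).
select
    a.order_id,
    a.attribution_model,          -- e.g. 'time_decay'
    a.weight,
    t.event_ts,
    t.event_type,
    t.channel,
    d.campaign_name               -- name as of the touchpoint's timestamp (SCD2)
from fct_attribution_touch a
join fct_touchpoint t on t.touchpoint_sk = a.touchpoint_sk
left join dim_campaign d
  on  d.campaign_id = t.campaign_id
  and t.event_ts >= d.valid_from
  and t.event_ts <  coalesce(d.valid_to, '9999-12-31'::timestamp_tz)
where a.order_id = 'ORD-123'      -- placeholder order id
order by t.event_ts;
```

Keeping user_id (or a resolvable cookie_id/device_id) on every modeled row also makes the 72-hour GDPR/CCPA deletion tractable: it becomes a delete-by-user_id sweep over a small, known set of tables.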
Constraints
- Cloud: AWS + Snowflake already standardized
- Orchestration: Airflow 2.x is required (managed MWAA)
- Transformations: team prefers dbt for SQL modeling
- Streaming: Kafka exists; Spark is available on EMR, but team size is small (5 DEs)
- Attribution logic must be reproducible and versioned (no silent logic changes)
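On the reproducibility constraint, one option (sketched with the same illustrative naming) is to pin every attribution row to an explicit, versioned parameter set, e.g. a dbt seed or small config table, instead of hard-coding lookbacks and half-lives in SQL:

```sql
-- Sketch: versioned attribution parameters (illustrative; could be a dbt seed).
create table if not exists dim_attribution_config (
    config_version    varchar      not null,    -- e.g. 'v3'
    attribution_model varchar      not null,    -- first_touch / last_touch / linear / time_decay / data_driven
    lookback_days     integer      not null,
    half_life_hours   integer,                  -- only used by time_decay
    valid_from        timestamp_tz not null,
    valid_to          timestamp_tz
);

-- Each attributed row then records which config produced it, so results can be
-- reproduced and logic changes are explicit rather than silent, e.g.:
-- fct_attribution_touch(order_id, touchpoint_sk, attribution_model,
--                       weight, config_version, computed_at)
```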
What to Produce in Your Answer
- Propose the core tables (fact/dimension) and keys (including surrogate keys where needed)
- Explain how you will model:
  - touchpoints vs conversions
  - identity mapping over time
  - campaign metadata changes (SCD)
  - attribution outputs (event-level and aggregated)
- Describe incremental/backfill strategy for 7-day restatements and 30-day refunds
- Outline orchestration, data quality checks, and how you’d monitor correctness and freshness
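For the restatement/backfill part of the answer, one common pattern is a dbt incremental model that always reprocesses a rolling window wider than the worst-case lateness; the sketch below (model and column names assumed from the earlier sketches) rebuilds the trailing 7 days of attributed revenue on every run to absorb ad-platform restatements:

```sql
-- Sketch of a dbt incremental model for daily campaign-level attribution,
-- re-merging a rolling window to absorb 7-day ad restatements.
-- (Model/column names follow the illustrative schema above.)

{{ config(
    materialized = 'incremental',
    unique_key   = ['attribution_date', 'campaign_id', 'attribution_model'],
    incremental_strategy = 'merge'
) }}

select
    date_trunc('day', c.order_ts)::date as attribution_date,
    t.campaign_id,
    a.attribution_model,
    sum(a.weight * c.revenue)           as attributed_revenue,
    sum(a.weight)                       as attributed_orders   -- fractional orders
from {{ ref('fct_attribution_touch') }} a
join {{ ref('fct_conversion') }} c on c.order_id      = a.order_id
join {{ ref('fct_touchpoint') }} t on t.touchpoint_sk = a.touchpoint_sk
where c.status = 'order_paid'
{% if is_incremental() %}
  -- A touchpoint restated up to 7 days late can only affect orders placed since
  -- it occurred, so rebuilding the trailing 7 days of order dates is sufficient.
  and date_trunc('day', c.order_ts)::date >= dateadd('day', -7, current_date())
{% endif %}
group by 1, 2, 3
```

Refunds and chargebacks that land up to 30 days later can be handled either as negative adjustment rows dated on the refund date, or by a lower-frequency job that re-runs the same model with a wider window (e.g. ~35 days) passed in as a dbt variable.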