Context
You’re interviewing for a Senior Data Engineer role on the Growth Analytics platform at ShopWave, a global e-commerce marketplace (fashion + electronics) with 35M MAUs, 3.5M DAUs, and ~$4B annual GMV. Marketing spend is ~$40M/month across Google Ads, Meta, TikTok, affiliate networks, and email, and executives use attribution to allocate budget weekly. A 1–2% attribution error can shift millions of dollars of spend.
Today, ShopWave runs a mostly batch pipeline: ad platform exports land in S3 daily, web/app events stream to Kafka, and purchases live in a Postgres OLTP database plus payments-provider webhooks. Analysts stitch these together in ad-hoc SQL, producing inconsistent results (different “source of truth” tables, mismatched lookback windows, and double-counted conversions). The VP of Growth wants a standardized marketing attribution data model that supports first-touch, last-touch, linear, time-decay, and data-driven (weights provided by a model) attribution—while handling late-arriving events, identity resolution, and campaign metadata drift.
Current Architecture (and pain)
| Domain | Source | Delivery | Common Issues |
|---|---|---|---|
| Ad interactions | Google/Meta/TikTok APIs + affiliate S2S postbacks | Daily files to S3 + occasional webhooks | Late reports (24–72h), duplicates, changing campaign names |
| Web/app clickstream | Web SDK + mobile SDK | Kafka (peak 250K events/sec) | Missing/rotating cookies, bot traffic, out-of-order events |
| Purchases | Postgres + payment processor | CDC + webhook | Refunds/chargebacks days later, order updates |
| Identity | Login events + CRM | Batch | Users browse anonymously then log in; multiple devices |
The business needs a consistent schema and pipeline so that dashboards (Looker) and budget optimization jobs use the same attribution logic.
Scale Requirements
- Clickstream: avg 80K events/sec, peak 250K events/sec; ~1.5KB/event JSON
- Ad events: 1–3B impressions/day, 30–80M clicks/day (varies by channel)
- Orders: 2–5M orders/day; peak 20K orders/min during promos
- Freshness:
  - Click/cookie-level touchpoints queryable in Snowflake within 10 minutes
  - Channel-level attribution aggregates within 30 minutes
- Late data:
  - Ad platforms restate metrics for up to 7 days
  - Refunds/chargebacks up to 30 days
- Retention: raw events 90 days; modeled attribution tables 2+ years
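Rough arithmetic on the clickstream figures above (a sketch, assuming the ~1.5KB/event average holds and ignoring compression):

$$
80{,}000\ \text{events/s} \times 86{,}400\ \text{s/day} \approx 6.9\text{B events/day}, \qquad 6.9\text{B} \times 1.5\ \text{KB} \approx 10\ \text{TB/day of raw JSON}
$$

At the 250K events/sec peak that is roughly 375 MB/s into Kafka, and the 90-day raw retention window is on the order of 1 PB uncompressed, which is worth keeping in mind alongside the cost constraints below.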
Data Characteristics (what you must model)
You need to support both event-level and aggregated reporting; a minimal schema sketch follows the list below.
- Touchpoints
  - impression, click, email_open, email_click, affiliate_click, push_open
  - fields: event_id, event_ts, channel, source, campaign_id, adset_id, creative_id, utm_*, gclid/fbclid, device_id, cookie_id, user_id (nullable), ip, user_agent
- Conversions
  - order_created, order_paid, refund, chargeback
  - fields: order_id, user_id, order_ts, revenue, currency, status, is_new_customer
- Identity graph
  - cookie ↔ device ↔ user mappings over time (many-to-one and one-to-many)
  - must be time-bounded (a cookie can be re-assigned on shared devices)
- Campaign metadata
  - campaign names and budgets change; IDs may be stable but naming isn’t
  - need slowly changing dimension behavior
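A minimal DDL sketch of how these entities could land in Snowflake; the table and column names (fct_touchpoint, fct_conversion, map_identity, dim_campaign) and types are illustrative assumptions, not a prescribed design:

```sql
-- Minimal sketch of the core entities (illustrative names, Snowflake dialect).

-- Event-level touchpoints from ads, clickstream, email, affiliates.
create table if not exists fct_touchpoint (
    touchpoint_sk   number identity,              -- surrogate key
    event_id        varchar      not null,        -- natural id from the source
    source_system   varchar      not null,        -- e.g. 'google_ads', 'web_sdk'
    event_type      varchar      not null,        -- impression / click / email_open / ...
    event_ts        timestamp_tz not null,
    channel         varchar,
    campaign_id     varchar,
    adset_id        varchar,
    creative_id     varchar,
    click_id        varchar,                      -- gclid / fbclid / affiliate click id
    cookie_id       varchar,
    device_id       varchar,
    user_id         varchar,                      -- nullable until login
    utm             variant,                      -- utm_* bundle kept semi-structured
    ingested_at     timestamp_tz default current_timestamp()
);

-- Order-level conversions; refunds/chargebacks arrive later as new status rows.
create table if not exists fct_conversion (
    conversion_sk   number identity,
    order_id        varchar      not null,
    user_id         varchar,
    order_ts        timestamp_tz not null,
    status          varchar      not null,        -- order_created / order_paid / refund / chargeback
    revenue         number(18,2),
    currency        varchar(3),
    is_new_customer boolean,
    updated_at      timestamp_tz                  -- drives re-attribution on refunds
);

-- Time-bounded identity edges: a cookie/device maps to a user only for an interval.
create table if not exists map_identity (
    identity_sk     number identity,
    cookie_id       varchar,
    device_id       varchar,
    user_id         varchar      not null,
    valid_from      timestamp_tz not null,
    valid_to        timestamp_tz                  -- null = still current
);

-- Type-2 campaign dimension: stable campaign_id, changing names/budgets over time.
create table if not exists dim_campaign (
    campaign_sk     number identity,
    campaign_id     varchar      not null,
    campaign_name   varchar,
    channel         varchar,
    daily_budget    number(18,2),
    valid_from      timestamp_tz not null,
    valid_to        timestamp_tz,
    is_current      boolean default true
);
```

In practice these would more likely be built as dbt incremental models over staging views than as hand-written DDL, with fct_touchpoint clustered on event date for partition pruning.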
Your Task
Design a Snowflake-centered ELT schema and pipeline that produces a canonical attribution dataset.
Functional Requirements
- Canonical schema that supports:
  - multi-touch attribution per order (first/last/linear/time-decay); see the weight sketch after this list
  - configurable lookback windows (e.g., 7/14/28 days)
  - cross-device attribution via identity resolution
- Incremental processing with correct handling of:
  - late-arriving touchpoints and restated ad reports
  - order updates (refunds/chargebacks) and re-attribution
- Deduplication & idempotency across sources (API pulls, webhooks, SDK retries)
- Data quality framework: schema validation, referential integrity, anomaly detection
- Serving layer for:
  - analyst queries (Looker)
  - downstream budget optimizer that reads daily channel/campaign ROI
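A sketch of the core attribution logic these requirements imply, assuming the illustrative tables from the DDL sketch above, a 28-day lookback, and a 7-day half-life for time-decay (all placeholders); deduplication of SDK retries and repeated API pulls is folded into the first CTE:

```sql
-- Sketch: per-order multi-touch weights (illustrative names, 28-day lookback,
-- 7-day = 168-hour half-life for time-decay). Core logic only, not a full dbt model.

with touchpoints as (
    -- Deduplicate SDK retries / repeated API pulls: one row per source + event_id.
    select *
    from fct_touchpoint
    qualify row_number() over (
        partition by source_system, event_id
        order by ingested_at desc
    ) = 1
),

resolved as (
    -- Resolve anonymous touchpoints to users via the time-bounded identity map.
    select t.*, coalesce(t.user_id, m.user_id) as resolved_user_id
    from touchpoints t
    left join map_identity m
      on  t.cookie_id = m.cookie_id
      and t.event_ts >= m.valid_from
      and t.event_ts <  coalesce(m.valid_to, '9999-12-31'::timestamp_tz)
),

journeys as (
    -- All touchpoints within the lookback window before each paid order.
    select
        c.order_id,
        r.touchpoint_sk,
        datediff('hour', r.event_ts, c.order_ts) as hours_before_order,
        count(*)     over (partition by c.order_id)                          as n_touches,
        row_number() over (partition by c.order_id order by r.event_ts)      as touch_asc,
        row_number() over (partition by c.order_id order by r.event_ts desc) as touch_desc
    from fct_conversion c
    join resolved r
      on  r.resolved_user_id = c.user_id
      and r.event_ts between dateadd('day', -28, c.order_ts) and c.order_ts
    where c.status = 'order_paid'
)

select
    order_id,
    touchpoint_sk,
    iff(touch_asc  = 1, 1.0, 0.0)              as w_first_touch,
    iff(touch_desc = 1, 1.0, 0.0)              as w_last_touch,
    1.0 / n_touches                            as w_linear,
    power(0.5, hours_before_order / 168.0)
      / sum(power(0.5, hours_before_order / 168.0))
            over (partition by order_id)       as w_time_decay
from journeys;
```

Data-driven weights would come from a separately versioned model-output table joined in at the same grain, and the lookback and half-life should be read from configuration rather than hard-coded (see the Constraints section).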
Non-Functional Requirements
- Cost control: keep incremental Snowflake compute under $20K/month for this domain
- Auditability: ability to explain “why was this order attributed to X?” (touchpoint trail)
- Privacy/compliance: GDPR/CCPA deletion within 72 hours across modeled tables
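For the auditability requirement, an event-level output table makes “why was this order attributed to X?” a simple join back to the touchpoint trail. A sketch, reusing the illustrative names above plus a hypothetical fct_attribution_touch output table:

```sql
-- Sketch: audit trail for a single order (illustrative/hypothetical table names).
select
    a.order_id,
    a.attribution_model,          -- e.g. 'time_decay'
    a.weight,
    t.event_ts,
    t.event_type,
    t.channel,
    d.campaign_name               -- name as of the touchpoint's timestamp (SCD2)
from fct_attribution_touch a
join fct_touchpoint t on t.touchpoint_sk = a.touchpoint_sk
left join dim_campaign d
  on  d.campaign_id = t.campaign_id
  and t.event_ts >= d.valid_from
  and t.event_ts <  coalesce(d.valid_to, '9999-12-31'::timestamp_tz)
where a.order_id = 'ORD-123'      -- placeholder order id
order by t.event_ts;
```

Keeping user_id (or a resolvable cookie_id/device_id) on every modeled row also makes the 72-hour GDPR/CCPA deletion tractable: it becomes a delete-by-user_id sweep over a small, known set of tables.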
Constraints
- Cloud: AWS + Snowflake already standardized
- Orchestration: Airflow 2.x is required (managed MWAA)
- Transformations: team prefers dbt for SQL modeling
- Streaming: Kafka exists; Spark is available on EMR, but team size is small (5 DEs)
- Attribution logic must be reproducible and versioned (no silent logic changes)
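On the reproducibility constraint, one option (sketched with the same illustrative naming) is to pin every attribution row to an explicit, versioned parameter set, e.g. a dbt seed or small config table, instead of hard-coding lookbacks and half-lives in SQL:

```sql
-- Sketch: versioned attribution parameters (illustrative; could be a dbt seed).
create table if not exists dim_attribution_config (
    config_version    varchar      not null,    -- e.g. 'v3'
    attribution_model varchar      not null,    -- first_touch / last_touch / linear / time_decay / data_driven
    lookback_days     integer      not null,
    half_life_hours   integer,                  -- only used by time_decay
    valid_from        timestamp_tz not null,
    valid_to          timestamp_tz
);

-- Each attributed row then records which config produced it, so results can be
-- reproduced and logic changes are explicit rather than silent, e.g.:
-- fct_attribution_touch(order_id, touchpoint_sk, attribution_model,
--                       weight, config_version, computed_at)
```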
What to Produce in Your Answer
- Propose the core tables (fact/dimension) and keys (including surrogate keys where needed)
- Explain how you will model:
  - touchpoints vs conversions
  - identity mapping over time
  - campaign metadata changes (SCD)
  - attribution outputs (event-level and aggregated)
- Describe incremental/backfill strategy for 7-day restatements and 30-day refunds
- Outline orchestration, data quality checks, and how you’d monitor correctness and freshness
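For the restatement/backfill part of the answer, one common pattern is a dbt incremental model that always reprocesses a rolling window wider than the worst-case lateness; the sketch below (model and column names assumed from the earlier sketches) rebuilds the trailing 7 days of attributed revenue on every run to absorb ad-platform restatements:

```sql
-- Sketch of a dbt incremental model for daily campaign-level attribution,
-- re-merging a rolling window to absorb 7-day ad restatements.
-- (Model/column names follow the illustrative schema above.)

{{ config(
    materialized = 'incremental',
    unique_key   = ['attribution_date', 'campaign_id', 'attribution_model'],
    incremental_strategy = 'merge'
) }}

select
    date_trunc('day', c.order_ts)::date as attribution_date,
    t.campaign_id,
    a.attribution_model,
    sum(a.weight * c.revenue)           as attributed_revenue,
    sum(a.weight)                       as attributed_orders   -- fractional orders
from {{ ref('fct_attribution_touch') }} a
join {{ ref('fct_conversion') }} c on c.order_id      = a.order_id
join {{ ref('fct_touchpoint') }} t on t.touchpoint_sk = a.touchpoint_sk
where c.status = 'order_paid'
{% if is_incremental() %}
  -- A touchpoint restated up to 7 days late can only affect orders placed since
  -- it occurred, so rebuilding the trailing 7 days of order dates is sufficient.
  and date_trunc('day', c.order_ts)::date >= dateadd('day', -7, current_date())
{% endif %}
group by 1, 2, 3
```

Refunds and chargebacks that land up to 30 days later can be handled either as negative adjustment rows dated on the refund date, or by a lower-frequency job that re-runs the same model with a wider window (e.g. ~35 days) passed in as a dbt variable.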