## Context
You’re interviewing with the Growth Data Engineering team at StreamRide, a ride-sharing + food delivery marketplace operating in the US and Western Europe. StreamRide spends $8–12M/week across Google Ads, Meta Ads, TikTok Ads, Apple Search Ads, and affiliate networks. The executive team reviews daily Marketing ROI (spend → sign-ups → first ride/purchase revenue) every morning at 8am local time.
Today, the company’s ROI reporting is unreliable: each ad platform’s data is pulled by a different script, attribution logic is duplicated across dashboards, and sign-up events from web and mobile are inconsistently tagged. The result is that reported ROI can swing ±20% day-over-day due to late-arriving conversions, API backfills, and double-counted clicks. Finance has flagged this as a material risk because these numbers drive weekly budget reallocation decisions.
StreamRide’s data stack is AWS + Snowflake. Product events are emitted to Kafka and landed in S3. The Growth team wants a single source of truth that supports (a) near-real-time monitoring for campaign anomalies and (b) finance-grade daily close with auditability.
## Current Architecture (and why it’s failing)
| Area | Current State | Pain Point |
|---|---|---|
| Ad platform ingestion | Cron scripts calling APIs → CSV in S3 | No retries/backoff, inconsistent schemas, missing historical corrections |
| Sign-up events | Kafka → S3 JSON → Snowflake | UTM parameters missing on iOS SKAdNetwork, duplicate events from retries |
| Attribution | Tableau calculations + ad-hoc SQL | Not versioned, not reproducible, hard to backfill |
| Reporting | One dashboard per channel | No unified ROI, inconsistent definitions |
## Scale Requirements
- Ad data volume: ~50–150M impressions/day, 2–6M clicks/day, and 50–200K conversions/day across all platforms.
- Sign-ups: 300–800K/day (web + iOS + Android); peaks of 20K/min during promotions.
- Freshness:
  - Operational monitoring: < 15 minutes from click/conversion to queryable aggregates.
  - Finance daily close: T+1 by 6am UTC with full reconciliation.
- Late data (see the reprocessing sketch after this list):
  - Ad platforms can restate cost/conversions up to 7 days later.
  - Mobile attribution (SKAdNetwork) can arrive 24–72 hours late.
- Retention:
  - Raw event-level data: 13 months (for YoY comparisons).
  - Aggregates: indefinite.
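Because restatements can arrive for up to 7 days, incremental jobs cannot treat “new data only” as sufficient: each run has to re-pull a trailing window and overwrite it. A minimal sketch of that window calculation, assuming an Airflow-style task receives its logical run date (the function and constant names are illustrative, not part of the spec):

```python
from datetime import date, timedelta

# 7 days matches the platform restatement window above; SKAdNetwork's
# 24-72h lag falls inside it automatically.
RESTATEMENT_DAYS = 7

def reextract_window(run_date: date) -> tuple[date, date]:
    """Date range each incremental run should re-pull and overwrite,
    so late restatements replace earlier partial data."""
    return run_date - timedelta(days=RESTATEMENT_DAYS), run_date

# Example: the 2024-06-10 run re-pulls 2024-06-03..2024-06-10 and
# replaces those partitions rather than appending to them.
start, end = reextract_window(date(2024, 6, 10))
```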
## Data Characteristics
### Key source entities
- Ad platform performance (API extracts):
  - Grain varies by platform: campaign/adset/ad/keyword + date/hour.
  - Fields (typical): `platform`, `account_id`, `campaign_id`, `ad_group_id`, `ad_id`, `keyword`, `impressions`, `clicks`, `spend_micros`, `currency`, `reporting_time`, `ingested_at`.
  - Issues: currency conversion, API pagination, rate limits, restatements, partial-day data (see the extractor sketch after this list).
- Attribution signals:
  - Web: UTM parameters + `gclid`/`fbclid`.
  - Mobile: MMP events (e.g., AppsFlyer/Adjust) + SKAdNetwork postbacks.
  - Issues: missing identifiers, multiple touchpoints, click spam, dedupe.
- Product events (Kafka): `user_signup`, `app_install`, `first_ride`, `first_order`, `purchase`.
  - Issues: duplicates from client retries, out-of-order events, user merges.
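The pagination, rate-limit, and restatement issues above are exactly where the current cron scripts fall over. A hedged sketch of a per-platform extraction loop with exponential backoff and jitter; `fetch_page` is a hypothetical per-platform client callable (not any real platform SDK), assumed to return a page of rows plus a cursor:

```python
import random
import time
from typing import Callable, Iterator, Optional

def paged_extract(
    fetch_page: Callable[[Optional[str]], dict],  # hypothetical: {"rows": [...], "next_cursor": str | None}
    max_retries: int = 5,
) -> Iterator[dict]:
    """Pull every page from one ad platform API, retrying each page with
    exponential backoff + jitter on transient errors (429s, 5xx)."""
    cursor: Optional[str] = None
    while True:
        for attempt in range(max_retries):
            try:
                page = fetch_page(cursor)
                break
            except Exception:  # in production, catch only transient error types
                if attempt == max_retries - 1:
                    raise      # surface to the orchestrator for alerting/DLQ handling
                time.sleep(2 ** attempt + random.random())
        yield from page["rows"]
        cursor = page.get("next_cursor")
        if cursor is None:
            return
```

Each platform’s rows would then be mapped into the shared schema above before landing in S3.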
## Your Task
Design a complete pipeline to compute Marketing ROI by platform/campaign and attribute it to user sign-ups (and optionally downstream revenue). Your design must include ingestion, modeling, orchestration, data quality, and backfill strategy.
### Functional requirements
- Ingest performance data from at least 4 ad platforms with consistent schemas and a clear strategy for API rate limits and restatements.
- Ingest sign-up and conversion events from Kafka/S3 into Snowflake with deduplication and schema-evolution handling (see the dedup sketch after this list).
- Build an attribution model (see the attribution sketch after this list) that supports:
  - Last-touch by default (configurable to first-touch / linear).
  - A lookback window (e.g., 7 days click-through, 1 day view-through).
  - Mobile SKAdNetwork handling, where user-level IDs are unavailable.
- Produce analytics-ready tables:
  - `fact_ad_spend` (cost + clicks + impressions)
  - `fact_signups` (deduped, enriched with channel/campaign)
  - `fact_attribution` (mapping sign-ups → marketing touchpoints)
  - `mart_marketing_roi_daily` (ROI metrics)
- Support incremental loads (hourly) and backfills (7–30 days) without double counting.
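For the dedup requirement, one workable approach is a deterministic fingerprint per logical event, keeping the first copy seen so client retries and Kafka replays collapse to a single row. A sketch, assuming hypothetical payload field names:

```python
import hashlib
import json

def event_fingerprint(event: dict) -> str:
    """Deterministic dedup key: the same logical event (client retry,
    Kafka replay) hashes to the same value regardless of ingestion time."""
    # Assumed fields; ingestion metadata (offsets, ingested_at) is
    # deliberately excluded so retries collapse to one key.
    identity = {
        "event_name": event["event_name"],      # e.g. user_signup
        "user_or_device_id": event.get("user_id") or event.get("device_id"),
        "client_event_ts": event["client_event_ts"],
        "client_event_id": event.get("client_event_id"),  # if the SDK sends one
    }
    payload = json.dumps(identity, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode()).hexdigest()
```

In the warehouse, the same rule is typically expressed in a dbt staging model as `QUALIFY ROW_NUMBER() OVER (PARTITION BY fingerprint ORDER BY ingested_at) = 1`.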
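To make the attribution requirement concrete, here is a minimal last-touch sketch with per-kind lookback windows. The dataclass shape is illustrative only; in the real design this logic would live in a versioned dbt model so finance can audit it:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

CLICK_LOOKBACK = timedelta(days=7)  # click-through window from the spec
VIEW_LOOKBACK = timedelta(days=1)   # view-through window from the spec

@dataclass
class Touchpoint:
    campaign_id: str
    kind: str              # "click" or "view"
    occurred_at: datetime

def last_touch(signup_at: datetime, touches: list[Touchpoint]) -> Optional[Touchpoint]:
    """Last-touch attribution: the most recent eligible touchpoint before
    the sign-up, with a per-kind lookback window. Returning None means
    'organic' (or, on iOS, deferral to campaign-level SKAdNetwork postbacks,
    since no user-level IDs exist there)."""
    def eligible(t: Touchpoint) -> bool:
        window = CLICK_LOOKBACK if t.kind == "click" else VIEW_LOOKBACK
        return t.occurred_at <= signup_at and signup_at - t.occurred_at <= window

    candidates = [t for t in touches if eligible(t)]
    return max(candidates, key=lambda t: t.occurred_at, default=None)
```

Swapping `max` for `min` gives first-touch; linear attribution would split credit across all of `candidates`.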
### Non-functional requirements
- Idempotency: rerunning any job for a given time range must not change results, except for legitimate restatements (see the replace-by-range sketch after this list).
- Auditability: ability to explain a daily ROI number down to raw inputs and attribution version.
- Data quality: automated checks for completeness, duplicates, and spend anomalies.
- Security/compliance: PII handling (email/phone) and GDPR deletion requests within 72 hours.
- Cost control: keep incremental infra spend under $40K/month.
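For idempotency, a common pattern on this stack is replace-by-range: delete and re-insert exactly the date slice a run owns, inside one transaction. A sketch of the Snowflake SQL an Airflow task might issue, where the table and column names are assumptions; with dbt, this maps to the `delete+insert` incremental strategy:

```python
# Hedged sketch: a rerun (or a 7-day restatement re-pull) replaces its own
# slice of fact_ad_spend atomically, so totals never double count.
# Bind parameters use the pyformat style of the Snowflake Python connector.
REPLACE_RANGE_SQL = """
BEGIN;
DELETE FROM fact_ad_spend
 WHERE spend_date BETWEEN %(start)s AND %(end)s
   AND platform = %(platform)s;
INSERT INTO fact_ad_spend
SELECT *
  FROM stg_ad_spend
 WHERE spend_date BETWEEN %(start)s AND %(end)s
   AND platform = %(platform)s;
COMMIT;
"""
```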
## Constraints
- Existing tools: Kafka, S3, Snowflake, Airflow 2.x, dbt are already approved.
- Team: 5 data engineers; strong SQL/dbt skills, moderate Spark experience.
- You cannot rely on ad platforms providing stable unique row IDs; you must define your own keys (see the key sketch after this list).
- Finance requires a stable definition of “spend” and “sign-up” with versioned logic.
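Since no stable row IDs exist, the natural key is the reporting grain itself, hashed into a surrogate. A sketch using the typical fields listed under Data Characteristics:

```python
import hashlib

def ad_spend_key(row: dict) -> str:
    """Surrogate key at the finest common grain; a restatement for the
    same hour updates the row in place instead of inserting a duplicate."""
    grain = (
        row["platform"], row["account_id"], row["campaign_id"],
        row.get("ad_group_id", ""), row.get("ad_id", ""),
        row.get("keyword", ""), row["reporting_time"],  # date/hour bucket
    )
    return hashlib.sha256("|".join(map(str, grain)).encode()).hexdigest()
```

Loads then MERGE on this key, which turns a platform restatement into an update rather than a double-counted row.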
## What we will evaluate
- Correctness under late-arriving and restated data
- Robustness (retries, DLQs, backpressure)
- Data modeling choices (grains, keys, slowly changing dimensions)
- Observability and operational playbooks
- Warehouse performance and cost efficiency