Context
You’re interviewing for a Senior Data Engineer role on the Monetization Analytics team at StreamForge, a B2B SaaS video platform (think “Vimeo for enterprises”) with 12M monthly active end-users across its customers. StreamForge sells tiered subscriptions (Starter/Pro/Enterprise) and usage-based add-ons (transcoding minutes, storage overages). Billing is managed in Stripe.
Today, the company has two disconnected data worlds:
- Stripe billing data (customers, subscriptions, invoices, charges, refunds, disputes) is synced nightly via a vendor connector into Snowflake. Finance uses it for revenue reporting, but it’s a day behind (T+1) and often wrong for the current day due to late-arriving webhooks and backdated invoice adjustments.
- Product event logs (page views, feature usage, “export video”, “invite teammate”, etc.) flow through Kafka and land in S3, then are batch-loaded into Snowflake every 6 hours. Product analytics can’t reliably attribute usage to the correct paid plan at the time of the event.
Leadership wants a single, trustworthy “billing + product usage” dataset to answer questions like: “Which features drive upgrades within 7 days?”, “What is ARPA (average revenue per account) by cohort and usage intensity?”, and “Are we providing paid features to delinquent accounts?” The stakes are high: incorrect plan attribution is causing mispriced enterprise renewals and revenue leakage. Additionally, StreamForge is preparing for SOC 2 Type II and needs auditable lineage and access controls for billing data.
Scale Requirements
- Product events: avg 120K events/sec, peak 350K events/sec during webinars; ~8 TB/day raw JSON.
- Stripe events: ~3M webhook events/day (invoices, charges, payment_intent, customer updates), bursty (retries can spike 10x).
- Freshness SLA:
- Product events queryable in Snowflake: < 10 minutes P95
- Stripe billing state queryable: < 5 minutes P95 from webhook receipt
- Correctness SLA:
- End-to-end dedupe: exactly-once logical processing for Stripe webhooks
- Plan attribution correctness: >= 99.9% for events within the last 30 days
- Retention:
- Raw immutable logs: 13 months
- Curated analytics tables: indefinite
Data Characteristics
Stripe (webhooks + periodic reconciliation)
- Out-of-order delivery and retries; the same event may be delivered many times (see the dedupe sketch below).
- Late-arriving adjustments: invoices are finalized hours after usage; refunds/disputes can occur weeks later.
- Key entities: customer.id, subscription.id, invoice.id, charge.id, payment_intent.id
- Event envelope: event.id, event.type, created (Stripe timestamp)
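
Because Stripe retries deliveries, event.id can serve as the idempotency key: anything already processed is dropped. A minimal sketch of that dedupe step is below, assuming deduplication happens at ingestion; the in-memory `seen_event_ids` set and the payload shape are illustrative (in practice the set would be a durable dedupe table keyed on event_id).

```python
from typing import Dict, Iterable, Iterator, Set


def dedupe_stripe_events(
    events: Iterable[Dict],
    seen_event_ids: Set[str],
) -> Iterator[Dict]:
    """Yield each Stripe webhook event at most once, keyed on its event id.

    Retries and redeliveries reuse the same event id, so the id acts as the
    idempotency key. `seen_event_ids` would normally be backed by a durable
    store, not an in-memory set (an assumption made here for brevity).
    """
    for event in events:
        event_id = event["id"]  # Stripe event envelope: id, type, created
        if event_id in seen_event_ids:
            continue  # duplicate delivery -> drop
        seen_event_ids.add(event_id)
        yield event


# Example: two deliveries of the same event collapse to one logical record.
payloads = [
    {"id": "evt_001", "type": "invoice.finalized", "created": 1714000000},
    {"id": "evt_001", "type": "invoice.finalized", "created": 1714000000},  # retry
    {"id": "evt_002", "type": "charge.refunded", "created": 1714003600},
]
unique = list(dedupe_stripe_events(payloads, set()))
assert [e["id"] for e in unique] == ["evt_001", "evt_002"]
```
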
Product events (client + server)
- Semi-structured JSON with evolving schemas (see the normalization sketch after this list).
- Common fields: event_id, event_name, occurred_at, received_at, user_id, workspace_id, anonymous_id, properties.
- Identity is messy: users can belong to multiple workspaces, and a workspace maps to a Stripe customer via an internal CRM table that is sometimes updated late.
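
The evolving-schema point usually translates into a validate-and-normalize step at ingestion. The sketch below is one way to enforce the common fields while preserving unknown properties; the ProductEvent shape, the dead-letter behavior, and the ISO-8601 timestamp assumption are illustrative rather than a prescribed design.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any, Dict, Optional

REQUIRED_FIELDS = ("event_id", "event_name", "occurred_at", "received_at")


@dataclass
class ProductEvent:
    event_id: str
    event_name: str
    occurred_at: datetime
    received_at: datetime
    user_id: Optional[str] = None
    workspace_id: Optional[str] = None
    anonymous_id: Optional[str] = None
    properties: Dict[str, Any] = field(default_factory=dict)


def parse_event(raw: Dict[str, Any]) -> Optional[ProductEvent]:
    """Validate one raw JSON event against the enforced common fields.

    Returns None when required fields are missing (such events would be routed
    to a dead-letter location). Unknown keys inside `properties` are kept so
    that schema evolution never silently drops data.
    """
    if any(name not in raw for name in REQUIRED_FIELDS):
        return None
    return ProductEvent(
        event_id=raw["event_id"],
        event_name=raw["event_name"],
        # Assumes ISO-8601 timestamps, e.g. "2024-05-01T12:34:56+00:00".
        occurred_at=datetime.fromisoformat(raw["occurred_at"]),
        received_at=datetime.fromisoformat(raw["received_at"]),
        user_id=raw.get("user_id"),
        workspace_id=raw.get("workspace_id"),
        anonymous_id=raw.get("anonymous_id"),
        properties=raw.get("properties", {}),
    )


# Example: a well-formed "export video" event parses; a truncated one is rejected.
ok = parse_event({
    "event_id": "e-1", "event_name": "export video",
    "occurred_at": "2024-05-01T12:34:56+00:00",
    "received_at": "2024-05-01T12:35:02+00:00",
    "workspace_id": "ws-42", "properties": {"duration_s": 93},
})
assert ok is not None and ok.event_name == "export video"
assert parse_event({"event_name": "export video"}) is None
```
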
Requirements
Functional
- Ingest Stripe webhooks in near real time, deduplicate, and materialize canonical billing tables (customers, subscriptions, invoices, charges, refunds).
- Ingest product events in near real time and build a clean, query-friendly event table with enforced schema and dedupe.
- Build a time-valid plan attribution model: for each product event, attach the effective plan and billing status (trialing/active/past_due/canceled) as of the event time (see the temporal-join sketch after this list).
- Support late-arriving data and backfills:
- Stripe events arriving late/out-of-order
- Product events arriving up to 72 hours late (mobile offline)
- Reprocessing for the last 30 days without double counting
- Provide analytics-ready marts for:
- Upgrade funnel (feature usage → upgrade)
- Revenue + usage (MRR, refunds, disputes) joined to engagement
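
For the plan attribution requirement, a common approach is an SCD2 subscription dimension plus an as-of (temporal) join on event time; in the warehouse this would be a range join in dbt/SQL. The Python below only illustrates the lookup semantics, and the PlanVersion shape and plan names are assumptions.

```python
from bisect import bisect_right
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class PlanVersion:
    """One SCD2 row: the plan and billing status effective over [valid_from, valid_to)."""
    valid_from: datetime
    valid_to: datetime        # open-ended current rows can use datetime.max
    plan: str                 # e.g. starter / pro / enterprise
    billing_status: str       # trialing / active / past_due / canceled


def plan_as_of(history: List[PlanVersion], occurred_at: datetime) -> Optional[PlanVersion]:
    """Temporal (as-of) lookup: return the plan version in effect at event time.

    `history` must be sorted by valid_from with non-overlapping intervals,
    which is exactly what an SCD2 subscription dimension provides.
    """
    starts = [v.valid_from for v in history]
    idx = bisect_right(starts, occurred_at) - 1  # last version starting <= event time
    if idx < 0:
        return None                              # event predates any subscription
    version = history[idx]
    return version if occurred_at < version.valid_to else None


# Example: a workspace that upgraded from Starter to Pro on March 1st.
history = [
    PlanVersion(datetime(2024, 1, 1), datetime(2024, 3, 1), "starter", "active"),
    PlanVersion(datetime(2024, 3, 1), datetime.max, "pro", "active"),
]
assert plan_as_of(history, datetime(2024, 2, 15)).plan == "starter"
assert plan_as_of(history, datetime(2024, 4, 1)).plan == "pro"
```

Because the join key is the event time rather than load time, reprocessing late events against the same SCD2 history reproduces the same attribution, which is what keeps 30-day backfills free of double counting.
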
Non-functional
- Auditability & lineage: raw immutable storage, reproducible transformations, and change history.
- Security: restrict access to PII (email, card metadata) and comply with SOC 2 logging.
- Operational excellence: clear SLIs/SLOs, alerting, and runbooks.
Constraints
- Cloud: AWS. Warehouse: Snowflake already adopted.
- Streaming: Kafka already used for product events; team has moderate Spark experience.
- Budget: incremental infra spend capped at $35K/month.
- Assume Stripe API rate limits and occasional connector drift; you need a periodic reconciliation strategy (a sketch follows this list).
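
Reconciliation here typically means periodically comparing what Stripe says exists in a time window against what landed in the warehouse, then replaying the difference through the normal ingestion path so repairs stay idempotent. A minimal sketch of the comparison step, with both ID sources left abstract, might look like this:

```python
from typing import Iterable, Set, Tuple


def reconcile_invoice_ids(
    stripe_side: Iterable[str],
    warehouse_side: Iterable[str],
) -> Tuple[Set[str], Set[str]]:
    """Compare invoice ids seen by Stripe vs. the warehouse for one window.

    Returns (missing_in_warehouse, unexpected_in_warehouse). Missing ids would
    be re-fetched from the Stripe API (within rate limits) and replayed through
    the same deduplicating ingestion path as live webhooks.
    """
    stripe_ids, warehouse_ids = set(stripe_side), set(warehouse_side)
    return stripe_ids - warehouse_ids, warehouse_ids - stripe_ids


# Example: one invoice was dropped by the connector, one is a stale duplicate.
missing, unexpected = reconcile_invoice_ids(
    stripe_side={"in_001", "in_002", "in_003"},
    warehouse_side={"in_001", "in_003", "in_999"},
)
assert missing == {"in_002"} and unexpected == {"in_999"}
```
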
What You Need To Produce (in the interview)
- A complete end-to-end architecture (stream + batch where appropriate).
- Data model for canonical billing + event tables and the plan attribution approach (SCD2 / temporal joins).
- Orchestration plan (Airflow) and transformation strategy (dbt vs Spark).
- Data quality checks and how you handle schema evolution.
- Monitoring, alerting, and failure recovery (including replay/backfill).
Be explicit about idempotency keys, watermarking strategy, and how you guarantee that “refunds and disputes update historical revenue” without corrupting downstream aggregates.
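
One hedged way to frame that last point: keep raw facts immutable, and when refunds, disputes, or late events arrive, recompute only the affected date partitions from those facts rather than patching aggregates in place. The watermark/lateness cutoff and the rebuild_partition stub below are illustrative assumptions, not a prescribed design.

```python
from datetime import date, datetime, timedelta
from typing import Dict, Iterable, Set


def affected_partitions(
    late_events: Iterable[Dict],
    watermark: datetime,
    max_lateness: timedelta = timedelta(hours=72),
) -> Set[date]:
    """Pick the event-date partitions that must be rebuilt after late data lands.

    Events older than `watermark - max_lateness` are ignored here and routed to
    a slower backfill path instead, so the streaming job stays bounded.
    """
    cutoff = watermark - max_lateness
    return {
        datetime.fromisoformat(e["occurred_at"]).date()
        for e in late_events
        if datetime.fromisoformat(e["occurred_at"]) >= cutoff
    }


def rebuild_partition(partition_day: date) -> None:
    """Hypothetical stub: recompute one day of the revenue/usage mart from
    immutable raw facts (delete+insert or MERGE by partition), never by
    mutating aggregates in place."""
    ...


watermark = datetime(2024, 5, 3, 12, 0)
late = [{"occurred_at": "2024-05-02T09:30:00"}, {"occurred_at": "2024-04-01T10:00:00"}]
for day in sorted(affected_partitions(late, watermark)):
    rebuild_partition(day)  # idempotent: rerunning yields the same partition contents
```

Because a refund or dispute only changes the raw facts for the days it touches, rerunning the rebuild for those partitions restates historical revenue exactly once, no matter how many times the job is replayed.
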