Context
You’re interviewing for a Senior Data Engineer role on the Monetization Analytics team at StreamForge, a B2B SaaS video platform (think “Vimeo for enterprises”) with 12M monthly active end-users across its customers. StreamForge sells tiered subscriptions (Starter/Pro/Enterprise) and usage-based add-ons (transcoding minutes, storage overages). Billing is managed in Stripe.
Today, the company has two disconnected data worlds:
- Stripe billing data (customers, subscriptions, invoices, charges, refunds, disputes) is synced nightly via a vendor connector into Snowflake. Finance uses it for revenue reporting, but it’s a day behind (T+1) and often wrong for the current day due to late-arriving webhooks and backdated invoice adjustments.
- Product event logs (page views, feature usage, “export video”, “invite teammate”, etc.) flow through Kafka and land in S3, then are batch-loaded into Snowflake every 6 hours. Product analytics can’t reliably attribute usage to the correct paid plan at the time of the event.
Leadership wants a single, trustworthy “billing + product usage” dataset to answer questions like: “Which features drive upgrades within 7 days?”, “What is ARPA (average revenue per account) by cohort and usage intensity?”, and “Are we providing paid features to delinquent accounts?” The stakes are high: incorrect plan attribution is causing mispriced enterprise renewals and revenue leakage. Additionally, StreamForge is preparing for SOC 2 Type II and needs auditable lineage and access controls for billing data.
Scale Requirements
- Product events: avg 120K events/sec, peak 350K events/sec during webinars; ~8 TB/day raw JSON.
- Stripe events: ~3M webhook events/day (invoices, charges, payment_intent, customer updates), bursty (retries can spike 10x).
- Freshness SLA:
- Product events queryable in Snowflake: < 10 minutes P95
- Stripe billing state queryable: < 5 minutes P95 from webhook receipt
- Correctness SLA:
- End-to-end dedupe: exactly-once logical processing for Stripe webhooks
- Plan attribution correctness: >= 99.9% for events within the last 30 days
- Retention:
- Raw immutable logs: 13 months
- Curated analytics tables: indefinite
Data Characteristics
Stripe (webhooks + periodic reconciliation)
- Out-of-order delivery and retries; the same event may be delivered many times (see the dedupe sketch below).
- Late-arriving adjustments: invoices are finalized hours after usage; refunds/disputes can occur weeks later.
- Key entities: customer.id, subscription.id, invoice.id, charge.id, payment_intent.id
- Event envelope: event.id, event.type, created (Stripe timestamp)
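
Because Stripe retries deliveries, event.id can serve as the idempotency key: anything already processed is dropped. A minimal sketch of that dedupe step is below, assuming deduplication happens at ingestion; the in-memory `seen_event_ids` set and the payload shape are illustrative (in practice the set would be a durable dedupe table keyed on event_id).

```python
from typing import Dict, Iterable, Iterator, Set


def dedupe_stripe_events(
    events: Iterable[Dict],
    seen_event_ids: Set[str],
) -> Iterator[Dict]:
    """Yield each Stripe webhook event at most once, keyed on its event id.

    Retries and redeliveries reuse the same event id, so the id acts as the
    idempotency key. `seen_event_ids` would normally be backed by a durable
    store, not an in-memory set (an assumption made here for brevity).
    """
    for event in events:
        event_id = event["id"]  # Stripe event envelope: id, type, created
        if event_id in seen_event_ids:
            continue  # duplicate delivery -> drop
        seen_event_ids.add(event_id)
        yield event


# Example: two deliveries of the same event collapse to one logical record.
payloads = [
    {"id": "evt_001", "type": "invoice.finalized", "created": 1714000000},
    {"id": "evt_001", "type": "invoice.finalized", "created": 1714000000},  # retry
    {"id": "evt_002", "type": "charge.refunded", "created": 1714003600},
]
unique = list(dedupe_stripe_events(payloads, set()))
assert [e["id"] for e in unique] == ["evt_001", "evt_002"]
```
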
Product events (client + server)
- Semi-structured JSON with evolving schemas (see the normalization sketch after this list).
- Common fields: event_id, event_name, occurred_at, received_at, user_id, workspace_id, anonymous_id, properties.
- Identity is messy: users can belong to multiple workspaces, and a workspace maps to a Stripe customer via an internal CRM table that is sometimes updated late.
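
The evolving-schema point usually translates into a validate-and-normalize step at ingestion. The sketch below is one way to enforce the common fields while preserving unknown properties; the ProductEvent shape, the dead-letter behavior, and the ISO-8601 timestamp assumption are illustrative rather than a prescribed design.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any, Dict, Optional

REQUIRED_FIELDS = ("event_id", "event_name", "occurred_at", "received_at")


@dataclass
class ProductEvent:
    event_id: str
    event_name: str
    occurred_at: datetime
    received_at: datetime
    user_id: Optional[str] = None
    workspace_id: Optional[str] = None
    anonymous_id: Optional[str] = None
    properties: Dict[str, Any] = field(default_factory=dict)


def parse_event(raw: Dict[str, Any]) -> Optional[ProductEvent]:
    """Validate one raw JSON event against the enforced common fields.

    Returns None when required fields are missing (such events would be routed
    to a dead-letter location). Unknown keys inside `properties` are kept so
    that schema evolution never silently drops data.
    """
    if any(name not in raw for name in REQUIRED_FIELDS):
        return None
    return ProductEvent(
        event_id=raw["event_id"],
        event_name=raw["event_name"],
        # Assumes ISO-8601 timestamps, e.g. "2024-05-01T12:34:56+00:00".
        occurred_at=datetime.fromisoformat(raw["occurred_at"]),
        received_at=datetime.fromisoformat(raw["received_at"]),
        user_id=raw.get("user_id"),
        workspace_id=raw.get("workspace_id"),
        anonymous_id=raw.get("anonymous_id"),
        properties=raw.get("properties", {}),
    )


# Example: a well-formed "export video" event parses; a truncated one is rejected.
ok = parse_event({
    "event_id": "e-1", "event_name": "export video",
    "occurred_at": "2024-05-01T12:34:56+00:00",
    "received_at": "2024-05-01T12:35:02+00:00",
    "workspace_id": "ws-42", "properties": {"duration_s": 93},
})
assert ok is not None and ok.event_name == "export video"
assert parse_event({"event_name": "export video"}) is None
```
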
Requirements
Functional
- Ingest Stripe webhooks in near real time, deduplicate, and materialize canonical billing tables (customers, subscriptions, invoices, charges, refunds).
- Ingest product events in near real time and build a clean, query-friendly event table with enforced schema and dedupe.
- Build a time-valid plan attribution model: for each product event, attach the effective plan and billing status (trialing/active/past_due/canceled) as of the event time (see the temporal-join sketch after this list).
- Support late-arriving data and backfills:
- Stripe events arriving late/out-of-order
- Product events arriving up to 72 hours late (mobile offline)
- Reprocessing for the last 30 days without double counting
- Provide analytics-ready marts for:
- Upgrade funnel (feature usage → upgrade)
- Revenue + usage (MRR, refunds, disputes) joined to engagement
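
For the plan attribution requirement, a common approach is an SCD2 subscription dimension plus an as-of (temporal) join on event time; in the warehouse this would be a range join in dbt/SQL. The Python below only illustrates the lookup semantics, and the PlanVersion shape and plan names are assumptions.

```python
from bisect import bisect_right
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class PlanVersion:
    """One SCD2 row: the plan and billing status effective over [valid_from, valid_to)."""
    valid_from: datetime
    valid_to: datetime        # open-ended current rows can use datetime.max
    plan: str                 # e.g. starter / pro / enterprise
    billing_status: str       # trialing / active / past_due / canceled


def plan_as_of(history: List[PlanVersion], occurred_at: datetime) -> Optional[PlanVersion]:
    """Temporal (as-of) lookup: return the plan version in effect at event time.

    `history` must be sorted by valid_from with non-overlapping intervals,
    which is exactly what an SCD2 subscription dimension provides.
    """
    starts = [v.valid_from for v in history]
    idx = bisect_right(starts, occurred_at) - 1  # last version starting <= event time
    if idx < 0:
        return None                              # event predates any subscription
    version = history[idx]
    return version if occurred_at < version.valid_to else None


# Example: a workspace that upgraded from Starter to Pro on March 1st.
history = [
    PlanVersion(datetime(2024, 1, 1), datetime(2024, 3, 1), "starter", "active"),
    PlanVersion(datetime(2024, 3, 1), datetime.max, "pro", "active"),
]
assert plan_as_of(history, datetime(2024, 2, 15)).plan == "starter"
assert plan_as_of(history, datetime(2024, 4, 1)).plan == "pro"
```

Because the join key is the event time rather than load time, reprocessing late events against the same SCD2 history reproduces the same attribution, which is what keeps 30-day backfills free of double counting.
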
Non-functional
- Auditability & lineage: raw immutable storage, reproducible transformations, and change history.
- Security: restrict access to PII (email, card metadata) and comply with SOC 2 logging.
- Operational excellence: clear SLIs/SLOs, alerting, and runbooks.
Constraints
- Cloud: AWS. Warehouse: Snowflake already adopted.
- Streaming: Kafka already used for product events; team has moderate Spark experience.
- Budget: incremental infra spend capped at $35K/month.
- Assume Stripe API rate limits and occasional connector drift; you need a periodic reconciliation strategy (a sketch follows this list).
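
Reconciliation here typically means periodically comparing what Stripe says exists in a time window against what landed in the warehouse, then replaying the difference through the normal ingestion path so repairs stay idempotent. A minimal sketch of the comparison step, with both ID sources left abstract, might look like this:

```python
from typing import Iterable, Set, Tuple


def reconcile_invoice_ids(
    stripe_side: Iterable[str],
    warehouse_side: Iterable[str],
) -> Tuple[Set[str], Set[str]]:
    """Compare invoice ids seen by Stripe vs. the warehouse for one window.

    Returns (missing_in_warehouse, unexpected_in_warehouse). Missing ids would
    be re-fetched from the Stripe API (within rate limits) and replayed through
    the same deduplicating ingestion path as live webhooks.
    """
    stripe_ids, warehouse_ids = set(stripe_side), set(warehouse_side)
    return stripe_ids - warehouse_ids, warehouse_ids - stripe_ids


# Example: one invoice was dropped by the connector, one is a stale duplicate.
missing, unexpected = reconcile_invoice_ids(
    stripe_side={"in_001", "in_002", "in_003"},
    warehouse_side={"in_001", "in_003", "in_999"},
)
assert missing == {"in_002"} and unexpected == {"in_999"}
```
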
What You Need To Produce (in the interview)
- A complete end-to-end architecture (stream + batch where appropriate).
- Data model for canonical billing + event tables and the plan attribution approach (SCD2 / temporal joins).
- Orchestration plan (Airflow) and transformation strategy (dbt vs Spark).
- Data quality checks and how you handle schema evolution.
- Monitoring, alerting, and failure recovery (including replay/backfill).
Be explicit about idempotency keys, watermarking strategy, and how you guarantee that “refunds and disputes update historical revenue” without corrupting downstream aggregates.
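
One hedged way to frame that last point: keep raw facts immutable, and when refunds, disputes, or late events arrive, recompute only the affected date partitions from those facts rather than patching aggregates in place. The watermark/lateness cutoff and the rebuild_partition stub below are illustrative assumptions, not a prescribed design.

```python
from datetime import date, datetime, timedelta
from typing import Dict, Iterable, Set


def affected_partitions(
    late_events: Iterable[Dict],
    watermark: datetime,
    max_lateness: timedelta = timedelta(hours=72),
) -> Set[date]:
    """Pick the event-date partitions that must be rebuilt after late data lands.

    Events older than `watermark - max_lateness` are ignored here and routed to
    a slower backfill path instead, so the streaming job stays bounded.
    """
    cutoff = watermark - max_lateness
    return {
        datetime.fromisoformat(e["occurred_at"]).date()
        for e in late_events
        if datetime.fromisoformat(e["occurred_at"]) >= cutoff
    }


def rebuild_partition(partition_day: date) -> None:
    """Hypothetical stub: recompute one day of the revenue/usage mart from
    immutable raw facts (delete+insert or MERGE by partition), never by
    mutating aggregates in place."""
    ...


watermark = datetime(2024, 5, 3, 12, 0)
late = [{"occurred_at": "2024-05-02T09:30:00"}, {"occurred_at": "2024-04-01T10:00:00"}]
for day in sorted(affected_partitions(late, watermark)):
    rebuild_partition(day)  # idempotent: rerunning yields the same partition contents
```

Because a refund or dispute only changes the raw facts for the days it touches, rerunning the rebuild for those partitions restates historical revenue exactly once, no matter how many times the job is replayed.
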