Context
LedgerLoop, a SaaS finance platform, receives billing events from Stripe, Chargebee, and several regional payment providers. Today each provider is integrated through its own batch scripts, which produce inconsistent schemas, duplicate invoices, and delayed updates, leading to incorrect downstream tax calculations.
You need to design a unified ingestion and normalization pipeline that converts heterogeneous provider payloads into a canonical billing model consumed by an internal tax engine.
Scale Requirements
- Providers: 8 external billing providers initially, growing to 20 within 12 months
- Throughput: 2,500 events/sec peak during monthly renewals; ~1,400 events/sec sustained average (implied by the daily volume)
- Payload size: 3-15 KB per JSON webhook event
- Daily volume: ~120M billing events/day, ~700 GB of raw JSON/day
- Latency target: provider event to tax-engine-ready record in < 2 minutes P95
- Retention: raw immutable events for 7 years; normalized serving tables in hot storage for 24 months
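The scale figures above can be cross-checked with quick arithmetic. The ~6 KB average payload used here is an assumption (a rough midpoint of the 3-15 KB range), not a measured value:

```python
# Back-of-envelope check of the scale figures above.
events_per_day = 120_000_000
seconds_per_day = 24 * 60 * 60

# Sustained average rate implied by the daily volume.
avg_events_per_sec = events_per_day / seconds_per_day  # ~1,389 events/sec

# Raw storage per day, assuming a ~6 KB average payload (an assumption,
# not a measurement; actual events range from 3 to 15 KB).
avg_payload_kb = 6
raw_gb_per_day = events_per_day * avg_payload_kb / 1_000_000  # ~720 GB

print(round(avg_events_per_sec), round(raw_gb_per_day))
```

This confirms the daily volume and storage figures are mutually consistent, and that any sizing should be driven by the sustained rate plus the 2,500 events/sec renewal peak.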
Requirements
- Ingest both webhooks and periodic API backfills from providers such as Stripe and Chargebee.
- Normalize invoices, subscriptions, refunds, credits, disputes, and customer tax attributes into a canonical schema.
- Handle out-of-order delivery, retries, duplicate webhooks, and provider-specific schema/version changes.
- Guarantee idempotent processing so the tax engine never receives duplicate taxable events.
- Support replay/backfill for a single provider, merchant account, or date range without corrupting downstream state.
- Expose curated tables or topics for the internal tax engine and finance analytics.
- Implement data quality checks for missing currency, invalid country codes, negative invoice totals, and referential integrity between invoice and line items.
- Provide observability for freshness, error rates, schema drift, and reconciliation against provider APIs.
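Two of the requirements above, idempotent processing and data quality checks, can be sketched concretely. The canonical model, its field names, and the small ISO reference sets below are illustrative assumptions, not a prescribed schema:

```python
import hashlib
from dataclasses import dataclass

# Illustrative subsets; a real pipeline would load full ISO 4217 / 3166 tables.
ISO_CURRENCIES = {"USD", "EUR", "GBP"}
ISO_COUNTRIES = {"US", "DE", "GB"}

@dataclass(frozen=True)
class CanonicalBillingEvent:
    """Hypothetical canonical record produced by normalization."""
    provider: str            # e.g. "stripe", "chargebee"
    provider_event_id: str   # the provider's own event identifier
    event_type: str          # invoice, subscription, refund, credit, dispute
    merchant_account: str
    currency: str
    country: str
    total_minor_units: int   # amounts in minor units to avoid float rounding
    line_item_ids: tuple

    @property
    def idempotency_key(self) -> str:
        # Deterministic key: the same provider event always maps to the same
        # key, so retries and duplicate webhooks collapse into one taxable
        # record before reaching the tax engine.
        raw = f"{self.provider}:{self.provider_event_id}"
        return hashlib.sha256(raw.encode()).hexdigest()

def quality_issues(ev: CanonicalBillingEvent, known_line_items: set) -> list:
    """Return the data-quality violations listed in the requirements."""
    issues = []
    if ev.currency not in ISO_CURRENCIES:
        issues.append("missing_or_invalid_currency")
    if ev.country not in ISO_COUNTRIES:
        issues.append("invalid_country_code")
    # Refunds and credits may legitimately carry negative amounts.
    if ev.total_minor_units < 0 and ev.event_type not in ("refund", "credit"):
        issues.append("negative_invoice_total")
    if any(li not in known_line_items for li in ev.line_item_ids):
        issues.append("orphan_line_item_reference")
    return issues
```

A consumer would upsert on `idempotency_key` (or check it against a processed-keys store) so that duplicate webhooks become no-ops, and route any event with a non-empty `quality_issues` result to a quarantine topic rather than the tax engine.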
Constraints
- Primary cloud is AWS; existing platform already uses S3, Airflow, and Snowflake.
- PCI scope must remain minimal; avoid storing raw card details.
- Tax records are audit-sensitive, so raw events must be immutable and replayable.
- Team size is 5 data engineers; operational complexity should be reasonable.
- Budget target is <$35K/month incremental infrastructure cost.
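As one way to satisfy the immutability and replay constraints on the existing S3 footprint, raw events could be written once under a deterministic, partitioned key, with S3 Object Lock providing the write-once guarantee. The layout and names below are a sketch, not a prescribed design:

```python
from datetime import date

def raw_event_key(provider: str, merchant_account: str,
                  event_date: date, event_id: str) -> str:
    """Deterministic S3 key for an immutable raw event.

    Partitioning by provider, merchant, and date lets a replay job list
    exactly one prefix (a single provider, merchant account, or date range)
    instead of scanning seven years of history. Layout is illustrative.
    """
    return (
        f"raw/provider={provider}/merchant={merchant_account}/"
        f"dt={event_date:%Y-%m-%d}/{event_id}.json"
    )

# Replaying Stripe events for one merchant on one day means listing this
# prefix up to dt=... and reprocessing each object through normalization.
print(raw_event_key("stripe", "acct_1", date(2024, 3, 31), "evt_123"))
# raw/provider=stripe/merchant=acct_1/dt=2024-03-31/evt_123.json
```

Because the key is derived only from event identity, re-ingesting the same event overwrites (or no-ops against) the same object rather than creating a duplicate, which keeps replays from corrupting downstream state.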