Context
LedgerLoop, a SaaS finance platform, receives billing events from Stripe, Chargebee, and several regional payment providers. Today each provider is integrated through its own batch scripts, which produce inconsistent schemas, duplicate invoices, and delayed updates, leading to incorrect downstream tax calculations.
You need to design a unified ingestion and normalization pipeline that converts heterogeneous provider payloads into a canonical billing model consumed by an internal tax engine.
Scale Requirements
- Providers: 8 external billing providers initially, growing to 20 within 12 months
- Throughput: 2,500 events/sec peak during monthly renewals; ~1,400 events/sec sustained average (implied by the daily volume)
- Payload size: 3-15 KB per JSON webhook event
- Daily volume: ~120M billing events/day, ~700 GB of raw JSON/day
- Latency target: provider event to tax-engine-ready record in < 2 minutes P95
- Retention: raw immutable events for 7 years; normalized serving tables in hot storage for 24 months
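The scale figures above can be cross-checked with quick arithmetic. The ~6 KB average payload used here is an assumption (a rough midpoint of the 3-15 KB range), not a measured value:

```python
# Back-of-envelope check of the scale figures above.
events_per_day = 120_000_000
seconds_per_day = 24 * 60 * 60

# Sustained average rate implied by the daily volume.
avg_events_per_sec = events_per_day / seconds_per_day  # ~1,389 events/sec

# Raw storage per day, assuming a ~6 KB average payload (an assumption,
# not a measurement; actual events range from 3 to 15 KB).
avg_payload_kb = 6
raw_gb_per_day = events_per_day * avg_payload_kb / 1_000_000  # ~720 GB

print(round(avg_events_per_sec), round(raw_gb_per_day))
```

This confirms the daily volume and storage figures are mutually consistent, and that any sizing should be driven by the sustained rate plus the 2,500 events/sec renewal peak.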
Requirements
- Ingest both webhooks and periodic API backfills from providers such as Stripe and Chargebee.
- Normalize invoices, subscriptions, refunds, credits, disputes, and customer tax attributes into a canonical schema.
- Handle out-of-order delivery, retries, duplicate webhooks, and provider-specific schema/version changes.
- Guarantee idempotent processing so the tax engine never receives duplicate taxable events.
- Support replay/backfill for a single provider, merchant account, or date range without corrupting downstream state.
- Expose curated tables or topics for the internal tax engine and finance analytics.
- Implement data quality checks for missing currency, invalid country codes, negative invoice totals, and referential integrity between invoice and line items.
- Provide observability for freshness, error rates, schema drift, and reconciliation against provider APIs.
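Two of the requirements above, idempotent processing and data quality checks, can be sketched concretely. The canonical model, its field names, and the small ISO reference sets below are illustrative assumptions, not a prescribed schema:

```python
import hashlib
from dataclasses import dataclass

# Illustrative subsets; a real pipeline would load full ISO 4217 / 3166 tables.
ISO_CURRENCIES = {"USD", "EUR", "GBP"}
ISO_COUNTRIES = {"US", "DE", "GB"}

@dataclass(frozen=True)
class CanonicalBillingEvent:
    """Hypothetical canonical record produced by normalization."""
    provider: str            # e.g. "stripe", "chargebee"
    provider_event_id: str   # the provider's own event identifier
    event_type: str          # invoice, subscription, refund, credit, dispute
    merchant_account: str
    currency: str
    country: str
    total_minor_units: int   # amounts in minor units to avoid float rounding
    line_item_ids: tuple

    @property
    def idempotency_key(self) -> str:
        # Deterministic key: the same provider event always maps to the same
        # key, so retries and duplicate webhooks collapse into one taxable
        # record before reaching the tax engine.
        raw = f"{self.provider}:{self.provider_event_id}"
        return hashlib.sha256(raw.encode()).hexdigest()

def quality_issues(ev: CanonicalBillingEvent, known_line_items: set) -> list:
    """Return the data-quality violations listed in the requirements."""
    issues = []
    if ev.currency not in ISO_CURRENCIES:
        issues.append("missing_or_invalid_currency")
    if ev.country not in ISO_COUNTRIES:
        issues.append("invalid_country_code")
    # Refunds and credits may legitimately carry negative amounts.
    if ev.total_minor_units < 0 and ev.event_type not in ("refund", "credit"):
        issues.append("negative_invoice_total")
    if any(li not in known_line_items for li in ev.line_item_ids):
        issues.append("orphan_line_item_reference")
    return issues
```

A consumer would upsert on `idempotency_key` (or check it against a processed-keys store) so that duplicate webhooks become no-ops, and route any event with a non-empty `quality_issues` result to a quarantine topic rather than the tax engine.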
Constraints
- Primary cloud is AWS; existing platform already uses S3, Airflow, and Snowflake.
- PCI scope must remain minimal; avoid storing raw card details.
- Tax records are audit-sensitive, so raw events must be immutable and replayable.
- Team size is 5 data engineers; operational complexity should be reasonable.
- Budget target is <$35K/month incremental infrastructure cost.
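As one way to satisfy the immutability and replay constraints on the existing S3 footprint, raw events could be written once under a deterministic, partitioned key, with S3 Object Lock providing the write-once guarantee. The layout and names below are a sketch, not a prescribed design:

```python
from datetime import date

def raw_event_key(provider: str, merchant_account: str,
                  event_date: date, event_id: str) -> str:
    """Deterministic S3 key for an immutable raw event.

    Partitioning by provider, merchant, and date lets a replay job list
    exactly one prefix (a single provider, merchant account, or date range)
    instead of scanning seven years of history. Layout is illustrative.
    """
    return (
        f"raw/provider={provider}/merchant={merchant_account}/"
        f"dt={event_date:%Y-%m-%d}/{event_id}.json"
    )

# Replaying Stripe events for one merchant on one day means listing this
# prefix up to dt=... and reprocessing each object through normalization.
print(raw_event_key("stripe", "acct_1", date(2024, 3, 31), "evt_123"))
# raw/provider=stripe/merchant=acct_1/dt=2024-03-31/evt_123.json
```

Because the key is derived only from event identity, re-ingesting the same event overwrites (or no-ops against) the same object rather than creating a duplicate, which keeps replays from corrupting downstream state.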