Context
You’re interviewing with the Risk & Identity org at Stripe, supporting a new feature called Merchant Risk Controls. Merchants can configure rules (e.g., “block payments from high-risk countries”, “require 3DS for cards above $X”, “allowlist customer IDs”) and Stripe applies these rules in real time during authorization.
The product team now needs historical reporting for merchants and internal auditors: “What rules were active on March 3rd at 10:15 UTC?”, “How many payments were declined due to rule version V?”, and “Show a month-over-month trend of declines by rule category.” This is not just analytics—Stripe must support auditability for disputes and regulatory inquiries. The current system stores only the latest rule configuration per merchant in a transactional DB and emits payment decision events to Kafka.
Today’s data platform is AWS-based: Kafka → S3 data lake (raw) → Snowflake for analytics. Transformations run primarily in dbt, orchestrated by Airflow. The team has had incidents where late-arriving events and replays caused double-counting, so the new design must be idempotent and support backfills.
Scale Requirements
- Merchants: ~3M active merchants
- Payments: average 25K events/sec, peak 120K events/sec (global traffic spikes)
- Rule changes: average 2 changes/day/merchant for the top 50K merchants; the long tail is sparse
- Latency:
  - Payment decision reporting freshness: < 10 minutes end-to-end
  - Rule configuration history availability: < 30 minutes
- Retention:
  - Raw immutable events: 7 years (audit requirement)
  - Curated analytics tables: 2+ years online; older data can be archived but must be recoverable
Data Characteristics
Source systems
- Rules Service (Postgres)
  - Stores current rule configuration per merchant.
  - Emits change events (CDC) when rules are created/updated/deleted.
- Payment Decision Stream (Kafka)
  - One event per authorization attempt with decision outcome and reason codes.
Common issues you must handle
- Out-of-order events: CDC events can arrive late relative to payment decisions.
- Replays: Kafka topics may be replayed for incident recovery.
- Schema evolution: rule definitions change (new fields, renamed fields, new rule types).
- PII: some rule conditions may include customer identifiers; strict access controls apply.
Example payloads
Rule change event (CDC-like):
| field | example |
|---|---|
| merchant_id | "m_123" |
| rule_id | "r_abc" |
| event_type | "UPSERT" |
| rule_payload | JSON blob (conditions, thresholds, metadata) |
| valid_from_ts | "2026-01-10T12:00:00Z" (business effective time) |
| emitted_ts | "2026-01-10T12:00:03Z" |
| version | 17 |
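The rule_payload blob is described only as conditions, thresholds, and metadata. A hypothetical shape, purely illustrative (none of these fields are specified by the source system), might look like:

```json
{
  "rule_type": "AMOUNT_THRESHOLD",
  "action": "REQUIRE_3DS",
  "conditions": {
    "currency": "USD",
    "amount_gte_minor_units": 50000
  },
  "metadata": {
    "created_by": "user_456",
    "description": "Require 3DS for cards above $500"
  }
}
```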
Payment decision event:
| field | example |
|---|---|
| payment_id | "pi_789" |
| merchant_id | "m_123" |
| auth_ts | "2026-01-10T12:00:02Z" |
| decision | "DECLINE" |
| decline_reason | "RISK_RULE" |
| rule_id_applied | "r_abc" (nullable) |
| rule_version_applied | 17 (nullable) |
| event_id | UUID |
Your Task
Design a schema + ELT pipeline that enables accurate point-in-time historical reporting of rule configurations and their impact on payment decisions.
Functional requirements
- Historical rule state: Support queries like “show the full rule configuration as-of timestamp T” for any merchant.
- Point-in-time attribution: For each payment decision, attribute the decision to the correct rule version that was effective at auth_ts (not at arrival time); see the as-of join sketch after this list.
- Idempotent loads: Reprocessing the same Kafka ranges or re-running Airflow DAGs must not duplicate facts; see the dbt incremental sketch after this list.
- Late-arriving data: Correctly handle late CDC events (up to 7 days late) and late payment events (up to 24 hours late).
- Backfills: Support backfilling a full month of history (e.g., after a bug fix) without taking the warehouse down.
- Schema evolution: Allow new rule types/fields without breaking downstream models; maintain an audit trail (see the VARIANT staging sketch after this list).
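A minimal sketch of the point-in-time attribution join, assuming an SCD2-style rule history table with effective_from_ts/effective_to_ts columns; stg_payment_decisions, dim_rule_version, and the effective_* columns are assumed names, not existing objects:

```sql
-- Sketch only: table and column names are assumptions.
select
    p.payment_id,
    p.merchant_id,
    p.auth_ts,
    p.decision,
    p.decline_reason,
    r.version       as rule_version_effective,
    r.rule_payload
from stg_payment_decisions p
left join dim_rule_version r
  on  p.merchant_id     = r.merchant_id
  and p.rule_id_applied = r.rule_id
  -- pick the version effective at authorization time, not at event arrival time
  and p.auth_ts >= r.effective_from_ts
  and p.auth_ts <  coalesce(r.effective_to_ts, '9999-12-31'::timestamp_ntz)
```

Because the join is driven by auth_ts against the history table, re-running it after late CDC rows land re-attributes the affected payments instead of leaving them stale.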
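One way to combine idempotency, late-arrival handling, and backfills in a single dbt incremental model; the source name, the 2-day lookback, and the backfill_start/backfill_end vars are assumptions for illustration, not existing project conventions:

```sql
-- Sketch only: source, column, and var names are assumed.
-- Merging on event_id makes Kafka replays and Airflow re-runs idempotent.
{{
    config(
        materialized = 'incremental',
        incremental_strategy = 'merge',
        unique_key = 'event_id'
    )
}}

select
    event_id,
    payment_id,
    merchant_id,
    auth_ts,
    decision,
    decline_reason,
    rule_id_applied,
    rule_version_applied
from {{ source('raw', 'payment_decision_events') }}
where 1 = 1
{% if var('backfill_start', none) is not none %}
  -- targeted backfill window passed from Airflow, e.g.
  -- dbt run --vars '{backfill_start: "2026-01-01", backfill_end: "2026-02-01"}'
  and auth_ts >= '{{ var("backfill_start") }}'
  and auth_ts <  '{{ var("backfill_end") }}'
{% elif is_incremental() %}
  -- normal runs re-scan a 2-day trailing window to absorb payment events up to 24h late
  and auth_ts >= (select dateadd(day, -2, max(auth_ts)) from {{ this }})
{% endif %}
```

Because the merge key is event_id, reprocessing the same Kafka range or re-running a DAG over the same window rewrites the same rows rather than inserting duplicates.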
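For schema evolution, the raw rule_payload can land as an untyped VARIANT and be typed out only in staging. In the sketch below the extracted fields (rule_type, action, the amount threshold, created_by) are hypothetical and only show the pattern:

```sql
-- Sketch only: the source and the specific payload fields are assumptions.
select
    merchant_id,
    rule_id,
    version,
    event_type,
    valid_from_ts,
    emitted_ts,
    rule_payload,                                           -- full VARIANT kept for audit and reprocessing
    rule_payload:rule_type::string                          as rule_type,
    rule_payload:action::string                             as action,
    rule_payload:conditions.amount_gte_minor_units::number  as amount_threshold_minor_units,
    rule_payload:metadata.created_by::string                as created_by
from {{ source('raw', 'rule_change_events') }}
```

New rule types or fields then arrive without breaking the load; they only require new extractions (or downstream handling), while the stored payload preserves the full audit trail.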
Non-functional requirements
- Auditability: Changes must be traceable (who/what/when), and raw events must be immutable.
- Performance: Common merchant-level reports should run in < 5 seconds on Snowflake for a 30-day window (see the Snowflake DDL sketch after this list).
- Security & compliance: PII fields must be protected; enforce least privilege and support data deletion where applicable.
- Operational excellence: Monitoring, alerting, and runbooks for data delays, quality regressions, and cost spikes.
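As one illustration of the performance and PII points above; the table, column, and role names (fct_payment_decisions, dim_rule_version, customer_id_list, ROLE_RISK_AUDIT) are assumptions, not part of the spec:

```sql
-- Sketch only: object and role names are assumed.
-- Cluster the payment fact so single-merchant, 30-day reports prune micro-partitions well.
alter table fct_payment_decisions
  cluster by (merchant_id, to_date(auth_ts));

-- Mask customer identifiers embedded in rule conditions for everyone except audit roles.
create masking policy customer_id_mask as (val string) returns string ->
  case when current_role() in ('ROLE_RISK_AUDIT') then val else '****' end;

alter table dim_rule_version
  modify column customer_id_list set masking policy customer_id_mask;
```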
Constraints
- Must use existing stack: Kafka, S3, Snowflake, dbt, Airflow.
- No new paid vendor tools this quarter (open-source ok).
- Team is 5 data engineers; on-call load is high, so the design should minimize operational complexity.
- Some merchants are on EU residency: raw and curated data must be stored/processed in-region (assume separate Snowflake accounts or databases).
Interview Prompts (what you should cover)
- What tables would you create (raw, staging, curated)? What are the keys?
- Would you model rule history as SCD Type 2, event sourcing, or both? Why?
- How do you join payments to the correct rule version with late/out-of-order events?
- How do you implement incremental processing and backfills in dbt/Airflow?
- What data quality checks are mandatory (and where do they run)?
- How do you optimize for Snowflake query performance and cost?