Context
You’re interviewing with the Revenue Data Platform team at PayPilot, a B2B fintech that provides subscription billing and embedded payments for ~12,000 merchants. PayPilot processes ~$4B/year in payment volume and is preparing for a SOC 2 Type II renewal and a PCI scope review. Finance and Sales Ops rely on analytics in Snowflake to reconcile revenue, compute commissions, and detect fraud and refund anomalies.
Today, PayPilot ingests data from Stripe (payments, disputes, refunds) and Salesforce (accounts, opportunities, contracts) into Snowflake. The current approach is a mix of ad-hoc scripts and a partially configured ELT tool. Data freshness is inconsistent (sometimes 2–12 hours behind), and the team has had multiple incidents where late-arriving refunds caused revenue dashboards to be wrong for days. The CFO has mandated a measurable SLA: 99.9% of days must have complete, reconciled revenue by 8:00 AM UTC.
The VP of Data is considering standardizing ingestion on Fivetran for managed connectors, but the staff engineers argue that a custom Python connector (deployed on Kubernetes, orchestrated by Airflow) would be cheaper and more controllable, especially for edge cases such as Stripe event backfills, custom Salesforce objects, and strict audit logging.
Current Architecture (simplified)
- Snowflake is the central warehouse (raw + curated schemas).
- Transformations are in dbt, run by Airflow.
- Ingestion is currently:
  - Stripe: a Python script pulling REST endpoints nightly, writing JSON to S3, then COPY INTO Snowflake.
  - Salesforce: a legacy ETL job using the Bulk API weekly + incremental daily.
- Data quality checks are minimal (row counts only) and there’s no robust idempotency strategy.
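As a hint of the kind of concreteness expected: one common idempotency building block (a sketch, not a prescribed design — the `batch_id` helper is hypothetical) is a deterministic batch ID, so a retried load of the same time window can be detected and skipped instead of double-loading rows.

```python
import hashlib

def batch_id(source: str, endpoint: str, window_start: str, window_end: str) -> str:
    """Deterministic ID for an ingestion batch: the same (source, endpoint,
    time window) always hashes to the same ID, so a retried run can check
    whether this batch was already loaded before writing again."""
    key = f"{source}|{endpoint}|{window_start}|{window_end}"
    return hashlib.sha256(key.encode()).hexdigest()[:16]

# A retry of the same window produces the same ID:
first = batch_id("stripe", "charges", "2024-01-01T00:00Z", "2024-01-01T01:00Z")
retry = batch_id("stripe", "charges", "2024-01-01T00:00Z", "2024-01-01T01:00Z")
assert first == retry
```

Loads keyed this way can be recorded in a control table, making "did we already load this window?" a single lookup.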
Scale Requirements
- Stripe
  - ~45M charges/month, 2.5M refunds/month, 150K disputes/month
  - Peak webhook/event activity: 2,000 events/sec during high-volume merchant sales
  - Late-arriving updates: disputes can change state up to 90 days later
- Salesforce
  - ~8M Account/Contact records, 30M OpportunityLineItems, plus custom objects for contract terms
  - Updates are bursty at end of quarter (10× normal)
- Freshness targets
  - Stripe payments/refunds/disputes: < 10 minutes to raw tables
  - Salesforce core objects: < 60 minutes
  - Curated marts (dbt): < 30 minutes after raw is updated
- Storage & retention
  - Keep raw change history for 18 months for audit and investigations
  - Maintain an immutable audit log of ingestion actions for 7 years
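A quick back-of-the-envelope from the numbers above (worth doing before any tooling decision), counting only the Stripe entities listed:

```python
# Rough raw-row volume for the 18-month retention window (Stripe only;
# each number comes from the scale requirements above).
charges_per_month = 45_000_000
refunds_per_month = 2_500_000
disputes_per_month = 150_000
months_retained = 18

stripe_rows = (charges_per_month + refunds_per_month + disputes_per_month) * months_retained
print(f"{stripe_rows:,}")  # 857,700,000 raw Stripe rows over 18 months
```

Roughly 860M raw Stripe rows over the retention window — modest for Snowflake storage, but large enough that backfills and full-table rescans need to be deliberate.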
Data Characteristics
Key entities and example schemas

stripe.charges_raw (append-only)
  - charge_id (string, PK)
  - customer_id (string)
  - amount (integer)
  - currency (string)
  - status (string)
  - created_at (timestamp)
  - updated_at (timestamp)
  - ingested_at (timestamp)
  - payload (variant/json)

stripe.refunds_raw (append-only)
  - refund_id, charge_id, amount, status, created_at, updated_at, ingested_at, payload

salesforce.opportunity_raw (SCD-ish)
  - opportunity_id (string)
  - last_modified_date (timestamp)
  - is_deleted (boolean)
  - payload (variant/json)
  - ingested_at (timestamp)
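To make the charges_raw shape concrete: an ingestion step would project a Stripe charge payload onto these columns while keeping the full payload for audit/replay. This is an illustrative sketch — `to_charges_raw_row` is hypothetical, and the payload field names are an assumption about the API shape, not a spec.

```python
import json
from datetime import datetime, timezone

def to_charges_raw_row(payload: dict) -> dict:
    """Project a charge payload onto the charges_raw columns, keeping the
    full payload as JSON (the variant column) so nothing is lost even if
    the extracted columns change later."""
    return {
        "charge_id": payload["id"],
        "customer_id": payload.get("customer"),
        "amount": payload["amount"],
        "currency": payload["currency"],
        "status": payload["status"],
        "created_at": payload["created"],
        "updated_at": payload.get("updated", payload["created"]),
        "ingested_at": datetime.now(timezone.utc).isoformat(),
        "payload": json.dumps(payload),
    }
```

Keeping the whole payload alongside the projected columns is what makes schema drift survivable: new fields land in the variant column immediately and can be promoted to real columns later.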
Common quality issues
- Late-arriving updates: refunds/disputes update long after creation.
- Soft deletes: Salesforce records can be deleted or merged.
- API throttling: Stripe and Salesforce rate limits vary; backfills can trigger throttles.
- Schema drift: Salesforce fields added/renamed; Stripe payloads evolve.
- Duplicates: retries can cause duplicate ingestion unless idempotent.
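The duplicates problem has a standard shape regardless of build-vs-buy: collapse repeated ingestions of the same record, keeping the newest version. A minimal sketch (in Snowflake this is typically a `QUALIFY ROW_NUMBER() = 1` over the key, ordered by the version column):

```python
def dedupe_latest(rows, key="charge_id", version="updated_at"):
    """Collapse duplicate ingestions of the same record, keeping the most
    recently updated version of each key."""
    latest = {}
    for row in rows:
        k = row[key]
        if k not in latest or row[version] > latest[k][version]:
            latest[k] = row
    return list(latest.values())

rows = [
    {"charge_id": "a", "updated_at": 1},
    {"charge_id": "a", "updated_at": 2},  # retry delivered a newer version
    {"charge_id": "b", "updated_at": 1},
]
assert len(dedupe_latest(rows)) == 2
```

Note this requires a reliable version column; for sources without one, the ingestion timestamp or an event sequence number has to stand in.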
Your Task
Design an ingestion strategy and make a recommendation: use Fivetran, build a custom Python connector, or a hybrid. Your answer should be concrete and production-oriented.
1) Functional requirements
- Ingest Stripe + Salesforce into Snowflake with the freshness targets above.
- Support incremental loads and backfills (including 90-day dispute state changes).
- Ensure idempotency and correct handling of retries (no double-counting).
- Capture deletes and maintain change history suitable for audits.
- Provide data quality validation beyond row counts (schema, freshness, reconciliation).
- Enable lineage and observability: know what ran, what changed, and why.
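The incremental-load and retry requirements above reduce to cursor discipline: resume from the last persisted cursor, page to exhaustion, and persist the new cursor atomically with the load. A generic sketch — `fetch_page` is a hypothetical stand-in for a Stripe or Salesforce API call, not a real client method:

```python
def incremental_sync(fetch_page, cursor=None):
    """Cursor-driven incremental pull: start from the last saved cursor,
    page until the source reports no further cursor, and return the final
    cursor so the next run resumes exactly where this one stopped.
    fetch_page(cursor) -> (records, next_cursor), next_cursor=None at end."""
    all_records = []
    while True:
        records, next_cursor = fetch_page(cursor)
        all_records.extend(records)
        if next_cursor is None:
            break
        cursor = next_cursor
    return all_records, cursor
```

The key operational property: if the run dies mid-way, re-running from the last *committed* cursor re-reads at most one page, and the dedup step absorbs the overlap — no gaps, no double-counting.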
2) Non-functional requirements
- Reliability: meet the CFO SLA (99.9% of days complete and reconciled by 8:00 AM UTC).
- Security/compliance: SOC 2 evidence, least-privilege, secrets management, PII handling.
- Cost: keep incremental platform cost under $35K/month (tooling + compute).
- Operability: on-call should be able to diagnose issues within 15 minutes.
3) Constraints
- Team: 5 data engineers, 1 SRE shared across org. Strong Python/SQL, moderate Kubernetes, limited Kafka.
- Existing stack must remain: Snowflake + dbt + Airflow.
- No new vendor that stores raw PII outside PayPilot’s cloud account unless there’s a compelling compliance story.
- Finance requires an audit trail: when data was fetched, from which API cursor, and what was loaded.
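One way to make Finance's audit-trail requirement "immutable" in practice (a sketch under assumed requirements, not a mandated design) is a hash-chained log: each entry records what was fetched, from which cursor, and how many rows were loaded, and embeds the hash of its predecessor, so any later edit breaks the chain.

```python
import hashlib
import json

def append_audit_entry(log, entry):
    """Append an ingestion audit record to a hash-chained log. Each entry
    embeds the previous entry's hash, so tampering with any historical
    record invalidates every hash after it."""
    prev_hash = log[-1]["entry_hash"] if log else "genesis"
    record = {**entry, "prev_hash": prev_hash}
    record["entry_hash"] = hashlib.sha256(
        json.dumps(record, sort_keys=True).encode()
    ).hexdigest()
    log.append(record)
    return record

def verify_chain(log):
    """Recompute every hash and check each entry points at its predecessor."""
    for i, rec in enumerate(log):
        expected_prev = log[i - 1]["entry_hash"] if i else "genesis"
        body = {k: v for k, v in rec.items() if k != "entry_hash"}
        recomputed = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
        if rec["prev_hash"] != expected_prev or rec["entry_hash"] != recomputed:
            return False
    return True
```

Written to append-only storage (e.g. an S3 bucket with object lock), this satisfies a 7-year retention requirement without trusting the warehouse itself to be immutable.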
Discussion Prompts (what we’re evaluating)
- Trade-offs: Compare Fivetran vs custom Python connector across time-to-value, ongoing maintenance, schema evolution, backfills, rate limits, security, and total cost of ownership.
- Incremental strategy: How would you implement CDC-like behavior for Stripe (event-driven vs polling) and Salesforce (Bulk API vs REST vs Change Data Capture)?
- Data modeling: How would you structure raw vs history vs current-state tables in Snowflake to support audits and easy analytics?
- Orchestration: How would Airflow coordinate ingestion + dbt + quality checks, including dependency management and reruns?
- Quality and reconciliation: How do you detect missing refunds, mismatched totals, or stale Salesforce snapshots?
- Backfill plan: How do you safely backfill 18 months without blowing API quotas or warehouse costs?
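For the backfill prompt, the usual starting point (a sketch, with an arbitrary 7-day window size) is to split the 18 months into small date windows that are fetched one at a time: each window stays under API page limits, can be throttled and retried independently, and loads as its own idempotent batch.

```python
from datetime import date, timedelta

def backfill_windows(start: date, end: date, days_per_window: int = 7):
    """Split [start, end) into consecutive date windows so a long backfill
    becomes many small, independently retryable fetch-and-load jobs."""
    windows = []
    cursor = start
    while cursor < end:
        window_end = min(cursor + timedelta(days=days_per_window), end)
        windows.append((cursor, window_end))
        cursor = window_end
    return windows

# An 18-month backfill becomes ~79 weekly jobs that can be rate-limited
# and spread across off-peak hours:
windows = backfill_windows(date(2023, 1, 1), date(2024, 7, 1))
```

Pacing those windows (e.g. N per hour) is what keeps the backfill under API quotas, and loading each into its own batch keeps warehouse compute bounded and reruns cheap.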
Be explicit about what you would build, what you would buy, and what you would monitor. Assume you need to present your recommendation to the VP of Data and the CFO.