Context
You’re joining the Risk & Growth Analytics org at a mid-stage fintech that offers consumer credit card and BNPL (buy now, pay later) products in the US. The company has 12M cardholders, ~1.8M DAUs, and processes $4B/month in purchase volume. Decisions like credit line increases, fraud monitoring, and marketing attribution depend on accurate, fresh data.
Today, the analytics stack is a mix of bespoke Python ETL scripts, ad-hoc SQL, and a few vendor-managed connectors. Data lands inconsistently in Snowflake, transformations are hard to reproduce, and metric definitions drift across teams. The VP of Data wants to standardize on a modern data stack: Fivetran for ingestion and dbt for transformations, orchestrated by Airflow, with a strong emphasis on data quality, idempotency, and backfills.
Current state and pain points
- Sources:
- Postgres (core ledger + user accounts), ~3 TB total, high write rate
- Salesforce (support + CRM)
- Stripe (payments)
- Segment (product events)
- Warehouse: Snowflake, but tables are a mix of raw and curated with inconsistent naming.
- Freshness: Some tables update every 6 hours; others are near real-time; no clear SLAs.
- Quality issues:
- Ledger adjustments (chargebacks, reversals) can arrive up to 14 days after the original transaction.
- Duplicate events from Segment retries.
- Schema changes in Salesforce break downstream dashboards.
- Business stakes:
- Incorrect delinquency metrics can trigger regulatory reporting errors.
- Marketing attribution errors can waste $500K+/month in spend.
Your task is to design a complete, production-grade ingestion and transformation pipeline using Fivetran + dbt.
Scale requirements
- Postgres CDC: peak 8K row changes/sec, average 2K/sec
- Segment events: peak 120K events/sec, average 40K/sec, JSON payloads ~1–2 KB
- Salesforce: ~8M objects total, API limits apply; incremental sync required
- Snowflake:
- Raw data retention: 18 months
- Curated dimension + fact models: retained indefinitely
- Latency SLAs:
- Ledger + transactions marts: P95 end-to-end latency (source commit to queryable mart) < 15 minutes
- Marketing attribution marts: P95 end-to-end latency < 60 minutes
- Backfills: must support reprocessing 90 days of data without manual SQL edits (see the sketch below)
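
One common way to satisfy the no-manual-SQL-edits backfill requirement is to drive every incremental filter from a dbt variable via a shared macro. A minimal sketch; `incremental_window` and `backfill_days` are hypothetical names, not an existing dbt feature:

```sql
-- macros/incremental_window.sql (hypothetical macro; names are illustrative)
{% macro incremental_window(ts_column, default_lookback_days=3) %}
    {% if is_incremental() %}
    -- Normal runs re-scan a short overlap window; a 90-day backfill just
    -- widens it from the CLI:  dbt run --vars '{backfill_days: 90}'
    where {{ ts_column }} >= (
        select dateadd(
            'day',
            -{{ var('backfill_days', default_lookback_days) }},
            max({{ ts_column }})
        )
        from {{ this }}
    )
    {% endif %}
{% endmacro %}
```

Each incremental model then calls `{{ incremental_window('updated_at') }}` instead of a hand-written WHERE clause, so backfills never require editing model SQL.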
Data characteristics
Example raw schemas (simplified)
Postgres: ledger_transactions (CDC)
| column | type | notes |
|---|---|---|
| transaction_id | string | PK |
| user_id | string | |
| amount_cents | int | |
| currency | string | USD |
| status | string | authorized/settled/reversed |
| updated_at | timestamp | mutable; late updates |
| event_time | timestamp | business time |
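
The mutable `updated_at` plus 14-day-late adjustments make this table the textbook case for an incremental merge with a lookback window. A minimal dbt sketch, assuming a Fivetran-landed source named `fivetran_postgres.ledger_transactions` (names are illustrative):

```sql
-- models/staging/stg_ledger_transactions.sql (names are illustrative)
{{ config(
    materialized         = 'incremental',
    unique_key           = 'transaction_id',
    incremental_strategy = 'merge'
) }}

select
    transaction_id,
    user_id,
    amount_cents,
    currency,
    status,
    updated_at,
    event_time
from {{ source('fivetran_postgres', 'ledger_transactions') }}

{% if is_incremental() %}
-- Re-scan a 14-day window so late chargebacks/reversals are picked up;
-- the merge on transaction_id overwrites old versions instead of
-- appending them, so totals are never double counted.
where updated_at >= (select dateadd('day', -14, max(updated_at)) from {{ this }})
{% endif %}

-- CDC can deliver several versions of the same row in one batch:
-- keep only the most recent version.
qualify row_number() over (
    partition by transaction_id
    order by updated_at desc
) = 1
```

Downstream fact models should read from this deduplicated staging layer, never from raw CDC output.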
Segment: track events
| field | type | notes |
|---|---|---|
| messageId | string | event id; intended to be unique, but duplicated in practice by retries |
| userId | string | may be null |
| anonymousId | string | fallback |
| event | string | e.g., "Card Activated" |
| timestamp | timestamp | client time; can be skewed |
| receivedAt | timestamp | server receipt time |
| properties | variant/json | nested |
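
Since `messageId` is not reliably unique, deduplication belongs in the first staging model. A sketch, assuming the connector snake_cases the fields above (`message_id`, `received_at`, ...) into a `tracks` table; adjust to whatever actually lands in raw:

```sql
-- models/staging/stg_segment_events.sql (hypothetical names)
{{ config(materialized='incremental', unique_key='message_id') }}

with source as (
    select
        message_id,
        coalesce(user_id, anonymous_id) as user_key,   -- anonymousId fallback
        event                           as event_name,
        timestamp                       as event_time, -- client clock; skewed
        received_at,                                   -- server receipt time
        properties
    from {{ source('segment', 'tracks') }}
    {% if is_incremental() %}
    -- small overlap catches events that land out of order
    where received_at >= (select dateadd('hour', -3, max(received_at)) from {{ this }})
    {% endif %}
)

select *
from source
-- retries reuse message_id: keep the earliest-received copy
qualify row_number() over (partition by message_id order by received_at) = 1
```

The merge on `message_id` also makes re-runs idempotent: a duplicate arriving in a later sync overwrites its earlier copy instead of adding a second row.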
Requirements
Functional requirements
- Ingest Postgres, Salesforce, Stripe, and Segment into Snowflake using Fivetran with clear raw-layer conventions.
- Build a dbt project that produces:
- A ledger fact table suitable for finance and risk
- A user dimension with slowly changing attributes (at least Type 2 for key fields; see the snapshot sketch after this list)
- A marketing attribution mart joining Segment + Stripe + campaigns
- Handle late-arriving updates (up to 14 days) and mutable records without double counting.
- Implement deduplication for Segment events and idempotent incremental loads.
- Provide a backfill strategy for 90 days and a safe way to re-run models (the var-driven window sketched under Scale requirements is one option).
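
For the Type 2 user dimension, dbt snapshots are the stock ELT answer. A sketch assuming a Fivetran-landed `users` table with an `updated_at` cursor; column names are illustrative:

```sql
-- snapshots/users_snapshot.sql (hypothetical table and column names)
{% snapshot users_snapshot %}

{{ config(
    target_schema = 'snapshots',
    unique_key    = 'user_id',
    strategy      = 'timestamp',
    updated_at    = 'updated_at'
) }}

-- dbt adds dbt_valid_from / dbt_valid_to on each change, which is exactly
-- the Type 2 history the user dimension needs.
select
    user_id,
    email,
    credit_line_cents,
    account_status,
    updated_at
from {{ source('fivetran_postgres', 'users') }}

{% endsnapshot %}
```

A `dim_users` model can then derive `is_current` as `dbt_valid_to is null` and expose both current and as-of views.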
Non-functional requirements
- Define data quality checks (dbt tests + custom checks) with clear ownership and escalation paths (see the test sketch after this list).
- Provide observability: freshness, volume anomalies, schema drift, and lineage.
- Ensure cost control in Snowflake (warehouse sizing, incremental models, clustering strategy).
- Support SOX-like auditability: reproducible transformations, version control, and change reviews.
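
Generic tests (`unique`, `not_null`, `accepted_values` on `status`) cover the basics from YAML; anything bespoke fits dbt's singular tests, where any returned row is a failure. A sketch against a hypothetical `fct_ledger_transactions` mart:

```sql
-- tests/assert_no_double_counted_transactions.sql (hypothetical mart name)
-- Fails loudly if the merge/dedup logic ever regresses and a transaction
-- contributes more than one row to the ledger fact.
select
    transaction_id,
    count(*) as version_count
from {{ ref('fct_ledger_transactions') }}
group by transaction_id
having count(*) > 1
```

Tagging tests with `meta` properties in YAML gives each check a named owner for escalation.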
Constraints
- The company is standardized on Snowflake and Airflow 2.x.
- The team has strong SQL skills but limited Spark/Kafka expertise; prefer ELT where possible.
- Budget target: <$60K/month incremental platform cost (Snowflake compute + Fivetran + orchestration).
- Compliance: PII must be masked in non-prod; production access is role-based; deletions must propagate within 72 hours.
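
For the non-prod PII requirement, Snowflake's dynamic masking policies are one fit. A sketch; the role names, table, and column are placeholders for the real RBAC design:

```sql
-- Hypothetical policy: PROD_ANALYST sees cleartext, everyone else a hash.
create or replace masking policy pii_email_mask as (val string)
returns string ->
    case
        when current_role() in ('PROD_ANALYST') then val
        else sha2(val)  -- deterministic hash keeps joins usable in non-prod
    end;

alter table analytics.dim_users
    modify column email
    set masking policy pii_email_mask;
```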
What to deliver (in the interview)
- A target architecture using Fivetran + dbt + Snowflake + Airflow.
- How you would model raw/staging/marts in dbt (naming, schemas, incremental strategy).
- How you’d handle late-arriving data and mutable facts.
- Concrete monitoring/alerting and failure recovery.
- Example dbt model(s) and Airflow orchestration code.