Context
You’re interviewing with the Revenue Operations Data Platform team at a global B2B SaaS company that sells security software to mid-market and enterprise customers. The company runs its go-to-market motion in Salesforce (SFDC): Leads → Contacts/Accounts → Opportunities → Quotes → Contracts. Finance and Sales Ops rely on a Snowflake warehouse for forecasting, pipeline coverage, commissions, and renewal analytics.
Today, a legacy integration (a mix of ad-hoc Python scripts and a managed connector) lands daily CSV exports in S3 and then loads them into Snowflake. The business is unhappy: dashboards are often 12–24 hours stale, sales reps complain that “Closed Won” deals don’t show up in commission previews until the next day, and Finance has found inconsistencies where an Opportunity’s Amount differs between Salesforce and Snowflake due to missed updates and partial loads. Leadership wants a near-real-time, auditable, and backfillable synchronization pipeline.
Current Architecture (and pain points)
| Layer | Current approach | Pain point |
|---|---|---|
| Extraction | Daily SFDC bulk export | Misses intra-day updates; hard to capture deletes |
| Landing | S3 CSV dumps | No schema enforcement; duplicates; hard to replay |
| Warehouse load | Snowflake COPY into raw tables | Not idempotent; partial loads create drift |
| Transformations | Some SQL scripts | No lineage/testing; manual reruns |
| Orchestration | Cron + manual | No SLAs, retries, or alerting |
The team wants you to propose a production-grade design to synchronize Salesforce objects into Snowflake with strong correctness guarantees.
Scale Requirements
- Salesforce org size: ~2,500 internal users, heavy automation (Flows, Apex triggers)
- Objects to sync (initial scope): Account, Contact, Lead, Opportunity, OpportunityLineItem, User, Task, Event
- Data volume:
  - Opportunities: ~8M historical, ~80K updates/day
  - Tasks/Events (activities): ~500M historical, ~5M new/day
- Freshness targets (a lag-check sketch follows this list):
  - Core revenue objects (Opportunity, Line Items): P95 < 10 minutes from SFDC commit to queryable in Snowflake
  - Activities: < 60 minutes acceptable
- Storage: 5+ years retention in Snowflake; raw history must be preserved for audit
- API constraints: Must respect Salesforce API limits (daily and concurrent), and handle throttling gracefully
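To make the freshness targets measurable, a scheduled lag check could compare the SFDC commit timestamp to the warehouse load time. A minimal sketch in Snowflake SQL; the table and column names (`raw.sfdc.opportunity_events`, `systemmodstamp`, `_loaded_at`) are assumptions introduced here, not part of the prompt:

```sql
-- Hypothetical P95 end-to-end lag for Opportunities over the last hour,
-- comparing the SFDC SystemModstamp to the warehouse ingestion timestamp.
SELECT
    APPROX_PERCENTILE(DATEDIFF('second', systemmodstamp, _loaded_at), 0.95) / 60
        AS p95_lag_minutes
FROM raw.sfdc.opportunity_events
WHERE _loaded_at >= DATEADD('hour', -1, CURRENT_TIMESTAMP());
```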
Data Characteristics & Quality Issues
- Late-arriving and out-of-order updates: Salesforce updates can arrive delayed due to async automation; Bulk API extracts can return records not strictly ordered by commit time.
- Deletes: Hard deletes and soft deletes (IsDeleted) must be represented in Snowflake.
- Schema evolution: Fields get added/renamed; picklists change; custom objects/fields may appear.
- Duplicates / replays: Retries and backfills must not double-apply changes (see the landing-table sketch after this list).
- PII & compliance: Contacts/Leads contain PII; access must be controlled and changes audited.
Assume you can use Salesforce Change Data Capture (CDC) for supported objects, and fall back to Bulk API 2.0 for large backfills and objects not covered by CDC.
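Because CDC events and Bulk API backfill rows would then share one append-only table, reconciling the two reduces to picking a single winner per record deterministically. A minimal sketch, assuming the hypothetical landing table above:

```sql
-- Hypothetical "latest version wins" selection: order by SystemModstamp, breaking ties
-- with the CDC replay id and load time, so replays and out-of-order arrivals converge.
SELECT *
FROM raw.sfdc.opportunity_events
QUALIFY ROW_NUMBER() OVER (
    PARTITION BY id
    ORDER BY systemmodstamp DESC, _replay_id DESC NULLS LAST, _loaded_at DESC
) = 1;
```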
Your Task
Design an end-to-end synchronization pipeline between Salesforce and Snowflake.
Functional Requirements
- Ingest changes continuously for supported objects using SFDC CDC (or equivalent), including inserts, updates, undeletes, and deletes.
- Backfill historical data for each object (multi-year) and support re-backfills (e.g., last 30 days) without corrupting downstream tables.
- Maintain raw, immutable history (append-only) and produce current-state tables for analytics.
- Implement idempotent upserts into Snowflake so retries do not create duplicates or regress state (see the MERGE sketch after this list).
- Handle late-arriving updates and ensure the latest record version wins deterministically.
- Support schema changes with minimal manual intervention and clear failure modes.
- Provide data quality checks (row counts, freshness, null/uniqueness constraints, referential sanity like Opportunity.AccountId exists).
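One way to meet the idempotency and latest-wins requirements above is a guarded MERGE from the deduplicated raw history into a current-state table. This is a sketch under assumed names (`analytics.sfdc.opportunity_current` plus the hypothetical landing table from earlier) and a deliberately reduced column list:

```sql
-- Hypothetical idempotent upsert: the source side is deduplicated to one row per Id,
-- and the SystemModstamp guard prevents a replayed older version from regressing state.
MERGE INTO analytics.sfdc.opportunity_current AS tgt
USING (
    SELECT *
    FROM raw.sfdc.opportunity_events
    QUALIFY ROW_NUMBER() OVER (
        PARTITION BY id
        ORDER BY systemmodstamp DESC, _replay_id DESC NULLS LAST
    ) = 1
) AS src
ON tgt.id = src.id
WHEN MATCHED AND src.systemmodstamp >= tgt.systemmodstamp THEN UPDATE SET
    amount         = src.payload:Amount::NUMBER,
    stage_name     = src.payload:StageName::STRING,
    is_deleted     = src.is_deleted,
    systemmodstamp = src.systemmodstamp
WHEN NOT MATCHED THEN INSERT (id, amount, stage_name, is_deleted, systemmodstamp)
VALUES (src.id, src.payload:Amount::NUMBER, src.payload:StageName::STRING,
        src.is_deleted, src.systemmodstamp);
```

Re-running the statement with the same inputs leaves the target unchanged, which is what makes retries and re-backfills safe.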
Non-Functional Requirements
- Reliability: No silent data loss; at-least-once ingestion is acceptable if warehouse application is idempotent.
- Observability: Clear SLAs, lineage, and alerting for lag, load failures, and data drift.
- Security: PII controls (masking/row access policies), encrypted transit/at rest, and least-privilege SFDC credentials (a masking-policy sketch follows this list).
- Cost: Keep incremental platform cost under $35k/month; avoid always-on large Snowflake warehouses.
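For the PII control in particular, Snowflake column masking policies are one option. A sketch with assumed role and table names (`PII_READER`, `analytics.sfdc.contact_current`):

```sql
-- Hypothetical masking policy: only approved roles see Contact email in clear text.
CREATE MASKING POLICY pii_email_mask AS (val STRING) RETURNS STRING ->
    CASE WHEN CURRENT_ROLE() IN ('PII_READER') THEN val ELSE '***MASKED***' END;

ALTER TABLE analytics.sfdc.contact_current
    MODIFY COLUMN email SET MASKING POLICY pii_email_mask;
```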
Constraints
- Existing stack preference: Airflow, dbt, Snowflake, Kafka are already approved; Spark is available but the team prefers SQL-first where possible.
- The team is 5 data engineers; on-call rotation requires automated remediation and safe retries.
- Some downstream consumers depend on existing table names; you may introduce new schemas but must provide a migration plan.
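For the last constraint, one low-risk migration path is to keep the existing table names alive as views over the new schemas, so downstream consumers keep working until they cut over. A sketch with assumed names:

```sql
-- Hypothetical compatibility view: the legacy name now points at the new current-state table.
CREATE OR REPLACE VIEW legacy.opportunities AS
SELECT * FROM analytics.sfdc.opportunity_current
WHERE NOT is_deleted;
```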
What to Cover in Your Answer
- The architecture (batch + streaming) and how you reconcile CDC with backfills.
- Data model in Snowflake (raw history vs current-state) and keys/watermarks.
- Orchestration strategy in Airflow (DAG structure, retries, backfill runs).
- How you manage API limits, throttling, and incremental extraction (see the watermark sketch after this list).
- Data quality framework (tests, thresholds, quarantine strategy).
- Monitoring, alerting, and runbooks for common failures.
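For incremental extraction, the watermark can be derived from what has already landed, with a small overlap window so late-arriving updates are re-read (the duplicates this creates are harmless if the warehouse apply step is idempotent). A sketch using the same hypothetical landing-table name:

```sql
-- Hypothetical watermark: start the next extract 30 minutes before the latest landed
-- SystemModstamp to absorb out-of-order commits.
SELECT DATEADD('minute', -30, MAX(systemmodstamp)) AS next_extract_from
FROM raw.sfdc.opportunity_events;
```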
You do not need to write a full implementation, but you should be concrete: table schemas, merge logic, and operational details matter.
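As an example of that level of concreteness, the referential sanity check named in the functional requirements could be expressed as a failing-rows query (table names assumed, as above):

```sql
-- Hypothetical data quality check: Opportunities whose AccountId has no matching Account.
-- Any rows returned fail the check and can be quarantined or alerted on.
SELECT o.id, o.account_id
FROM analytics.sfdc.opportunity_current AS o
LEFT JOIN analytics.sfdc.account_current AS a
    ON o.account_id = a.id
WHERE o.account_id IS NOT NULL
  AND a.id IS NULL
  AND NOT o.is_deleted;
```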