Context
You’re interviewing with the Revenue Operations Data Platform team at a global B2B SaaS company that sells security software to mid-market and enterprise customers. The company runs its go-to-market motion in Salesforce (SFDC): Leads → Contacts/Accounts → Opportunities → Quotes → Contracts. Finance and Sales Ops rely on a Snowflake warehouse for forecasting, pipeline coverage, commissions, and renewal analytics.
Today, a legacy integration (a mix of ad-hoc Python scripts and a managed connector) lands daily CSV exports in S3 and then loads them into Snowflake. The business is unhappy: dashboards are often 12–24 hours stale, sales reps complain that “Closed Won” deals don’t show up in commission previews until the next day, and Finance has found inconsistencies where an Opportunity’s Amount differs between Salesforce and Snowflake due to missed updates and partial loads. Leadership wants a near-real-time, auditable, and backfillable synchronization pipeline.
Current Architecture (and pain points)
| Layer | Current approach | Pain point |
|---|---|---|
| Extraction | Daily SFDC bulk export | Misses intra-day updates; hard to capture deletes |
| Landing | S3 CSV dumps | No schema enforcement; duplicates; hard to replay |
| Warehouse load | Snowflake COPY into raw tables | Not idempotent; partial loads create drift |
| Transformations | Some SQL scripts | No lineage/testing; manual reruns |
| Orchestration | Cron + manual | No SLAs, retries, or alerting |
The team wants you to propose a production-grade design to synchronize Salesforce objects into Snowflake with strong correctness guarantees.
Scale Requirements
- Salesforce org size: ~2,500 internal users, heavy automation (Flows, Apex triggers)
- Objects to sync (initial scope): Account, Contact, Lead, Opportunity, OpportunityLineItem, User, Task, Event
- Data volume:
  - Opportunities: ~8M historical, ~80K updates/day
  - Tasks/Events (activities): ~500M historical, ~5M new/day
- Freshness targets (a lag-check sketch follows this list):
  - Core revenue objects (Opportunity, Line Items): P95 < 10 minutes from SFDC commit to queryable in Snowflake
  - Activities: < 60 minutes acceptable
- Storage: 5+ years retention in Snowflake; raw history must be preserved for audit
- API constraints: Must respect Salesforce API limits (daily and concurrent), and handle throttling gracefully
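To make the freshness targets measurable, a scheduled lag check could compare the SFDC commit timestamp to the warehouse load time. A minimal sketch in Snowflake SQL; the table and column names (`raw.sfdc.opportunity_events`, `systemmodstamp`, `_loaded_at`) are assumptions introduced here, not part of the prompt:

```sql
-- Hypothetical P95 end-to-end lag for Opportunities over the last hour,
-- comparing the SFDC SystemModstamp to the warehouse ingestion timestamp.
SELECT
    APPROX_PERCENTILE(DATEDIFF('second', systemmodstamp, _loaded_at), 0.95) / 60
        AS p95_lag_minutes
FROM raw.sfdc.opportunity_events
WHERE _loaded_at >= DATEADD('hour', -1, CURRENT_TIMESTAMP());
```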
Data Characteristics & Quality Issues
- Late-arriving and out-of-order updates: Salesforce updates can arrive delayed due to async automation; Bulk API extracts can return records not strictly ordered by commit time.
- Deletes: Hard deletes and soft deletes (IsDeleted) must be represented in Snowflake.
- Schema evolution: Fields get added/renamed; picklists change; custom objects/fields may appear.
- Duplicates / replays: Retries and backfills must not double-apply changes (see the landing-table sketch after this list).
- PII & compliance: Contacts/Leads contain PII; access must be controlled and changes audited.
Assume you can use Salesforce Change Data Capture (CDC) for supported objects, and fall back to Bulk API 2.0 for large backfills and objects not covered by CDC.
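Because CDC events and Bulk API backfill rows would then share one append-only table, reconciling the two reduces to picking a single winner per record deterministically. A minimal sketch, assuming the hypothetical landing table above:

```sql
-- Hypothetical "latest version wins" selection: order by SystemModstamp, breaking ties
-- with the CDC replay id and load time, so replays and out-of-order arrivals converge.
SELECT *
FROM raw.sfdc.opportunity_events
QUALIFY ROW_NUMBER() OVER (
    PARTITION BY id
    ORDER BY systemmodstamp DESC, _replay_id DESC NULLS LAST, _loaded_at DESC
) = 1;
```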
Your Task
Design an end-to-end synchronization pipeline between Salesforce and Snowflake.
Functional Requirements
- Ingest changes continuously for supported objects using SFDC CDC (or equivalent), including inserts, updates, undeletes, and deletes.
- Backfill historical data for each object (multi-year) and support re-backfills (e.g., last 30 days) without corrupting downstream tables.
- Maintain raw, immutable history (append-only) and produce current-state tables for analytics.
- Implement idempotent upserts into Snowflake so retries do not create duplicates or regress state (see the MERGE sketch after this list).
- Handle late-arriving updates and ensure the latest record version wins deterministically.
- Support schema changes with minimal manual intervention and clear failure modes.
- Provide data quality checks (row counts, freshness, null/uniqueness constraints, referential sanity like Opportunity.AccountId exists).
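One way to meet the idempotency and latest-wins requirements above is a guarded MERGE from the deduplicated raw history into a current-state table. This is a sketch under assumed names (`analytics.sfdc.opportunity_current` plus the hypothetical landing table from earlier) and a deliberately reduced column list:

```sql
-- Hypothetical idempotent upsert: the source side is deduplicated to one row per Id,
-- and the SystemModstamp guard prevents a replayed older version from regressing state.
MERGE INTO analytics.sfdc.opportunity_current AS tgt
USING (
    SELECT *
    FROM raw.sfdc.opportunity_events
    QUALIFY ROW_NUMBER() OVER (
        PARTITION BY id
        ORDER BY systemmodstamp DESC, _replay_id DESC NULLS LAST
    ) = 1
) AS src
ON tgt.id = src.id
WHEN MATCHED AND src.systemmodstamp >= tgt.systemmodstamp THEN UPDATE SET
    amount         = src.payload:Amount::NUMBER,
    stage_name     = src.payload:StageName::STRING,
    is_deleted     = src.is_deleted,
    systemmodstamp = src.systemmodstamp
WHEN NOT MATCHED THEN INSERT (id, amount, stage_name, is_deleted, systemmodstamp)
VALUES (src.id, src.payload:Amount::NUMBER, src.payload:StageName::STRING,
        src.is_deleted, src.systemmodstamp);
```

Re-running the statement with the same inputs leaves the target unchanged, which is what makes retries and re-backfills safe.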
Non-Functional Requirements
- Reliability: No silent data loss; at-least-once ingestion is acceptable if warehouse application is idempotent.
- Observability: Clear SLAs, lineage, and alerting for lag, load failures, and data drift.
- Security: PII controls (masking/row access policies), encrypted transit/at rest, and least-privilege SFDC credentials (a masking-policy sketch follows this list).
- Cost: Keep incremental platform cost under $35k/month; avoid always-on large Snowflake warehouses.
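For the PII control in particular, Snowflake column masking policies are one option. A sketch with assumed role and table names (`PII_READER`, `analytics.sfdc.contact_current`):

```sql
-- Hypothetical masking policy: only approved roles see Contact email in clear text.
CREATE MASKING POLICY pii_email_mask AS (val STRING) RETURNS STRING ->
    CASE WHEN CURRENT_ROLE() IN ('PII_READER') THEN val ELSE '***MASKED***' END;

ALTER TABLE analytics.sfdc.contact_current
    MODIFY COLUMN email SET MASKING POLICY pii_email_mask;
```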
Constraints
- Existing stack preference: Airflow, dbt, Snowflake, Kafka are already approved; Spark is available but the team prefers SQL-first where possible.
- The team is 5 data engineers; on-call rotation requires automated remediation and safe retries.
- Some downstream consumers depend on existing table names; you may introduce new schemas but must provide a migration plan.
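For the last constraint, one low-risk migration path is to keep the existing table names alive as views over the new schemas, so downstream consumers keep working until they cut over. A sketch with assumed names:

```sql
-- Hypothetical compatibility view: the legacy name now points at the new current-state table.
CREATE OR REPLACE VIEW legacy.opportunities AS
SELECT * FROM analytics.sfdc.opportunity_current
WHERE NOT is_deleted;
```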
What to Cover in Your Answer
- The architecture (batch + streaming) and how you reconcile CDC with backfills.
- Data model in Snowflake (raw history vs current-state) and keys/watermarks.
- Orchestration strategy in Airflow (DAG structure, retries, backfill runs).
- How you manage API limits, throttling, and incremental extraction (see the watermark sketch after this list).
- Data quality framework (tests, thresholds, quarantine strategy).
- Monitoring, alerting, and runbooks for common failures.
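For incremental extraction, the watermark can be derived from what has already landed, with a small overlap window so late-arriving updates are re-read (the duplicates this creates are harmless if the warehouse apply step is idempotent). A sketch using the same hypothetical landing-table name:

```sql
-- Hypothetical watermark: start the next extract 30 minutes before the latest landed
-- SystemModstamp to absorb out-of-order commits.
SELECT DATEADD('minute', -30, MAX(systemmodstamp)) AS next_extract_from
FROM raw.sfdc.opportunity_events;
```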
You do not need to write a full implementation, but you should be concrete: table schemas, merge logic, and operational details matter.
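As an example of that level of concreteness, the referential sanity check named in the functional requirements could be expressed as a failing-rows query (table names assumed, as above):

```sql
-- Hypothetical data quality check: Opportunities whose AccountId has no matching Account.
-- Any rows returned fail the check and can be quarantined or alerted on.
SELECT o.id, o.account_id
FROM analytics.sfdc.opportunity_current AS o
LEFT JOIN analytics.sfdc.account_current AS a
    ON o.account_id = a.id
WHERE o.account_id IS NOT NULL
  AND a.id IS NULL
  AND NOT o.is_deleted;
```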