Context
RevOpsHub, a B2B SaaS company with 2,000 internal CRM users, runs Salesforce for sales operations and HubSpot for marketing automation. Today, user permissions, custom object definitions, and data hygiene checks are managed manually through admin workflows and ad hoc exports, which leads to stale warehouse data, inconsistent access audits, and poor lead and account data quality.
You need to design a pipeline that continuously extracts CRM metadata and business records, standardizes them, and publishes analytics-ready governance tables for security, operations, and GTM reporting.
Scale Requirements
- Sources: Salesforce Sales Cloud + HubSpot CRM APIs/webhooks
- Volume: 25M CRM records total, 150 custom objects, 2,000 users, 8,000 permission assignments
- Change rate: ~3M record updates/day, 50K webhook events/hour peak
- Latency target: < 10 minutes for permission changes, < 30 minutes for data hygiene metrics
- Retention: 2 years for audit history, 90 days for raw API payloads
- Warehouse size: ~4 TB compressed analytics data
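For sizing, the stated change rates can be converted into per-second figures. The snippet below is only back-of-envelope arithmetic on the numbers listed above; the variable names are illustrative, and the churn figure is an upper bound because one record may be updated several times in a day.

```python
# Back-of-envelope rates derived from the stated scale figures.
SECONDS_PER_DAY = 24 * 60 * 60
SECONDS_PER_HOUR = 60 * 60

total_records = 25_000_000
record_updates_per_day = 3_000_000
peak_webhooks_per_hour = 50_000

# Average sustained update rate the incremental sync must absorb.
avg_updates_per_sec = record_updates_per_day / SECONDS_PER_DAY

# Peak webhook arrival rate the event ingestion path must absorb.
peak_webhooks_per_sec = peak_webhooks_per_hour / SECONDS_PER_HOUR

# Upper bound on the share of records touched per day
# (assumes every update hits a distinct record).
daily_churn_fraction = record_updates_per_day / total_records

print(f"avg updates/sec:   {avg_updates_per_sec:.1f}")    # 34.7
print(f"peak webhooks/sec: {peak_webhooks_per_sec:.1f}")  # 13.9
print(f"daily churn bound: {daily_churn_fraction:.0%}")   # 12%
```

Both rates are modest on average, so the harder limits are likely to be the API rate limits and the latency targets rather than raw throughput.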
Requirements
- Ingest users, roles, profiles, permission sets, object metadata, field metadata, and record-level CRM entities from Salesforce and HubSpot.
- Support both batch backfills and incremental syncs using API cursors, SystemModstamp, and webhook/event-based updates where available.
- Build canonical warehouse models for crm_users, crm_permissions, crm_objects, crm_fields, crm_records, and data_hygiene_issues.
- Detect data hygiene issues such as duplicate contacts, invalid owner assignments, missing required fields, stale lifecycle stages, and orphaned custom object references.
- Provide auditability for permission changes and schema evolution, including historical snapshots.
- Orchestrate retries, backfills, and dependency-aware transformations with idempotent loads.
- Expose curated tables for BI dashboards and downstream alerting.
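The incremental sync and idempotent load requirements above can be sketched in a few lines. This is a minimal in-memory illustration, not a prescribed design: the record shape, the `system_modstamp` key, the function names, and the overlap window are all assumptions, and a plain dict stands in for the warehouse table that a real load would update with a merge statement.

```python
from datetime import datetime, timedelta
from typing import Any, Dict, Iterable, List, Tuple

Record = Dict[str, Any]      # assumed keys: "source", "id", "system_modstamp"
Key = Tuple[str, str]        # (source system, source record id)


def select_changed(records: Iterable[Record], watermark: datetime,
                   overlap: timedelta = timedelta(minutes=5)) -> List[Record]:
    """Return records modified since (watermark - overlap), oldest first.

    The overlap re-reads a short window so that late-committed changes are
    not skipped; this is safe only because the load below is idempotent.
    """
    since = watermark - overlap
    changed = [r for r in records if r["system_modstamp"] >= since]
    return sorted(changed, key=lambda r: r["system_modstamp"])


def idempotent_upsert(target: Dict[Key, Record], batch: Iterable[Record]) -> None:
    """Merge a batch by (source, id), keeping the newest version of each record.

    Replaying the same batch leaves the target unchanged.
    """
    for rec in batch:
        key = (rec["source"], rec["id"])
        existing = target.get(key)
        if existing is None or rec["system_modstamp"] >= existing["system_modstamp"]:
            target[key] = rec


def advance_watermark(current: datetime, batch: Iterable[Record]) -> datetime:
    """Move the watermark to the newest modification time seen so far."""
    return max([current] + [r["system_modstamp"] for r in batch])
```

Because the merge is keyed on `(source, id)` and keeps the newest version, a retried task or an overlapping backfill produces the same table state as a single clean run.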
Constraints
- Infrastructure is AWS-first with an existing Snowflake warehouse and Airflow deployment.
- CRM APIs are rate-limited and some objects require paginated extraction.
- Compliance requires SOX-style access auditability and PII minimization in non-production environments.
- Team size is 3 data engineers; operational complexity should stay moderate.
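The rate-limit and pagination constraint can be handled with a small generic extractor. The sketch below assumes a hypothetical `fetch_page(cursor)` callable that returns a dict with `records` and `next_cursor`, and a hypothetical `RateLimited` exception raised when the API throttles; neither is taken from the Salesforce or HubSpot client libraries.

```python
import time
from typing import Any, Callable, Dict, Iterator, Optional


class RateLimited(Exception):
    """Hypothetical error a client wrapper raises when the API throttles."""

    def __init__(self, retry_after: Optional[float] = None):
        super().__init__("rate limited")
        self.retry_after = retry_after


Page = Dict[str, Any]  # assumed shape: {"records": [...], "next_cursor": str or None}


def extract_all(fetch_page: Callable[[Optional[str]], Page],
                max_retries: int = 5,
                base_delay: float = 1.0,
                sleep: Callable[[float], None] = time.sleep) -> Iterator[Dict[str, Any]]:
    """Yield every record across pages, backing off when throttled."""
    cursor: Optional[str] = None
    while True:
        attempt = 0
        while True:
            try:
                page = fetch_page(cursor)
                break
            except RateLimited as exc:
                attempt += 1
                if attempt > max_retries:
                    raise  # give up and let the orchestrator retry the task
                # Prefer the server's hint; otherwise back off exponentially.
                if exc.retry_after is not None:
                    delay = exc.retry_after
                else:
                    delay = base_delay * 2 ** (attempt - 1)
                sleep(delay)
        yield from page["records"]
        cursor = page.get("next_cursor")
        if cursor is None:
            return
```

Passing `sleep` as a parameter keeps the function testable without real waiting, and re-raising after `max_retries` hands the failure to Airflow's own task retry handling instead of looping indefinitely.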