Context
You’re interviewing with the Revenue Operations Data Platform team at CloudDesk, a B2B SaaS company selling customer support software to mid-market and enterprise clients. CloudDesk has ~120K paying business customers, ~2.5M monthly active end-users, and ~18K internal sales/support reps (“Repls”) who are organized into Teams (e.g., SMB East, Enterprise Strategic, Renewals). Team membership changes frequently due to org changes, reassignments, and temporary coverage.
The business relies on accurate historical reporting for: (1) commission and quota attainment (finance audit risk), (2) SLA compliance for customer escalations (contractual penalties), and (3) pipeline attribution (revenue forecasting). Today, analytics in Snowflake uses a single “current state” table for team membership. This causes incorrect historical metrics (e.g., if a rep moved teams last week, all of their prior deals now appear under the new team).
CloudDesk’s operational sources are:
- Salesforce (opportunities, accounts, rep assignments)
- Internal Admin Service (team definitions, membership changes)
- Okta (user identity lifecycle; hires/terminations)
- Zendesk (tickets assigned to reps)
Your task is to design a production-grade data model + ETL/ELT pipeline to represent the relationship between Users, Teams, and Repls (a “Repl” is an internal user with a sales/support role) in a way that supports point-in-time and historical reporting, while handling late-arriving events, backfills, and data quality issues.
Scale Requirements
- Membership change events: avg 50K/day, peak 10K/hour during reorganizations
- Repls: ~18K active, ~60K historical
- Teams: ~2K active, ~10K historical (renames, merges, splits)
- Downstream facts:
  - Opportunities: ~80M rows historical
  - Tickets: ~2B events historical
- Freshness: membership changes queryable in Snowflake within 15 minutes (near-real-time acceptable)
- Correctness: point-in-time attribution must be correct for any date in the last 7 years
- Retention: keep all history indefinitely (SOX-like audit expectations)
Data Characteristics & Challenges
Operational event payload (Admin Service)
Membership changes are emitted as JSON events to Kafka:
| Field | Example | Notes |
|---|---|---|
| event_id | "7f9…" | globally unique |
| event_type | "MEMBERSHIP_ADDED" | also REMOVED, TEAM_RENAMED, TEAM_MERGED |
| repl_id | "r_123" | nullable for team-only events |
| team_id | "t_456" | team affected |
| effective_at | "2026-01-10T09:00:00Z" | business-effective time |
| emitted_at | "2026-01-10T09:03:12Z" | ingestion time |
| actor | "admin@…" | who made the change |
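For illustration, a thin staging layer can flatten this payload into typed columns before any modeling. The sketch below assumes the events land in a raw Snowflake table `raw.membership_events` with a single VARIANT `payload` column plus a load timestamp (e.g., via Snowpipe or a Spark Structured Streaming sink); all object and column names here are assumptions, not part of the spec.

```sql
-- Illustrative staging view over raw Kafka events (names are assumptions).
-- Assumes raw.membership_events(payload VARIANT, loaded_at TIMESTAMP_NTZ).
create or replace view staging.stg_membership_events as
select
    payload:event_id::string            as event_id,
    payload:event_type::string          as event_type,    -- MEMBERSHIP_ADDED / REMOVED / TEAM_RENAMED / TEAM_MERGED
    payload:repl_id::string             as repl_id,        -- null for team-only events
    payload:team_id::string             as team_id,
    payload:effective_at::timestamp_tz  as effective_at,   -- business-effective time
    payload:emitted_at::timestamp_tz    as emitted_at,     -- emission/ingestion time
    payload:actor::string               as actor,
    loaded_at
from raw.membership_events;
```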
Known issues
- Late-arriving events: effective_at can be backdated by up to 30 days (HR corrections).
- Out-of-order delivery: Kafka ordering is only guaranteed per partition; some events arrive out of order.
- Duplicates: retries can produce duplicate event_id or semantically duplicate events (see the dedup sketch after this list).
- Team merges/splits: a merge can retroactively re-map membership; you must preserve original history and also support “as-of” truth.
- Identity mapping: Okta user_id ↔ internal repl_id mapping can change (contractors converted to FTE).
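To make the duplicate and out-of-order issues concrete: one common pattern is a deterministic “latest emission wins” dedup keyed on event_id, so replays and reordering converge to the same surviving row on every run. This builds on the illustrative staging view above and is a sketch of one approach, not the required one.

```sql
-- Deterministic dedup: keep the latest emission per event_id so retries and
-- out-of-order Kafka delivery always yield the same output (illustrative names).
create or replace view staging.stg_membership_events_dedup as
select *
from staging.stg_membership_events
qualify row_number() over (
    partition by event_id
    order by emitted_at desc, loaded_at desc
) = 1;
```

Semantic duplicates (different event_id, same repl/team/event_type/effective_at) can be collapsed the same way with a second window over those business fields.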
Requirements
Functional
- Model Users, Repls, Teams such that analysts can answer:
  - “Which team was repl X in on 2025-11-15?”
  - “For each opportunity close date, attribute revenue to the rep’s team at that time.”
  - “How many reps were in Team Y each day last quarter?”
- Support many-to-many membership over time (a repl can belong to multiple teams simultaneously for coverage, with primary/secondary roles).
- Support team hierarchy (team → parent team) with history (reorgs).
- Provide slowly changing dimension semantics with explicit validity windows (one possible table shape is sketched after this list).
- Enable backfills and reprocessing when late events arrive.
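For illustration, one possible shape for the membership history table and the point-in-time lookup it enables is sketched below; the names and columns are assumptions (a real design might, for example, split team hierarchy into its own history table), not a prescribed answer.

```sql
-- One possible membership-history table with explicit validity windows (illustrative).
create table if not exists marts.dim_repl_team_membership (
    membership_sk     number identity,        -- surrogate key
    repl_id           string       not null,
    team_id           string       not null,
    role              string       not null,  -- e.g. 'PRIMARY' or 'SECONDARY' (coverage)
    valid_from        timestamp_tz not null,  -- business-effective start (inclusive)
    valid_to          timestamp_tz not null,  -- end (exclusive); far-future value while open
    is_current        boolean,
    is_deleted        boolean default false,  -- soft delete / retraction marker
    source_event_ids  array                   -- lineage back to contributing event_ids
);

-- "Which team was repl X in on 2025-11-15?"
select team_id, role
from marts.dim_repl_team_membership
where repl_id = 'r_123'
  and valid_from <= '2025-11-15'::timestamp_tz
  and valid_to   >  '2025-11-15'::timestamp_tz
  and not is_deleted;
```

Because each (repl_id, team_id, role) combination carries its own validity window, concurrent memberships for coverage fall out naturally; team hierarchy can follow the same pattern with a team history table holding parent_team_id per window.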
Non-functional
- Idempotent loads and deterministic outputs.
- Auditable lineage: every derived row should be traceable to source events.
- Data quality: enforce referential integrity, no overlapping validity windows for the same repl/team/role.
- Performance: common attribution queries over 2B tickets should run in <30s with proper pruning.
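The “no overlapping validity windows” rule translates directly into a testable query (e.g., a dbt singular test) in which any returned row is a violation; this sketch assumes the illustrative dimension above.

```sql
-- dbt-style singular test sketch: rows returned violate the
-- "no overlapping validity windows per repl/team/role" rule (illustrative names).
select
    a.repl_id, a.team_id, a.role,
    a.valid_from, a.valid_to,
    b.valid_from as overlapping_from,
    b.valid_to   as overlapping_to
from marts.dim_repl_team_membership a
join marts.dim_repl_team_membership b
  on  a.repl_id = b.repl_id
  and a.team_id = b.team_id
  and a.role    = b.role
  and a.membership_sk <> b.membership_sk
  and a.valid_from < b.valid_to   -- windows intersect
  and b.valid_from < a.valid_to;
```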
Constraints
- Primary warehouse is Snowflake; transformations are done with dbt.
- Orchestration is Apache Airflow 2.x.
- Streaming is available via Kafka; you may use Spark Structured Streaming for near-real-time ingestion.
- Budget: avoid always-on large clusters; prefer micro-batch or serverless patterns.
- Compliance: produce an audit trail suitable for finance review; soft deletes must be represented (do not physically delete history).
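As a sketch of the soft-delete expectation (assuming the illustrative tables above): a MEMBERSHIP_REMOVED event closes the open validity window and appends lineage rather than deleting anything, so the audit trail survives. A full dbt incremental model would also insert new memberships and reopen or correct windows when late events arrive.

```sql
-- Illustrative window-closing merge: nothing is physically deleted.
merge into marts.dim_repl_team_membership d
using (
    select repl_id, team_id, effective_at, event_id
    from staging.stg_membership_events_dedup
    where event_type = 'MEMBERSHIP_REMOVED'
    -- one removal per repl/team per batch keeps the merge deterministic
    qualify row_number() over (partition by repl_id, team_id order by effective_at desc) = 1
) e
on  d.repl_id = e.repl_id
and d.team_id = e.team_id
and d.is_current
when matched then update set
    valid_to         = e.effective_at,
    is_current       = false,
    source_event_ids = array_append(d.source_event_ids, e.event_id);
```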
What you should deliver (as the candidate)
- Propose the dimensional model (tables, keys, validity columns, constraints) for Users/Repls/Teams membership history.
- Describe the ETL/ELT pipeline from Kafka/Admin Service + Salesforce + Okta into Snowflake, including how you handle late/out-of-order events.
- Show how downstream fact tables (opportunities, tickets) will be attributed to team membership as-of event time (a sample as-of join follows this list).
- Define data quality checks and operational monitoring.
- Explain how you’d backfill 7 years of history and keep it correct going forward.
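For the as-of attribution deliverable, the core pattern is a range join from each fact row to the membership window containing its event time. The sketch below uses an assumed fct_opportunities table together with the illustrative membership dimension from earlier; the same join shape applies to ticket events, where clustering the membership table on repl_id and the facts on date is what keeps 2B-row scans prunable.

```sql
-- Illustrative as-of attribution: credit each opportunity to the team its owning
-- repl was in on the close date (fct_opportunities and its columns are assumed names).
select
    o.opportunity_id,
    o.close_date,
    m.team_id as attributed_team_id,
    o.amount
from marts.fct_opportunities o
left join marts.dim_repl_team_membership m
  on  m.repl_id = o.repl_id
  and m.role = 'PRIMARY'                            -- attribute to the primary team
  and m.valid_from <= o.close_date::timestamp_tz
  and m.valid_to   >  o.close_date::timestamp_tz
  and not m.is_deleted;
```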