Context
NovaSuite, a B2B SaaS collaboration platform, wants a reliable data model and pipeline for tracking product feature usage across web, mobile, and backend services. Today, event logs land as raw JSON in S3 and are batch-loaded nightly into Snowflake, but inconsistent schemas, duplicate events, and unclear feature definitions make adoption reporting unreliable.
You are asked to design the end-to-end pipeline and warehouse model for feature usage analytics that supports product dashboards, customer success reporting, and downstream experimentation analysis.
Scale Requirements
- Event volume: 120M product events/day, 2.5K events/sec average, 12K events/sec peak
- Event size: ~1.5 KB JSON
- Latency target: raw ingestion < 2 minutes, curated usage tables < 15 minutes
- Retention: raw events for 180 days, curated aggregates for 3 years
- Cardinality: 8M monthly active users, 250K workspaces, 400 tracked features
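As a rough sizing aid, the figures above can be turned into throughput and storage estimates. This is a back-of-the-envelope sketch using uncompressed sizes and decimal units; the variable names are illustrative. Note that 120M events spread evenly over 24 hours works out to about 1.4K events/sec, so the stated 2.5K average presumably describes busy hours rather than a full-day mean (an assumption, not something the brief states).

```python
SECONDS_PER_DAY = 24 * 60 * 60

events_per_day = 120_000_000   # stated daily event volume
event_size_kb = 1.5            # approximate JSON size per event
raw_retention_days = 180       # stated raw retention window
peak_events_per_sec = 12_000   # stated peak rate

# Mean rate if the daily volume were spread evenly over 24 hours.
mean_events_per_sec = events_per_day / SECONDS_PER_DAY

# Uncompressed raw volume per day in GB (decimal units: 1 GB = 1e6 KB).
raw_gb_per_day = events_per_day * event_size_kb / 1_000_000

# Uncompressed raw volume held at the end of the retention window, in TB.
raw_tb_retained = raw_gb_per_day * raw_retention_days / 1_000

# Ingest bandwidth at peak in MB/s (1 MB = 1e3 KB).
peak_mb_per_sec = peak_events_per_sec * event_size_kb / 1_000

print(f"24h mean rate:      {mean_events_per_sec:,.0f} events/sec")
print(f"Raw volume per day: {raw_gb_per_day:,.0f} GB")
print(f"Raw volume at 180d: {raw_tb_retained:,.1f} TB")
print(f"Peak ingest:        {peak_mb_per_sec:,.0f} MB/s")
```

Compression in S3 and Snowflake would typically reduce the stored volume considerably, so these numbers are best read as upper bounds for budgeting.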
Requirements
- Design a canonical event schema and warehouse data model to track feature usage by user, workspace/account, feature, platform, and time.
- Support both event-level analysis and pre-aggregated usage tables for daily/weekly reporting.
- Handle schema evolution, duplicate client retries, and late-arriving events up to 72 hours.
- Define how feature metadata is maintained, including feature hierarchy (e.g. editor > comments > mention).
- Build incremental ELT models for common outputs such as feature_usage_daily, feature_adoption_by_workspace, and user_feature_last_seen.
- Include data quality checks for null keys, invalid feature IDs, event timestamp drift, and volume anomalies.
- Describe orchestration, backfill strategy, and how analysts and PMs will query the curated layer.
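To make the deduplication, late-arrival, and data quality requirements above concrete, here is a minimal sketch of the row-level logic. In the existing stack this would more likely live in dbt SQL; Python is used here only to illustrate the rules. The field names, the `event_id` idempotency key, and the 5-minute clock-skew tolerance are all assumptions and not part of the brief. Only the 72-hour window comes from the requirements.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Dict, Iterable, List, Optional, Set

LATE_ARRIVAL_WINDOW = timedelta(hours=72)  # from the requirements
MAX_FUTURE_DRIFT = timedelta(minutes=5)    # assumed tolerance for client clock skew


@dataclass(frozen=True)
class ProductEvent:
    """Hypothetical canonical event; field names are illustrative."""
    event_id: str                # assumed client-generated idempotency key
    user_id: Optional[str]
    workspace_id: Optional[str]
    feature_id: Optional[str]
    platform: str
    event_ts: datetime           # when the action happened on the client
    received_ts: datetime        # when the pipeline ingested the event
    schema_version: int = 1


def validate(event: ProductEvent, known_feature_ids: Set[str]) -> List[str]:
    """Return the names of the quality checks this event fails."""
    problems = []
    if not event.user_id or not event.workspace_id:
        problems.append("null_key")
    if event.feature_id not in known_feature_ids:
        problems.append("invalid_feature_id")
    if event.event_ts - event.received_ts > MAX_FUTURE_DRIFT:
        problems.append("timestamp_drift")
    if event.received_ts - event.event_ts > LATE_ARRIVAL_WINDOW:
        problems.append("too_late")
    return problems


def deduplicate(events: Iterable[ProductEvent]) -> List[ProductEvent]:
    """Keep the earliest-received copy of each event_id (client retries share the id)."""
    first_seen: Dict[str, ProductEvent] = {}
    for event in events:
        kept = first_seen.get(event.event_id)
        if kept is None or event.received_ts < kept.received_ts:
            first_seen[event.event_id] = event
    return list(first_seen.values())


known = {"editor.comments.mention"}
now = datetime(2024, 1, 10, 12, 0, tzinfo=timezone.utc)

original = ProductEvent("e1", "u1", "w1", "editor.comments.mention", "web",
                        now - timedelta(minutes=1), now)
retry = ProductEvent("e1", "u1", "w1", "editor.comments.mention", "web",
                     now - timedelta(minutes=1), now + timedelta(seconds=30))
stale = ProductEvent("e2", "u1", "w1", "editor.comments.mention", "ios",
                     now - timedelta(hours=80), now)
orphan = ProductEvent("e3", None, "w1", "unknown.feature", "android", now, now)

unique_events = deduplicate([original, retry, stale])
print(len(unique_events))        # the retry collapses into the original
print(validate(stale, known))
print(validate(orphan, known))
```

Keeping the earliest-received copy is one possible tie-break. A design could equally keep the latest copy if later retries are expected to carry corrected payloads.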
Constraints
- Existing stack is AWS + Snowflake + dbt + Airflow; avoid introducing a large new platform unless justified.
- Budget increase is capped at $15K/month.
- Must support GDPR deletion requests within 30 days.
- Product instrumentation is partially inconsistent across clients, so the design must tolerate imperfect source data.
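One way to tolerate inconsistent client instrumentation is to normalize field names and values at the staging step and quarantine records that cannot be repaired, rather than dropping them silently. The sketch below assumes hypothetical key variants (`userId`, `workspaceId`, `feature`, `client`) and a hypothetical platform alias table. The real variants would have to come from an audit of actual client payloads.

```python
import json
from typing import Any, Dict, Optional, Tuple

# Hypothetical alias table mapping observed platform strings to canonical values.
PLATFORM_ALIASES = {
    "web": "web", "browser": "web",
    "ios": "ios", "iphone": "ios",
    "android": "android",
    "backend": "backend", "server": "backend",
}


def first_present(record: Dict[str, Any], *keys: str) -> Any:
    """Return the first non-empty value found under any of the candidate keys."""
    for key in keys:
        value = record.get(key)
        if value not in (None, ""):
            return value
    return None


def normalize(raw_line: str) -> Tuple[Optional[Dict[str, Any]], Optional[str]]:
    """Return (normalized_record, None) on success or (None, quarantine_reason)."""
    try:
        record = json.loads(raw_line)
    except json.JSONDecodeError:
        return None, "unparseable_json"
    if not isinstance(record, dict):
        return None, "not_an_object"

    platform_raw = str(first_present(record, "platform", "client") or "")
    platform = PLATFORM_ALIASES.get(platform_raw.strip().lower())
    if platform is None:
        return None, "unknown_platform"

    normalized = {
        "user_id": first_present(record, "user_id", "userId"),
        "workspace_id": first_present(record, "workspace_id", "workspaceId"),
        "feature_id": first_present(record, "feature_id", "feature"),
        "platform": platform,
    }
    return normalized, None


good, reason = normalize(
    '{"userId": "u1", "workspaceId": "w1", "feature": "editor", "platform": " iOS "}'
)
print(good, reason)
print(normalize("{not json"))
print(normalize('{"platform": "smart-fridge"}'))
```

Returning a quarantine reason instead of raising keeps bad records countable, which also gives the volume-anomaly checks something to monitor.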