## Business Context
Databricks marketing wants a reliable customer segmentation framework covering 420K accounts and leads across the self-serve, commercial, and enterprise motions. The segments will drive campaign personalization, with performance tracked in Databricks Lakehouse Monitoring dashboards, and the goal is to improve conversion from marketing-qualified lead (MQL) to qualified pipeline.
## Dataset
You are given an account-level dataset built in Delta Lake from 18 months of CRM, product usage, campaign engagement, and firmographic history.
| Feature Group | Count | Examples |
|---|---|---|
| Firmographic | 12 | industry, employee_band, region, cloud_provider, funding_stage |
| Marketing engagement | 14 | email_open_rate_90d, webinar_attendance_count, content_downloads_30d, paid_media_clicks_30d |
| Product usage | 16 | workspace_creations_30d, notebook_runs_30d, sql_queries_30d, active_users_30d |
| Sales / lifecycle | 9 | lead_source, opportunity_stage, days_since_last_touch, account_age_days |
| Financial / account value | 7 | estimated_arr, contract_value, expansion_flag, renewal_in_90d |
- Rows: 420K accounts, 58 features
- Target for downstream validation: binary flag for conversion to qualified pipeline within 60 days
- Missing data: 18% missing in product usage for non-trial accounts, 11% missing in firmographics, 6% missing in engagement metrics
- Data quality issues: right-skewed spend and usage variables, high-cardinality categorical fields, and correlated activity metrics
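The quality issues above suggest a preprocessing order of log-transform, impute, encode, scale before any distance-based clustering. A minimal sketch on a synthetic stand-in for the account table (all column values here are illustrative, not from the real dataset):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Hypothetical toy frame standing in for the 420K-account dataset.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "notebook_runs_30d": rng.lognormal(mean=2.0, sigma=1.5, size=1000),  # right-skewed
    "email_open_rate_90d": rng.uniform(0, 1, size=1000),
    "industry": rng.choice([f"ind_{i}" for i in range(40)], size=1000),  # high cardinality
})
# Simulate the ~18% missingness seen in product usage for non-trial accounts.
df.loc[df.sample(frac=0.18, random_state=0).index, "notebook_runs_30d"] = np.nan

# 1. Log-transform right-skewed usage/spend columns before imputing and scaling.
df["notebook_runs_30d"] = np.log1p(df["notebook_runs_30d"])

# 2. Median imputation is robust to the remaining skew.
num_cols = ["notebook_runs_30d", "email_open_rate_90d"]
df[num_cols] = SimpleImputer(strategy="median").fit_transform(df[num_cols])

# 3. Frequency-encode high-cardinality categoricals instead of one-hot encoding,
#    which would explode the feature space.
freq = df["industry"].value_counts(normalize=True)
df["industry_freq"] = df["industry"].map(freq)

# 4. Standardize so distance-based clustering is not dominated by scale.
X = StandardScaler().fit_transform(df[num_cols + ["industry_freq"]])
print(X.shape)
```

For the correlated activity metrics, a PCA or feature-pruning step on `X` could follow, at the cost of some interpretability.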
## Success Criteria
A strong solution should produce segments that are both analytically coherent and operationally useful:
- Stable clusters across monthly refreshes
- Clear business interpretation for campaign strategy
- Downstream lift over current rule-based segmentation on pipeline conversion
- A reproducible workflow in Databricks that can score new accounts weekly
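The first criterion, stability across monthly refreshes, can be quantified with the adjusted Rand index (ARI) between the cluster assignments of two consecutive snapshots. A sketch on synthetic data (the drift magnitude and cluster count are assumptions for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Synthetic stand-in: the same accounts seen at two monthly refreshes,
# with small feature drift between snapshots.
X_month1, _ = make_blobs(n_samples=2000, centers=5, random_state=1)
X_month2 = X_month1 + np.random.default_rng(1).normal(0, 0.1, X_month1.shape)

labels1 = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X_month1)
labels2 = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X_month2)

# ARI of 1.0 means identical partitions; values near 0 mean chance agreement.
# ARI is invariant to label permutation, so arbitrary cluster IDs don't matter.
ari = adjusted_rand_score(labels1, labels2)
print(f"month-over-month ARI: {ari:.3f}")
```

A reasonable acceptance threshold (e.g. ARI above 0.8 between refreshes) would itself be a business decision to agree with stakeholders.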
## Constraints
- Marketing stakeholders need interpretable segments, not just black-box embeddings
- Weekly batch scoring must finish in under 20 minutes on Databricks
- Segment definitions should be refreshable monthly without manual relabeling
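The monthly-refresh-without-manual-relabeling constraint can be met by aligning each new fit's cluster IDs to the previous month's via centroid matching. One way to do this (an assumed approach, not prescribed by the brief) is the Hungarian algorithm on pairwise centroid distances:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

# Hypothetical centroids from two monthly fits; cluster IDs are arbitrary,
# so segment 0 in one month may come back as segment 2 the next.
old_centroids = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
new_centroids = np.array([[5.1, 4.9], [0.1, 5.2], [0.1, -0.1]])  # permuted order

# Match new clusters to old ones by minimizing total centroid distance.
cost = cdist(old_centroids, new_centroids)
old_idx, new_idx = linear_sum_assignment(cost)
mapping = dict(zip(new_idx, old_idx))  # new label -> stable old label
print(mapping)
```

Applying `mapping` to the new fit's labels keeps segment names (and the campaigns attached to them) stable month over month, as long as the underlying segments have not genuinely changed.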
## Deliverables
- Build an end-to-end segmentation pipeline, including preprocessing and clustering.
- Select the number of segments using quantitative and business criteria.
- Evaluate segment quality and stability, then validate usefulness against 60-day pipeline conversion.
- Propose how to productionize scoring and monitoring in Databricks.
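For the quantitative half of segment-count selection, a common approach is a silhouette sweep over the range of k that marketing can actually operationalize. A sketch on synthetic data (the 3-10 range and the sampling size are illustrative assumptions):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the preprocessed account feature matrix.
X, _ = make_blobs(n_samples=3000, centers=6, cluster_std=1.0, random_state=42)

# Sweep k over a range marketing can act on; subsample the silhouette
# computation, which is quadratic in n, to keep the sweep fast.
scores = {}
for k in range(3, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels, sample_size=2000, random_state=0)

best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

The quantitatively best k is only a starting point: the final choice should also survive the business checks above (interpretability, stability across refreshes, and lift over the rule-based baseline).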