Classify and Segment Retail Customers

Business Context

ShopSphere, an e-commerce marketplace with 2.4M monthly users, wants two capabilities from the same customer dataset: (1) predict whether a customer will make a purchase in the next 30 days, and (2) discover natural customer segments for lifecycle marketing. This question tests whether you understand when to use supervised vs. unsupervised learning and how to implement both correctly.

Dataset

You are given a customer-level feature table built from the last 90 days of activity.

Feature Group	Count	Examples
Behavioral	12	sessions_30d, avg_session_minutes, product_views_30d, cart_adds_30d
Transactional	8	orders_90d, avg_order_value, discount_usage_rate, returns_90d
Marketing	5	email_open_rate, push_click_rate, acquisition_channel
Profile	6	country, device_type, tenure_days, loyalty_tier
Derived	7	recency_days, views_per_session, cart_to_view_rate

Size: 120K customers, 38 features
Target for supervised task: purchased_next_30d (1 if customer purchases in the next 30 days, else 0)
Class balance: 22% positive, 78% negative
Missing data: ~9% missing in marketing engagement fields, ~4% missing in avg_order_value for customers with no prior purchases
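
The two kinds of missingness described above likely call for different treatment. The following is a minimal sketch using only the Python standard library, not a prescribed solution. The helper name impute_with_flag, the sample rows, and the choice to fill avg_order_value with 0 are illustrative assumptions. Marketing gaps are treated as unknown values and filled with the observed median; avg_order_value gaps are treated as structural (no orders yet). Both keep a missing-indicator column so a model can still see that the value was absent.

```python
from statistics import median

def impute_with_flag(rows, field, fill=None):
    """Return copies of rows with gaps in `field` filled and a
    '<field>_missing' 0/1 indicator added.

    fill=None means: use the median of the observed values.
    (Hypothetical helper, written for illustration only.)
    """
    observed = [r[field] for r in rows if r.get(field) is not None]
    fill_value = median(observed) if fill is None else fill
    out = []
    for r in rows:
        missing = r.get(field) is None
        new_row = dict(r)  # do not mutate the caller's data
        new_row[field] = fill_value if missing else r[field]
        new_row[f"{field}_missing"] = int(missing)
        out.append(new_row)
    return out

# Made-up sample rows, not taken from the ShopSphere dataset.
customers = [
    {"email_open_rate": 0.40, "avg_order_value": 52.0},
    {"email_open_rate": None, "avg_order_value": 31.0},
    {"email_open_rate": 0.10, "avg_order_value": None},  # no prior purchases
    {"email_open_rate": 0.20, "avg_order_value": 75.0},
]

# Marketing engagement: value is unknown, so fill with the observed median.
customers = impute_with_flag(customers, "email_open_rate")
# avg_order_value: gap is structural (no orders), so fill with 0 (assumption).
customers = impute_with_flag(customers, "avg_order_value", fill=0.0)
```

In a real pipeline the median would be computed on the training split only and reused at scoring time, so that test-set information does not leak into preprocessing.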

Success Criteria

A good solution should:

Build a supervised model that achieves ROC-AUC >= 0.82 and F1 >= 0.60 on a held-out test set
Produce customer segments with silhouette score >= 0.25 and clear business interpretation
Clearly explain why prediction uses supervised learning and segmentation uses unsupervised learning
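
To put the F1 >= 0.60 target in context, it helps to work out what trivial classifiers score at the stated 22% positive rate. The sketch below is plain arithmetic in Python; the function and variable names are illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

n_customers = 120_000
positive_rate = 0.22
n_positive = round(n_customers * positive_rate)  # about 26,400 purchasers

# Predict "will purchase" for everyone:
# recall is 1.0 and precision equals the base rate.
always_positive_f1 = f1_score(positive_rate, 1.0)

# Predict "will not purchase" for everyone:
# 78% accuracy, but no true positives, so precision and recall are both 0.
always_negative_f1 = f1_score(0.0, 0.0)

print(f"always-positive F1: {always_positive_f1:.3f}")
print(f"always-negative F1: {always_negative_f1:.3f}")
```

The always-positive baseline lands at roughly 0.36, so the 0.60 target requires real discrimination, and accuracy alone would be misleading at this class balance. Because F1 depends on the decision threshold while ROC-AUC does not, the threshold is usually tuned on a validation split rather than fixed at 0.5.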

Constraints

Batch scoring must finish in under 10 minutes for 120K customers
Marketing stakeholders need interpretable outputs: feature importance for prediction and understandable cluster profiles
Retraining can happen weekly; no real-time serving is required

Deliverables

Build a supervised model to predict purchased_next_30d
Build an unsupervised model to segment customers
Compare the goals, inputs, outputs, and evaluation of both approaches
Describe preprocessing choices for mixed feature types and missing data
Recommend how both models would be used together in production
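
One way to make the evaluation contrast concrete: ROC-AUC and F1 need the purchased_next_30d labels, while the silhouette score needs only feature vectors and cluster assignments. The following is a minimal pure-Python sketch of the standard silhouette definition, s = (b - a) / max(a, b), assuming Euclidean distance. The toy points are made up for illustration.

```python
from math import dist

def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        # (dist(point, point) is 0, so it does not affect the sum).
        a = sum(dist(point, q) for q in own) / (len(own) - 1)
        # b: mean distance to the nearest other cluster.
        b = min(
            sum(dist(point, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Two tight, well-separated groups (toy data).
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_score(points, [0, 0, 1, 1])  # matches the geometry
bad = silhouette_score(points, [0, 1, 0, 1])   # cuts across both groups
print(f"good assignment: {good:.3f}, bad assignment: {bad:.3f}")
```

This implementation is quadratic in the number of points, so for 120K customers the score would normally be estimated on a random sample. Features should also be scaled first, since Euclidean distance is dominated by whichever feature has the largest range.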

