Classify and Segment Retail Customers

Business Context

ShopSphere, an online retail marketplace with 2M monthly users, wants to improve customer targeting. The growth team needs both a supervised model to predict whether a customer will purchase in the next 30 days and an unsupervised model to segment customers for marketing campaigns.

Dataset

You are given a customer-level dataset built from the last 12 months of activity.

Feature Group	Count	Examples
Demographics	5	age, region, device_type, acquisition_channel
Behavioral	10	sessions_last_30d, avg_session_length, pages_per_session, cart_add_rate
Transactional	8	orders_last_90d, avg_order_value, refund_rate, days_since_last_purchase
Engagement	5	email_open_rate, push_click_rate, wishlist_count, support_tickets
Target	1	purchased_next_30d

Size: 120K customers, 28 input features
Target: Binary label for supervised learning — whether the customer makes a purchase in the next 30 days
Class balance: 22% positive, 78% negative
Missing data: 8% missing in engagement features, 3% missing in demographics
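
One way to handle the stated missingness before either model is sketched below. It assumes scikit-learn and pandas are available. The data is synthetic, with only the column names taken from the feature table. Median and most-frequent imputation are illustrative choices, not ones the problem statement prescribes.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)
n = 1_000  # small synthetic stand-in for the 120K-customer table

# Synthetic values only; column names are borrowed from the feature table.
df = pd.DataFrame({
    "age": rng.integers(18, 70, n).astype(float),
    "sessions_last_30d": rng.poisson(6, n).astype(float),
    "email_open_rate": rng.uniform(0, 1, n),
    "region": pd.Series(rng.choice(["north", "south", "east", "west"], n), dtype=object),
})

# Inject missingness at roughly the stated rates (8% engagement, 3% demographics).
df.loc[rng.random(n) < 0.08, "email_open_rate"] = np.nan
df.loc[rng.random(n) < 0.03, "age"] = np.nan
df.loc[rng.random(n) < 0.03, "region"] = np.nan

numeric_cols = ["age", "sessions_last_30d", "email_open_rate"]
categorical_cols = ["region"]

# Numeric: median imputation then scaling. Categorical: mode imputation then one-hot.
preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric_cols),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])

X = preprocess.fit_transform(df)
# The one-hot step may yield a sparse matrix; densify for inspection.
X = X.toarray() if hasattr(X, "toarray") else np.asarray(X)
print(X.shape, int(np.isnan(X).sum()))
```

The same fitted transformer could feed both the classifier and the clustering step, so that the two models see identically prepared features.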

Success Criteria

A good solution should:

Achieve ROC-AUC >= 0.82 on the supervised task
Produce interpretable customer segments with a silhouette score >= 0.30 on the unsupervised task
Clearly explain when supervised vs. unsupervised learning is appropriate
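
The two metrics named above can be computed as in the following sketch, which assumes scikit-learn. The data is synthetic (generated with make_classification and make_blobs), and logistic regression and k-means are placeholder models chosen for interpretability, so the printed values say nothing about the ShopSphere dataset.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs, make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, silhouette_score
from sklearn.model_selection import train_test_split

# Supervised task: synthetic data with roughly 22% positives, 28 features.
X, y = make_classification(
    n_samples=2000, n_features=28, n_informative=8,
    weights=[0.78, 0.22], random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# class_weight="balanced" is one simple way to account for the class imbalance.
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_tr, y_tr)

# ROC-AUC is computed on held-out data from predicted probabilities, not hard labels.
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])

# Unsupervised task: no target is used at any point.
X_seg, _ = make_blobs(n_samples=2000, centers=4, n_features=5, random_state=0)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_seg)

# Silhouette is quadratic in sample count, so subsample for speed.
sil = silhouette_score(X_seg, labels, sample_size=1000, random_state=0)

print(f"ROC-AUC: {auc:.3f}  silhouette: {sil:.3f}")
```

On the full 120K-customer table, the sample_size argument keeps the silhouette computation tractable within the training-time budget.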

Constraints

Predictions will run daily in batch on 120K customers
Marketing stakeholders need interpretable outputs
Training should complete within 30 minutes on a standard cloud VM

Deliverables

Build a supervised learning pipeline to predict purchased_next_30d
Build an unsupervised learning pipeline to segment customers without using the target
Compare the two approaches and explain the difference in objective, data requirements, and evaluation
Report model performance with appropriate metrics
Recommend how both outputs would be used in production
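
As a rough illustration of how the two outputs might be combined in a daily batch job, the sketch below joins a purchase probability and a segment label per customer. The customer IDs, probabilities, segment names, and the 0.5 cutoff are all hypothetical values made up for the example.

```python
import pandas as pd

# Hypothetical daily batch output: one row per customer with both model outputs.
scores = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5, 6],
    "purchase_prob": [0.9, 0.2, 0.7, 0.1, 0.6, 0.3],  # supervised model output
    "segment": ["A", "A", "B", "B", "C", "C"],        # unsupervised model output
})

# Segment-level view for marketing: average propensity and segment size.
summary = scores.groupby("segment")["purchase_prob"].agg(["mean", "count"])

# Customer-level view: flag likely purchasers using an illustrative cutoff.
THRESHOLD = 0.5  # assumption; in practice chosen from campaign cost and capacity
scores["target_for_campaign"] = scores["purchase_prob"] >= THRESHOLD

print(summary)
print(scores)
```

The propensity score answers who to contact, while the segment suggests which message or offer to use.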
