Classify and Segment Retail Customers

Business Context

ShopSphere, an online retail marketplace with 2M monthly users, wants to improve customer targeting. The marketing team needs both a supervised model to predict whether a customer will make a purchase in the next 30 days and an unsupervised approach to segment customers for campaign design.

Dataset

You are given a customer-level dataset built from the last 12 months of activity.

Feature Group	Count	Examples
Demographics	5	age_band, region, device_type, acquisition_channel
Behavioral	12	sessions_30d, avg_session_duration, pages_per_session, cart_additions_30d
Transactional	8	orders_90d, avg_order_value, discount_usage_rate, returns_90d
Engagement	6	email_opens_30d, push_click_rate, days_since_last_visit, wishlist_items
Target	1	purchased_next_30d

Size: 120K customers, 31 input features
Target: Binary label for supervised learning — whether the customer purchases in the next 30 days
Class balance: 28% positive, 72% negative
Missing data: 10% missing in demographics, 6% missing in engagement fields, especially for newly acquired users
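
The missing-data note above suggests keeping a record of which values were absent, since missingness is concentrated in newly acquired users and may itself carry signal. A minimal sketch of one possible approach, using only the Python standard library: median imputation plus a missing flag for numeric fields, and an explicit placeholder category for categorical fields. The helper names and sample values are hypothetical, and in a real pipeline the median would be computed on the training split only.

```python
from statistics import median


def impute_numeric(values):
    """Fill None with the median of observed values.

    Returns (filled_values, missing_flags) so the model can still see
    which rows were originally missing.
    """
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    fill = median(observed)
    filled = [fill if v is None else v for v in values]
    flags = [int(v is None) for v in values]
    return filled, flags


def impute_categorical(values, placeholder="unknown"):
    """Replace None with an explicit placeholder category."""
    return [placeholder if v is None else v for v in values]


# Illustrative values only; not taken from the ShopSphere dataset.
email_opens_30d = [4, None, 0, 9, None, 2]
region = ["north", None, "south", "north", "west", None]

opens_filled, opens_missing = impute_numeric(email_opens_30d)
region_filled = impute_categorical(region)
```

Tree-based models can often use the filled values and flags directly, while distance-based segmentation would additionally need scaling and an encoding for the categorical fields.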

Success Criteria

A strong solution should:

Achieve ROC-AUC >= 0.82 for purchase prediction
Produce 3-6 actionable customer segments with clear behavioral differences
Clearly explain when supervised learning is the right tool, when unsupervised learning is, and how the two differ
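
To make the ROC-AUC criterion concrete, here is a small sketch that computes ROC-AUC from labels and scores through the rank-sum (Mann-Whitney U) formulation: the probability that a randomly chosen purchaser is scored above a randomly chosen non-purchaser. It uses only the standard library; the function names and the sample scores are illustrative assumptions, and a library implementation would normally be used in practice.

```python
def roc_auc(labels, scores):
    """ROC-AUC via average ranks; tied scores share their mean rank."""
    if len(labels) != len(scores):
        raise ValueError("labels and scores must have the same length")
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative label")

    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        # Extend j to cover the whole group of tied scores.
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    u_statistic = rank_sum_pos - n_pos * (n_pos + 1) / 2
    return u_statistic / (n_pos * n_neg)


def meets_target(auc, target=0.82):
    """Check a validation AUC against the threshold in the success criteria."""
    return auc >= target


# Illustrative labels and scores only.
example_auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

Because ROC-AUC depends only on the ordering of scores, it is not distorted by the 28/72 class split the way raw accuracy would be.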

Constraints

Predictions will run in a daily batch job on all active customers
Marketing stakeholders need interpretable outputs, not just raw scores
The segmentation approach should be stable enough to refresh monthly
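
One way to quantify the monthly-refresh stability constraint is to compare segment assignments for the same customers across two refreshes with the adjusted Rand index (ARI), which is unaffected by segment renumbering. The sketch below is a standard-library implementation under that assumption; the choice of ARI as the stability measure and the sample labels are illustrative, not part of the original brief.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Agreement between two clusterings of the same customers.

    1.0 means identical partitions (up to relabeling); values near 0
    mean agreement no better than chance.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same customers")
    n = len(labels_a)
    if n < 2:
        raise ValueError("need at least two customers")

    # Count pairs of customers placed together in both, in A, and in B.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = together_a * together_b / comb(n, 2)
    maximum = (together_a + together_b) / 2
    if maximum == expected:
        # Both partitions are trivial (all one segment or all singletons).
        return 1.0
    return (together_both - expected) / (maximum - expected)


# Illustrative segment assignments for six customers in two refreshes.
last_month = ["a", "a", "b", "b", "c", "c"]
this_month = ["x", "x", "y", "y", "z", "y"]
stability = adjusted_rand_index(last_month, this_month)
```

A consistently low score across refreshes would suggest the segments are being driven by noise or by too large a number of clusters.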

Deliverables

Build a supervised learning pipeline to predict purchased_next_30d
Build an unsupervised learning pipeline to segment customers
Compare the goals, inputs, outputs, and evaluation of both approaches
Explain preprocessing choices for mixed numerical and categorical data
Recommend how both models would be used together in production
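
As a sketch of how the two model outputs might be combined in the daily batch job, the snippet below turns a raw purchase probability into a readable band and attaches the customer's segment, so marketing receives an interpretable record rather than a bare score. The thresholds, segment name, and field names are hypothetical placeholders, not values from the brief.

```python
def score_band(purchase_probability, low=0.2, high=0.6):
    """Map a probability to a coarse band; the cut-offs are assumed, not tuned."""
    if not 0.0 <= purchase_probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if purchase_probability >= high:
        return "high"
    if purchase_probability >= low:
        return "medium"
    return "low"


def daily_record(customer_id, purchase_probability, segment):
    """Combine the supervised score and the unsupervised segment for one customer."""
    return {
        "customer_id": customer_id,
        "segment": segment,
        "score_band": score_band(purchase_probability),
        "purchase_probability": round(purchase_probability, 3),
    }


# Hypothetical customer and segment name, for illustration only.
record = daily_record("c-001", 0.7134, "discount_seekers")
```

In this arrangement the segment would change only at the monthly refresh, while the score band is recomputed every day.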

