Segment and Predict Retail Customer Behavior

Business Context

Northstar Retail, an online marketplace with 1.2M customers, wants to better understand customer behavior and improve repeat-purchase targeting. The analytics team needs two things: customer segments to inform marketing strategy, and a model that predicts whether a customer will make a purchase in the next 30 days.

Dataset

You are given a customer-level dataset built from 12 months of transaction and engagement history.

Feature Group	Count	Examples
Demographics	5	age_bucket, region, acquisition_channel, device_type, loyalty_tier
Transaction history	10	total_orders, avg_order_value, days_since_last_order, refund_rate
Engagement	8	email_open_rate, app_sessions_30d, product_views_30d, cart_add_rate
Support & returns	4	support_tickets_90d, return_count_90d, avg_resolution_time
Target	1	purchased_next_30d

Size: 240K customers, 27 input features
Target: Binary label indicating whether the customer makes at least one purchase in the next 30 days
Class balance: 28% positive, 72% negative
Missing data: 12% missing in engagement fields for email-unsubscribed users, 6% missing in demographics
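
One possible way to handle the missing values described above is sketched here. It is a minimal illustration on synthetic data, not a prescribed solution. The category values ("north", "mobile", and so on), the median/constant imputation strategies, and the random missingness pattern are all assumptions. In the real dataset the engagement gaps are tied to email-unsubscribed users, which is why the sketch keeps a missing-value indicator column.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)
n = 1_000  # small synthetic stand-in, not the real 240K-row dataset

df = pd.DataFrame({
    "email_open_rate": rng.uniform(0, 1, n),
    "app_sessions_30d": rng.poisson(5, n).astype(float),
    # Category values below are hypothetical placeholders.
    "region": pd.Series(rng.choice(["north", "south", "west"], n), dtype=object),
    "device_type": pd.Series(rng.choice(["mobile", "desktop"], n), dtype=object),
})
# Inject missing values at roughly the rates quoted in the problem (assumed random here).
df.loc[rng.random(n) < 0.12, "email_open_rate"] = np.nan
df.loc[rng.random(n) < 0.06, "region"] = np.nan

numeric_cols = ["email_open_rate", "app_sessions_30d"]
categorical_cols = ["region", "device_type"]

preprocess = ColumnTransformer(
    [
        # Median-impute numeric fields and append a 0/1 "was missing" indicator,
        # then standardize all resulting numeric columns.
        ("num", Pipeline([
            ("impute", SimpleImputer(strategy="median", add_indicator=True)),
            ("scale", StandardScaler()),
        ]), numeric_cols),
        # Treat missing categories as their own level, then one-hot encode.
        ("cat", Pipeline([
            ("impute", SimpleImputer(strategy="constant", fill_value="missing")),
            ("onehot", OneHotEncoder(handle_unknown="ignore")),
        ]), categorical_cols),
    ],
    sparse_threshold=0,  # always return a dense array
)

X = preprocess.fit_transform(df)
print(X.shape)
```

The same fitted transformer could feed both the clustering and the classification pipelines, so that training and daily scoring apply identical preprocessing.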

Success Criteria

A strong solution should:

Build meaningful customer segments using an unsupervised method
Train a supervised model that achieves ROC-AUC of at least 0.82 on a held-out test set
Explain the practical difference between supervised and unsupervised learning using this dataset and the resulting outputs

Constraints

Marketing needs interpretable segments for campaign design
Scoring must run daily on 240K customers in under 10 minutes
The solution should use standard Python ML tooling and be maintainable by a small data team
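
To illustrate what "interpretable segments" might look like, the sketch below clusters synthetic customers with k-means and reports each segment's average in original units. The three synthetic groups, the choice of k = 3, and the use of k-means itself are assumptions made for the example; the problem does not prescribe a clustering method.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
features = ["total_orders", "avg_order_value", "days_since_last_order"]

# Three synthetic customer groups (hypothetical means and spreads).
groups = [
    rng.normal([12, 80, 10], [2, 10, 3], size=(300, 3)),
    rng.normal([3, 40, 60], [1, 8, 10], size=(300, 3)),
    rng.normal([1, 150, 200], [0.5, 20, 30], size=(300, 3)),
]
df = pd.DataFrame(np.vstack(groups), columns=features)

# Standardize first so no single feature dominates the distance metric.
X = StandardScaler().fit_transform(df)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=7).fit(X)
df["segment"] = kmeans.labels_

# Profile each segment in original units so marketing can read it directly.
profile = df.groupby("segment")[features].mean().round(1)
profile["n_customers"] = df["segment"].value_counts().sort_index()
print(profile)
```

A per-segment table of feature averages like this is one simple form of description that a marketing team can turn into campaign rules.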

Deliverables

Build an unsupervised learning pipeline to cluster customers and describe each segment
Build a supervised learning pipeline to predict purchased_next_30d
Compare the objectives, inputs, outputs, and evaluation of both approaches
Recommend how both models would be used together in production
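
One way the two outputs might be combined is sketched below: the segment label decides which campaign a customer belongs to, and the purchase probability decides who within that segment is targeted. The column names, the number of segments, and the "top 20% per segment" rule are hypothetical, and the scores are random placeholders rather than real model output.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 1_000

# Placeholder for a daily scoring table: one row per customer.
scored = pd.DataFrame({
    "customer_id": np.arange(n),
    "segment": rng.integers(0, 4, n),      # stand-in for clustering output
    "p_purchase_30d": rng.beta(2, 5, n),   # stand-in for classifier output
})

# Segment-level view: size and average purchase propensity.
summary = (
    scored.groupby("segment")["p_purchase_30d"]
    .agg(["size", "mean"])
    .rename(columns={"size": "n_customers", "mean": "mean_p_purchase"})
)
print(summary)

# Customer-level view: flag the top 20% by predicted probability within each segment.
cutoff = scored.groupby("segment")["p_purchase_30d"].transform(
    lambda s: s.quantile(0.8)
)
scored["target"] = scored["p_purchase_30d"] >= cutoff
print(scored["target"].mean())
```

Ranking within a segment rather than across all customers keeps every segment represented in a campaign, which is a design choice rather than a requirement.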
