Business Context
The University of Kentucky wants to better support first-year students using data from UK Canvas, myUK, and advising systems. You need to demonstrate the practical difference between supervised and unsupervised learning by solving two related problems on the same student dataset: predicting academic risk and discovering student engagement segments.
Dataset
You are given a term-level dataset covering 24,000 undergraduate students over 6 semesters (about 96,000 student-term records) with behavioral, academic, and demographic features.
| Feature Group | Count | Examples |
|---|---|---|
| Academic history | 10 | prior_gpa, credits_attempted, credits_completed, dropped_courses |
| LMS engagement (UK Canvas) | 12 | weekly_logins, assignment_submissions, discussion_posts, late_submissions |
| Advising & support | 6 | advising_visits, tutoring_sessions, holds_count, financial_aid_changes |
| Enrollment & demographics | 8 | residency_status, major_college, class_year, first_gen_flag |
| Temporal features | 5 | week_of_term aggregates, trend in logins, trend in grades |
- Target for supervised task: risk_flag = 1 if the student ends the term with GPA < 2.0 or withdraws, else 0
- Unsupervised task: no label; identify meaningful student segments for intervention design
- Class balance: about 18% positive for risk_flag
- Missing data: 12% missing in advising/tutoring features, 4% missing in LMS features for late-added courses
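The target definition and the missing-data note above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a prescribed pipeline: the field names `term_gpa` and `withdrew` are hypothetical (the brief does not name the raw columns), and median imputation with a missingness indicator is only one reasonable assumption for handling the gaps in advising and tutoring features.

```python
from statistics import median
from typing import List, Optional, Tuple


def risk_flag(term_gpa: Optional[float], withdrew: bool) -> int:
    """Return 1 if the student withdrew or ended the term with GPA < 2.0, else 0."""
    if withdrew:
        return 1
    if term_gpa is None:
        # Assumption: a record with no GPA and no withdrawal cannot be labeled.
        raise ValueError("cannot label a record with no GPA and no withdrawal")
    return int(term_gpa < 2.0)


def positive_rate(flags: List[int]) -> float:
    """Share of records labeled 1; useful for checking class balance."""
    return sum(flags) / len(flags)


def impute_with_indicator(
    values: List[Optional[float]],
) -> Tuple[List[float], List[int]]:
    """Fill missing values with the observed median and return a 0/1 missing flag.

    Keeping the flag lets a model learn from the fact that a value was absent,
    for example a student with no recorded advising visits.
    """
    observed = [v for v in values if v is not None]
    fill = median(observed)
    filled = [fill if v is None else v for v in values]
    missing = [int(v is None) for v in values]
    return filled, missing


# Toy records with hypothetical field names, for illustration only.
records = [
    {"term_gpa": 3.4, "withdrew": False},
    {"term_gpa": 1.7, "withdrew": False},
    {"term_gpa": None, "withdrew": True},
    {"term_gpa": 2.0, "withdrew": False},
    {"term_gpa": 2.9, "withdrew": False},
]
flags = [risk_flag(r["term_gpa"], r["withdrew"]) for r in records]
advising_filled, advising_missing = impute_with_indicator([2, None, 0, 4])
```

Note that a GPA of exactly 2.0 is labeled 0, because the definition uses a strict inequality.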
Success Criteria
A strong solution should:
- Achieve ROC-AUC >= 0.82 and F1 >= 0.55 on the supervised risk model
- Produce 3-6 interpretable clusters with a silhouette score >= 0.20 for the unsupervised task
- Clearly explain when supervised learning is appropriate versus when unsupervised learning is more useful
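To make the three metrics above concrete, here is a small standard-library sketch of how each is computed. It is meant for understanding rather than production: the pairwise ROC-AUC and the silhouette loop are quadratic in the number of records, so on roughly 96,000 rows a library implementation (for example `roc_auc_score`, `f1_score`, and `silhouette_score` in scikit-learn) would be the practical choice. The 0.5 decision threshold and the Euclidean distance are assumptions, not requirements from the brief.

```python
import math
from statistics import mean


def roc_auc(y_true, scores):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def f1_at_threshold(y_true, scores, threshold=0.5):
    """F1 for the positive class after thresholding the scores."""
    preds = [int(s >= threshold) for s in scores]
    tp = sum(1 for y, p in zip(y_true, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, preds) if y == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def silhouette(points, labels):
    """Mean silhouette coefficient using Euclidean distance."""
    if len(set(labels)) < 2:
        raise ValueError("silhouette needs at least two clusters")
    total = 0.0
    for i, (p, li) in enumerate(zip(points, labels)):
        by_cluster = {}
        for j, (q, lj) in enumerate(zip(points, labels)):
            if i != j:
                by_cluster.setdefault(lj, []).append(math.dist(p, q))
        if li not in by_cluster:
            continue  # singleton cluster contributes 0
        a = mean(by_cluster[li])  # cohesion: mean distance within own cluster
        b = min(mean(d) for l, d in by_cluster.items() if l != li)  # separation
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


# Toy inputs for illustration only.
y_true = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.2, 0.6, 0.7, 0.1, 0.4]
auc = roc_auc(y_true, scores)
f1 = f1_at_threshold(y_true, scores)
sil = silhouette([(0, 0), (0, 1), (10, 0), (10, 1)], [0, 0, 1, 1])
```

ROC-AUC is threshold-free while F1 depends on the chosen cutoff, so with about 18% positives the threshold is worth tuning on a validation set rather than leaving at 0.5.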
Constraints
- Risk scores will be generated weekly in batch for about 30,000 active students
- Student success staff need interpretable outputs, not a black-box-only solution
- FERPA-sensitive data should be minimized in production features
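One way to satisfy these constraints together is a linear (logistic) scorer whose per-feature contributions can be shown to staff alongside each score. The sketch below assumes such a model; the intercept and coefficients are hypothetical placeholders chosen for illustration, not fitted values, and the three behavioral features are used only to show that a scorer can run without demographic fields.

```python
import math

# Hypothetical parameters for illustration only; a real model would be fitted.
INTERCEPT = -1.0
COEFFICIENTS = {
    "late_submissions": 0.4,
    "weekly_logins": -0.2,
    "holds_count": 0.5,
}


def score_student(features):
    """Return (risk probability, feature names ranked by absolute contribution)."""
    contributions = {
        name: coef * features[name] for name, coef in COEFFICIENTS.items()
    }
    logit = INTERCEPT + sum(contributions.values())
    probability = 1.0 / (1.0 + math.exp(-logit))
    drivers = sorted(
        contributions, key=lambda name: abs(contributions[name]), reverse=True
    )
    return probability, drivers


def score_batch(students):
    """Score every student in a weekly batch, keyed by an opaque student id."""
    return {sid: score_student(feats) for sid, feats in students.items()}


# Toy batch with a made-up id.
weekly_batch = {
    "s1": {"late_submissions": 3, "weekly_logins": 1, "holds_count": 1},
}
results = score_batch(weekly_batch)
prob_s1, drivers_s1 = results["s1"]
```

At about 30,000 students per week, a batch like this is computationally trivial, so interpretability rather than speed is the constraint that shapes the model choice.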
Deliverables
- Build a supervised classification model to predict risk_flag
- Build an unsupervised clustering model to segment students
- Compare the two approaches: objective, inputs, outputs, evaluation, and business use
- Recommend which approach should power early-alert workflows in UK advising
- Provide feature importance and cluster profiles suitable for non-technical stakeholders
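For the cluster-profile deliverable, a simple starting point is a table of cluster sizes and per-cluster feature means, which non-technical readers can scan directly. The sketch below assumes cluster labels have already been produced by some clustering step (not shown); the helper name `cluster_profiles` and the toy rows are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def cluster_profiles(rows, labels, features):
    """Summarize each cluster by its size and the mean of each listed feature."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)
    profiles = {}
    for label, members in grouped.items():
        profile = {"size": len(members)}
        for feature in features:
            profile[feature] = mean(r[feature] for r in members)
        profiles[label] = profile
    return profiles


# Toy rows for illustration only; labels would come from the clustering model.
rows = [
    {"weekly_logins": 10, "advising_visits": 2},
    {"weekly_logins": 12, "advising_visits": 0},
    {"weekly_logins": 2, "advising_visits": 0},
    {"weekly_logins": 4, "advising_visits": 1},
]
labels = [0, 0, 1, 1]
profiles = cluster_profiles(rows, labels, ["weekly_logins", "advising_visits"])
```

Reporting means in the original units (logins per week, visits per term) rather than standardized values tends to make the segments easier for advising staff to name and act on.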