Select Models for Learner Completion

Medium

Machine Learning

Asked at 1 company

Topics: Supervised Learning, Cross-Validation, Bias-Variance Tradeoff

Problem

Business Context

Data Society wants to predict whether a learner enrolled in a cohort-based training program will complete the course and earn a certificate. The model will be used in Data Society's internal learner success workflow to prioritize outreach for at-risk learners before the final project deadline.

Dataset

You are given a historical dataset of learner enrollments from the last 24 months.

Feature Group	Count	Examples
Demographics	6	region, years_experience, job_function
Enrollment metadata	5	program_track, cohort_size, funding_source, enrollment_channel
Engagement	14	sessions_attended, attendance_rate, days_since_last_login, assignments_submitted
Assessment	8	quiz_avg, project_checkpoint_score, late_submission_count
Support interactions	4	mentor_messages_sent, office_hours_attended
Temporal	5	week_of_program, enrollment_month, days_active_last_14d

Rows: 62,000 learner enrollments
Features: 42 total predictors
Target: completed_certificate (1 if learner completed the program, 0 otherwise)
Class balance: 68% completed, 32% did not complete
Missing data: 12% missing in assessment features for learners who skipped checkpoints; 7% missing in demographics
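
The missing-data note above suggests that a missing assessment value carries signal, because it means the learner skipped a checkpoint. One way to preserve that signal is to impute and also add a missingness indicator. The sketch below is a minimal illustration on a four-row toy frame; the values are made up, the column subset is chosen only for brevity, and the choice of median imputation is an assumption rather than something the problem statement prescribes.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy rows (made-up values) using column names from the dataset description.
df = pd.DataFrame({
    "quiz_avg": [0.82, np.nan, 0.64, 0.91],          # assessment, has a gap
    "attendance_rate": [0.95, 0.40, 0.70, 0.88],     # engagement, complete
    "region": pd.Series(["EMEA", np.nan, "AMER", "EMEA"], dtype=object),
})

numeric_cols = ["quiz_avg", "attendance_rate"]
categorical_cols = ["region"]

# Numeric: median-impute and append a 0/1 flag for each column that had gaps,
# so "skipped the checkpoint" stays visible to the model. The flag column is
# scaled along with the others here, which is harmless for this sketch.
numeric_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median", add_indicator=True)),
    ("scale", StandardScaler()),
])

# Categorical: treat missing as its own level, then one-hot encode.
categorical_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="constant", fill_value="missing")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer([
    ("num", numeric_pipe, numeric_cols),
    ("cat", categorical_pipe, categorical_cols),
])

X = preprocess.fit_transform(df)
print(X.shape)
```

Fitting the transformer inside the training pipeline, rather than on the full dataset up front, keeps imputation statistics from leaking across cross-validation folds.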

Success Criteria

A good solution should improve intervention targeting over a naive baseline and support operational use. Aim for ROC-AUC >= 0.82 and F1 >= 0.72 on a held-out test set, while also explaining why the selected model is more appropriate than reasonable alternatives.
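
A quick arithmetic check helps when reading the F1 target. The problem statement does not say which class F1 is computed on. Assuming the positive class is completed_certificate = 1, a naive rule that predicts "completed" for everyone already scores about 0.81, which is above the 0.72 target, so it may be more informative to report F1 on the non-completion class that outreach actually targets. The sketch below works this out from the stated row count and class balance; the helper name is hypothetical.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 from confusion-matrix counts, returning 0.0 when undefined."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

n_rows = 62_000
n_completed = round(0.68 * n_rows)       # 42,160 learners who completed
n_not_completed = n_rows - n_completed   # 19,840 learners who did not

# Naive baseline: predict "completed" for every learner.
# Scored with "completed" as the positive class:
f1_completed = f1_from_counts(tp=n_completed, fp=n_not_completed, fn=0)

# Same predictions scored with "did not complete" as the positive class:
# no learner is ever flagged, so there are no true or false positives.
f1_not_completed = f1_from_counts(tp=0, fp=0, fn=n_not_completed)

print(f"F1 (completed as positive):     {f1_completed:.4f}")
print(f"F1 (not completed as positive): {f1_not_completed:.4f}")
```

ROC-AUC is unaffected by this choice, since a constant score gives 0.5 regardless of which class is labeled positive.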

Constraints

Predictions run weekly in batch on the active learner population
Stakeholders need interpretable drivers of risk, not just raw scores
Training and scoring should fit within standard Python infrastructure used by Data Society
Avoid leakage from future engagement or post-deadline features
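
The leakage constraint can be made concrete by building every engagement feature "as of" the week in which the prediction would have been made. The sketch below assumes a hypothetical per-week event log (the `events` frame, its columns, and the `engagement_as_of` helper are illustrative and not part of the dataset description) and shows that activity after the snapshot week is excluded from the aggregate.

```python
import pandas as pd

# Hypothetical weekly activity log; the real source tables are not specified.
events = pd.DataFrame({
    "learner_id": [1, 1, 1, 2, 2, 2],
    "week":       [1, 2, 5, 1, 3, 6],
    "sessions":   [2, 1, 3, 1, 0, 2],
})

def engagement_as_of(events: pd.DataFrame, snapshot_week: int) -> pd.DataFrame:
    """Aggregate sessions using only events on or before snapshot_week."""
    visible = events[events["week"] <= snapshot_week]
    return (
        visible.groupby("learner_id")["sessions"]
        .sum()
        .rename("sessions_attended")
        .reset_index()
    )

# Features for a prediction made at the end of week 3:
# learner 1's week-5 sessions and learner 2's week-6 sessions are ignored.
features_week3 = engagement_as_of(events, snapshot_week=3)
print(features_week3)
```

The same idea applies to the train/test split: holding out the most recent cohorts, rather than a random sample of rows, gives an estimate closer to how the weekly batch job will actually be used.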

Deliverables

Compare at least three candidate models for this binary classification task.
Justify model selection based on data characteristics, bias-variance tradeoff, and operational constraints.
Build a reproducible training pipeline with preprocessing and hyperparameter tuning.
Evaluate on a held-out test set with appropriate classification metrics.
Provide feature importance or coefficient-based interpretation for the final model.
