Select Models for Learner Completion

Difficulty: Medium
Category: Machine Learning
Topics: Supervised Learning, Cross-Validation, Bias-Variance Tradeoff
Asked at: Data Society

Problem

Business Context

Data Society wants to predict whether a learner enrolled in a cohort-based training program will complete the course and earn a certificate. The model will be used in Data Society's internal learner success workflow to prioritize outreach for at-risk learners before the final project deadline.

Dataset

You are given a historical dataset of learner enrollments from the last 24 months.

Feature Group        | Count | Examples
Demographics         | 6     | region, years_experience, job_function
Enrollment metadata  | 5     | program_track, cohort_size, funding_source, enrollment_channel
Engagement           | 14    | sessions_attended, attendance_rate, days_since_last_login, assignments_submitted
Assessment           | 8     | quiz_avg, project_checkpoint_score, late_submission_count
Support interactions | 4     | mentor_messages_sent, office_hours_attended
Temporal             | 5     | week_of_program, enrollment_month, days_active_last_14d

  • Rows: 62,000 learner enrollments
  • Features: 42 total predictors
  • Target: completed_certificate (1 if learner completed the program, 0 otherwise)
  • Class balance: 68% completed, 32% did not complete
  • Missing data: 12% missing in assessment features for learners who skipped checkpoints; 7% missing in demographics

Success Criteria

A good solution should improve intervention targeting over a naive baseline and support operational use. Aim for ROC-AUC >= 0.82 and F1 >= 0.72 on a held-out test set, while also explaining why the selected model is more appropriate than reasonable alternatives.
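
To make the "naive baseline" concrete, the short sketch below works out what a rule that flags every learner would score, using only the 68% / 32% class balance stated above. The problem statement does not say which class the F1 target is measured on, so both readings are shown; the function name is illustrative, not part of any library.

```python
def f1_when_flagging_everyone(prevalence: float) -> float:
    """F1 of a rule that predicts the positive class for every row.

    Recall is 1.0 because no positive is missed, and precision equals
    the share of rows that really are positive (the prevalence).
    """
    precision = prevalence
    recall = 1.0
    return 2 * precision * recall / (precision + recall)


# Reading 1: positive class = completed (68% of rows).
f1_completed_positive = f1_when_flagging_everyone(0.68)

# Reading 2: positive class = did not complete (32% of rows).
f1_at_risk_positive = f1_when_flagging_everyone(0.32)

# Accuracy of always predicting the majority class.
majority_accuracy = 0.68

print(round(f1_completed_positive, 2), round(f1_at_risk_positive, 2))
```

Under the first reading the trivial rule already reaches an F1 of about 0.81, above the 0.72 target, while under the second it reaches only about 0.48. For an outreach use case it is therefore probably more informative to report F1 on the non-completion class, and to state that choice explicitly. A model that ranks learners at random would score a ROC-AUC of about 0.5, which is the reference point for the 0.82 target.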

Constraints

  • Predictions run weekly in batch on the active learner population
  • Stakeholders need interpretable drivers of risk, not just raw scores
  • Training and scoring should fit within standard Python infrastructure used by Data Society
  • Avoid leakage from future engagement or post-deadline features
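
One way to respect the leakage constraint is to build every engagement feature "as of" a snapshot date, so that nothing recorded after the scoring date can enter a training row. The sketch below illustrates the idea on a tiny, made-up event log; the log layout, learner IDs, dates, and function name are all assumptions for illustration and do not come from the dataset description.

```python
from datetime import date

# Hypothetical event log: (learner_id, event_date, event_type).
events = [
    ("a1", date(2024, 3, 1), "session_attended"),
    ("a1", date(2024, 3, 8), "session_attended"),
    ("a1", date(2024, 3, 20), "session_attended"),  # after the snapshot
    ("b2", date(2024, 3, 5), "session_attended"),
]


def sessions_attended_as_of(event_log, snapshot_date):
    """Count attended sessions per learner using only events on or
    before snapshot_date, so later activity cannot leak in."""
    counts = {}
    for learner_id, event_date, event_type in event_log:
        if event_type == "session_attended" and event_date <= snapshot_date:
            counts[learner_id] = counts.get(learner_id, 0) + 1
    return counts


features = sessions_attended_as_of(events, date(2024, 3, 10))
print(features)  # {'a1': 2, 'b2': 1}
```

The same cutoff would apply to every aggregate in the Engagement, Assessment, and Support groups. Training rows built this way match what the weekly batch job can actually see at scoring time.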

Deliverables

  1. Compare at least three candidate models for this binary classification task.
  2. Justify model selection based on data characteristics, bias-variance tradeoff, and operational constraints.
  3. Build a reproducible training pipeline with preprocessing and hyperparameter tuning.
  4. Evaluate on a held-out test set with appropriate classification metrics.
  5. Provide feature importance or coefficient-based interpretation for the final model.
