Business Context
Collabera wants to predict whether a consultant will become an attrition risk in the next 60 days so that account managers can intervene early. You are given historical consultant-level data from Collabera Talent Solutions and must design a feature engineering pipeline that improves model quality while remaining explainable to operations teams.
Dataset
| Feature Group | Count | Examples |
|---|---|---|
| Consultant profile | 10 | years_experience, skill_family, location, visa_status |
| Assignment history | 12 | current_project_duration_days, bill_rate, overtime_hours, redeploy_count |
| Engagement signals | 9 | manager_checkins_30d, training_hours_90d, portal_logins_30d |
| HR / payroll | 8 | payment_delay_days, benefits_enrolled, leave_days_90d |
| Temporal fields | 6 | assignment_start_date, last_checkin_date, tenure_days |
- Size: 142K consultant-month records across 24 months, 45 raw columns
- Target: Binary label indicating whether the consultant exits or becomes unassigned within 60 days
- Class balance: 11.6% positive, 88.4% negative
- Missing data: 18% missing in engagement fields, 7% in payroll fields, and sparse values for newly onboarded consultants
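One way to handle the missing engagement and payroll values without losing the signal that a value was absent is to add explicit missingness flags and then impute with statistics learned from the training period only. The sketch below assumes pandas; the helper names `add_missing_flags` and `impute_from_train` and the toy rows are hypothetical, and only the column names come from the dataset description.

```python
import numpy as np
import pandas as pd

# Two engagement columns from the dataset description, used for illustration.
ENGAGEMENT_COLS = ["manager_checkins_30d", "portal_logins_30d"]

def add_missing_flags(df, cols):
    """Return a copy with a 0/1 `<col>_missing` flag for each listed column."""
    out = df.copy()
    for col in cols:
        out[f"{col}_missing"] = out[col].isna().astype("int8")
    return out

def impute_from_train(train, other, cols):
    """Fill NaNs in both frames using medians computed on `train` only."""
    medians = train[cols].median()
    train_out, other_out = train.copy(), other.copy()
    train_out[cols] = train_out[cols].fillna(medians)
    other_out[cols] = other_out[cols].fillna(medians)
    return train_out, other_out

# Toy rows (made up) standing in for consultant-month records.
train = pd.DataFrame({
    "manager_checkins_30d": [2.0, np.nan, 4.0],
    "portal_logins_30d": [10.0, 6.0, np.nan],
})
test = pd.DataFrame({
    "manager_checkins_30d": [np.nan],
    "portal_logins_30d": [np.nan],
})

# Flag first, then impute, so the flags still record the original gaps.
train_f = add_missing_flags(train, ENGAGEMENT_COLS)
test_f = add_missing_flags(test, ENGAGEMENT_COLS)
train_f, test_f = impute_from_train(train_f, test_f, ENGAGEMENT_COLS)
```

Keeping the flag as its own column lets an operations team read "no portal activity recorded" as a separate risk driver from "low portal activity". This may matter for newly onboarded consultants whose fields are sparse.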
Success Criteria
A strong solution should improve performance over a raw-feature baseline and reach ROC-AUC >= 0.82 and PR-AUC >= 0.42 on a held-out time-based test set. The feature set should also support clear explanation of the top risk drivers.
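The two targets can be checked with a small helper. This is a minimal sketch assuming scikit-learn is available and that PR-AUC is estimated with average precision, which is one common choice rather than something the brief specifies. The function name `evaluate` and the toy labels and scores are hypothetical. For context, a no-skill classifier would score about 0.116 on PR-AUC at the stated 11.6% positive rate, so the 0.42 target is well above chance.

```python
from sklearn.metrics import average_precision_score, roc_auc_score

def evaluate(y_true, y_score, roc_target=0.82, pr_target=0.42):
    """Compute ROC-AUC and PR-AUC (as average precision) and compare to targets."""
    roc = float(roc_auc_score(y_true, y_score))
    pr = float(average_precision_score(y_true, y_score))
    return {
        "roc_auc": roc,
        "pr_auc": pr,
        "meets_targets": bool(roc >= roc_target and pr >= pr_target),
    }

# Toy labels and scores (made up) to show the call shape.
imperfect = evaluate([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
perfect = evaluate([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9])
```

Running the same helper on the raw-feature baseline and on the feature-engineered model, using the same held-out time-based test set, gives a like-for-like comparison.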
Constraints
- Use only information available up to the prediction date; avoid leakage from future assignment outcomes.
- Batch scoring must finish nightly for ~35K active consultants.
- The final feature pipeline should be reproducible and deployable in Collabera's scheduled ML workflow.
- Prefer interpretable engineered features over opaque embeddings.
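The leakage constraint can be made concrete with a time-based split that leaves a gap equal to the label horizon. Because the target looks 60 days ahead, a training row dated just before the test period would have a label window that overlaps the test period. The sketch below drops those rows. It assumes pandas and a hypothetical `snapshot_date` column marking the prediction date of each consultant-month record; the function name and the example dates are illustrative.

```python
import pandas as pd

def time_based_split(df, test_start, date_col="snapshot_date", label_horizon_days=60):
    """Split by prediction date, keeping a label-horizon gap before the test period.

    Training rows must be dated strictly earlier than test_start minus the
    horizon, so their 60-day label windows end before the test period begins.
    """
    test_start = pd.Timestamp(test_start)
    train_cutoff = test_start - pd.Timedelta(days=label_horizon_days)
    train = df[df[date_col] < train_cutoff]
    test = df[df[date_col] >= test_start]
    return train, test

# 24 monthly snapshots (made-up dates), one row per month for illustration.
snapshots = pd.DataFrame(
    {"snapshot_date": pd.date_range("2023-01-01", periods=24, freq="MS")}
)
train, test = time_based_split(snapshots, test_start="2024-10-01")
```

In this toy example the September 2024 snapshot falls in the gap and is used for neither training nor testing. The same rule applies to feature windows: fields such as `portal_logins_30d` should be computed only from events dated on or before each row's prediction date.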
Deliverables
- Define a feature engineering plan for numerical, categorical, and temporal fields.
- Build a training pipeline with preprocessing, feature generation, and a classification model.
- Compare at least one baseline model against a feature-engineered model.
- Explain how you prevent leakage and validate using time-based splits.
- Report evaluation metrics and identify the most useful engineered features.
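As a starting point for the training pipeline, the sketch below combines one engineered temporal feature with preprocessing and an interpretable classifier in a single scikit-learn `Pipeline`. It is an illustration under stated assumptions, not a reference solution. The `snapshot_date` column, the feature `days_since_last_checkin`, the choice of logistic regression, and the synthetic data generator are all hypothetical; the remaining column names come from the dataset description.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

NUMERIC = ["bill_rate", "overtime_hours", "portal_logins_30d", "days_since_last_checkin"]
CATEGORICAL = ["skill_family", "visa_status"]

def add_temporal_features(df):
    """Derive recency of the last check-in relative to the prediction date."""
    out = df.copy()
    out["days_since_last_checkin"] = (out["snapshot_date"] - out["last_checkin_date"]).dt.days
    return out

def build_pipeline():
    """Preprocessing plus a class-weighted logistic regression."""
    numeric = Pipeline([
        ("impute", SimpleImputer(strategy="median", add_indicator=True)),
        ("scale", StandardScaler()),
    ])
    categorical = Pipeline([
        ("impute", SimpleImputer(strategy="constant", fill_value="missing")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ])
    pre = ColumnTransformer([("num", numeric, NUMERIC), ("cat", categorical, CATEGORICAL)])
    model = LogisticRegression(class_weight="balanced", max_iter=1000)
    return Pipeline([("pre", pre), ("model", model)])

def make_synthetic(n=400, seed=0):
    """Made-up data for a smoke test; not representative of the real dataset."""
    rng = np.random.default_rng(seed)
    snapshot = pd.Timestamp("2024-01-31")
    gap = rng.integers(1, 90, size=n)
    df = pd.DataFrame({
        "snapshot_date": snapshot,
        "last_checkin_date": snapshot - pd.to_timedelta(gap, unit="D"),
        "bill_rate": rng.normal(70, 15, size=n),
        "overtime_hours": rng.gamma(2.0, 3.0, size=n),
        "portal_logins_30d": rng.poisson(6, size=n).astype(float),
        "skill_family": rng.choice(["data", "cloud", "qa"], size=n),
        "visa_status": rng.choice(["citizen", "h1b"], size=n),
    })
    # Mimic missing engagement values.
    df.loc[rng.random(n) < 0.18, "portal_logins_30d"] = np.nan
    # Synthetic label: risk rises with time since the last check-in.
    prob = 1.0 / (1.0 + np.exp(-(gap - 60) / 10.0))
    y = (rng.random(n) < prob).astype(int)
    return df, y

raw, y = make_synthetic()
features = add_temporal_features(raw)
pipe = build_pipeline().fit(features[NUMERIC + CATEGORICAL], y)
scores = pipe.predict_proba(features[NUMERIC + CATEGORICAL])[:, 1]
```

Because the numeric inputs are standardized, the fitted coefficients give a rough ranking of risk drivers that can be shown to operations teams. A baseline can reuse `build_pipeline` with the engineered columns removed from `NUMERIC`, which keeps the comparison limited to the effect of the features.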