Select Features for Loan Default

Business Context

LendWise, a mid-size digital lender processing ~300K consumer loan applications per year, wants a simpler and more stable default-risk model. The current model uses every available field, which has increased training time, reduced interpretability, and introduced noisy predictors.

Dataset

You are given a historical loan application dataset for a binary classification problem: predict whether an approved loan will default within 12 months.

Feature Group	Count	Examples
Applicant demographics	8	age, employment_length, home_ownership, marital_status
Credit bureau variables	14	fico_score, revolving_utilization, delinquencies_2y, inquiries_6m
Financial variables	10	annual_income, debt_to_income, loan_amount, installment
Application metadata	6	channel, state, application_hour, repeat_customer
Engineered candidates	12	income_to_loan_ratio, utilization_bucket, recent_inquiry_rate

Rows: 120K approved loans from the last 24 months
Target: default_12m (1 = defaulted within 12 months, 0 = repaid/current)
Class balance: 18% positive, 82% negative
Missing data: ~9% missing in employment and income-related fields; ~4% missing in bureau variables

Success Criteria

A good solution should reduce the feature set meaningfully while maintaining predictive quality. Target performance on the holdout set is ROC-AUC of at least 0.78 and F1 of at least 0.50, with no more than a 2-point (0.02) ROC-AUC drop versus the full-feature baseline.
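The criteria above can be turned into an explicit acceptance check. The following is a minimal sketch, not a prescribed implementation: the function name `meets_success_criteria` and its parameters are hypothetical, the targets are read as minimums, and the "2-point" drop is assumed to mean 0.02 on the 0-1 ROC-AUC scale.

```python
def meets_success_criteria(baseline_auc, candidate_auc, candidate_f1,
                           min_auc=0.78, min_f1=0.50, max_auc_drop=0.02):
    """Check a reduced-feature model's holdout metrics against the targets.

    Returns (passed, reasons), where reasons lists every failed criterion.
    All thresholds are assumptions taken from the success criteria text.
    """
    eps = 1e-9  # tolerance so that a drop of exactly 0.02 is not failed by float error
    reasons = []
    if candidate_auc < min_auc - eps:
        reasons.append(f"ROC-AUC {candidate_auc:.3f} is below the {min_auc:.2f} target")
    if candidate_f1 < min_f1 - eps:
        reasons.append(f"F1 {candidate_f1:.3f} is below the {min_f1:.2f} target")
    drop = baseline_auc - candidate_auc
    if drop > max_auc_drop + eps:
        reasons.append(f"ROC-AUC drop {drop:.3f} exceeds the allowed {max_auc_drop:.2f}")
    return (not reasons, reasons)


# Illustrative numbers only, not results from the LendWise data.
passed, reasons = meets_success_criteria(baseline_auc=0.80,
                                         candidate_auc=0.785,
                                         candidate_f1=0.52)
print(passed, reasons)
```

Returning the list of failed criteria, rather than a bare boolean, gives a governance review something concrete to record.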

Constraints

The risk team needs a feature selection approach that is explainable and reproducible.
Training must complete within 30 minutes on a standard CPU machine.
The final feature set should be stable enough for monthly retraining and model governance reviews.
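One way to quantify the stability constraint is to rerun selection on each retraining window and measure how much the selected sets overlap. The sketch below uses mean pairwise Jaccard similarity and per-feature selection frequency. Both are common choices rather than anything this brief mandates, the helper names are hypothetical, and the example windows are invented for illustration.

```python
from collections import Counter
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two feature sets (1.0 means identical)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def mean_pairwise_jaccard(feature_sets):
    """Average Jaccard similarity over every pair of selected feature sets."""
    pairs = list(combinations(feature_sets, 2))
    if not pairs:
        raise ValueError("need at least two feature sets to compare")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def selection_frequency(feature_sets):
    """Fraction of windows in which each feature was selected."""
    counts = Counter(f for fs in feature_sets for f in set(fs))
    return {f: c / len(feature_sets) for f, c in counts.items()}


# Hypothetical selections from three monthly retraining windows.
windows = [
    {"fico_score", "debt_to_income", "revolving_utilization"},
    {"fico_score", "debt_to_income", "inquiries_6m"},
    {"fico_score", "debt_to_income", "revolving_utilization"},
]
stability = mean_pairwise_jaccard(windows)
frequency = selection_frequency(windows)
print(round(stability, 3), frequency)
```

Features selected in every window are natural candidates for the final set. Features that appear only sporadically deserve scrutiny before they reach production.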

Deliverables

Build a baseline model using all candidate features.
Implement at least two feature selection methods (for example: filter, embedded, or wrapper methods).
Compare selected feature sets using cross-validated performance and holdout metrics.
Recommend a final feature set and explain why it is appropriate for production.
Identify risks such as leakage, correlated variables, and instability across retraining windows.
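The first three deliverables can be sketched end to end with scikit-learn. This is an illustrative outline under stated assumptions, not a reference solution. It uses synthetic data from `make_classification` as a stand-in for the loan dataset (50 numeric features, roughly 18% positives, about 5% injected missing values). It picks logistic regression as the model, an ANOVA F-test filter and a coefficient-based embedded selector as the two methods, and keeps an arbitrary K = 15 features. The helper `build_pipeline` is hypothetical. Selection sits inside the pipeline so it is refit on each training fold, which avoids leaking holdout information into the chosen features.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel, SelectKBest, f_classif
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the loan data (assumption: all features numeric).
X, y = make_classification(n_samples=4000, n_features=50, n_informative=10,
                           n_redundant=10, weights=[0.82, 0.18], random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.05] = np.nan  # inject ~5% missing values

X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)


def build_pipeline(selector=None):
    """Impute, scale, optionally select features, then fit logistic regression."""
    steps = [("impute", SimpleImputer(strategy="median")),
             ("scale", StandardScaler())]
    if selector is not None:
        steps.append(("select", selector))
    steps.append(("model", LogisticRegression(max_iter=1000, class_weight="balanced")))
    return Pipeline(steps)


K = 15  # number of features to keep (arbitrary choice for this sketch)
candidates = {
    "baseline_all": build_pipeline(),
    # Filter method: univariate ANOVA F-test, keep the top K features.
    "filter_anova": build_pipeline(SelectKBest(f_classif, k=K)),
    # Embedded method: keep the K largest absolute logistic coefficients.
    "embedded_logreg": build_pipeline(SelectFromModel(
        LogisticRegression(max_iter=1000, class_weight="balanced"),
        max_features=K, threshold=-np.inf)),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
results = {}
for name, pipe in candidates.items():
    cv_auc = cross_val_score(pipe, X_train, y_train, cv=cv, scoring="roc_auc").mean()
    pipe.fit(X_train, y_train)
    proba = pipe.predict_proba(X_hold)[:, 1]
    if "select" in pipe.named_steps:
        n_features = int(pipe.named_steps["select"].get_support().sum())
    else:
        n_features = X.shape[1]
    results[name] = {
        "n_features": n_features,
        "cv_auc": float(cv_auc),
        "holdout_auc": float(roc_auc_score(y_hold, proba)),
        # 0.5 is a placeholder threshold; tune it on validation data for F1.
        "holdout_f1": float(f1_score(y_hold, (proba >= 0.5).astype(int))),
    }

for name, metrics in results.items():
    print(name, metrics)
```

On the real dataset, categorical fields such as `home_ownership` or `state` would need encoding before these selectors apply, and the holdout split would more realistically be time-based to mirror monthly retraining. Both are left out of this sketch.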
