Select Features for Loan Default

Business Context

LendWise, a mid-size digital lender processing roughly 300K consumer loan applications per year, wants a default-risk model for underwriting. The current model uses nearly every available column, causing unstable performance, weak interpretability, and avoidable retraining issues when upstream fields change.

Dataset

You are given a historical loan application dataset and asked to design a feature selection strategy for a binary classification model predicting whether an approved loan will default within 12 months.

Feature Group	Count	Examples
Applicant demographics	8	age, region, employment_type, years_at_job
Credit bureau signals	12	credit_score, delinquencies_12m, utilization_rate, open_accounts
Financials	10	annual_income, debt_to_income, monthly_obligations, savings_balance
Application metadata	6	channel, device_type, application_hour, repeat_customer
Engineered candidates	9	income_per_open_account, recent_delinquency_flag, utilization_bucket

Size: 240K loan applications, 45 candidate features
Target: Binary — default within 12 months (1) vs no default (0)
Class balance: 14% positive, 86% negative
Missing data: 18% missing in savings_balance, 9% in employment fields, <3% elsewhere
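One way to handle a column such as savings_balance, where 18% of values are missing, is to impute it and keep a flag recording that the value was absent, since missingness itself can carry signal. The sketch below is only an illustration in plain Python: the helper names and the toy rows are assumptions, and in practice the fill value should be computed on training folds only so that no information leaks from validation data.

```python
from statistics import median

def missing_rate(rows, column):
    """Fraction of rows where `column` is None (or absent)."""
    return sum(r.get(column) is None for r in rows) / len(rows)

def impute_with_indicator(rows, column):
    """Median-impute `column` and add a `<column>_missing` 0/1 flag.

    Returns new row dicts plus the fill value, so the same value can be
    reused at scoring time. The input rows are not modified.
    """
    observed = [r[column] for r in rows if r.get(column) is not None]
    fill = median(observed)
    out = []
    for r in rows:
        is_missing = r.get(column) is None
        new_row = dict(r)
        new_row[column] = fill if is_missing else r[column]
        new_row[f"{column}_missing"] = int(is_missing)
        out.append(new_row)
    return out, fill

# Toy rows; the values are made up for illustration only.
rows = [
    {"savings_balance": 1200.0},
    {"savings_balance": None},
    {"savings_balance": 300.0},
    {"savings_balance": 5000.0},
    {"savings_balance": None},
]
rate = missing_rate(rows, "savings_balance")
imputed, fill_value = impute_with_indicator(rows, "savings_balance")
```

Keeping the indicator as its own feature lets the later selection step decide whether missingness is predictive, rather than hiding it inside the imputed value.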

Success Criteria

A good solution should improve generalization while reducing the feature set to a maintainable subset. Target performance is ROC-AUC ≥ 0.80 and PR-AUC ≥ 0.42 on a held-out test set, with no more than 20 selected features and a clear explanation of why they were kept or removed.
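To put the two targets in context: with 14% positives, a random ranking scores a PR-AUC of roughly 0.14, so 0.42 is about three times that baseline, while a random ranking scores 0.50 on ROC-AUC. The sketch below computes both metrics from scratch on a toy example and checks them against the stated thresholds. The function names are assumptions, average precision is used as one common estimate of PR-AUC, and the pairwise ROC-AUC loop is quadratic, so it is meant for illustration rather than for 240K rows.

```python
def roc_auc(y_true, scores):
    """Probability that a random positive outranks a random negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(y_true, scores):
    """Mean of precision@k taken at the rank of each positive.

    Tied scores are ordered arbitrarily in this simplified version.
    """
    ranked = sorted(zip(scores, y_true), key=lambda t: -t[0])
    hits, total = 0, 0.0
    for k, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / k
    return total / hits

def meets_targets(roc, pr, n_features):
    """Check the stated success criteria: ROC-AUC, PR-AUC, feature budget."""
    return roc >= 0.80 and pr >= 0.42 and n_features <= 20

# Toy labels and scores, made up for illustration only.
y_true = [1, 0, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]
roc = roc_auc(y_true, scores)
ap = average_precision(y_true, scores)
```

Reporting both metrics matters here because ROC-AUC can look healthy on imbalanced data even when precision among the highest-risk applications is poor.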

Constraints

The underwriting team requires feature-level interpretability.
Inference must complete in under 50 ms per application.
Avoid features that are expensive to collect or likely to be unavailable at scoring time.
The selected feature set should be stable enough for monthly retraining.
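The stability constraint can be made measurable by rerunning the selection procedure on several resamples (or on successive monthly snapshots) and comparing the resulting feature sets. The sketch below shows one possible approach, assuming the selected lists are already available: it computes the mean pairwise Jaccard similarity and keeps only features chosen in a minimum share of runs. The function names, the 0.8 frequency cutoff, and the example runs are assumptions for illustration.

```python
from collections import Counter
from itertools import combinations

def jaccard(a, b):
    """Size of the intersection over size of the union of two feature sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

def mean_pairwise_jaccard(selections):
    """Average Jaccard similarity over all pairs of selection runs."""
    pairs = list(combinations(selections, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

def stable_features(selections, min_frequency=0.8):
    """Features selected in at least `min_frequency` of the runs."""
    counts = Counter(f for s in selections for f in set(s))
    n = len(selections)
    return sorted(f for f, c in counts.items() if c / n >= min_frequency)

# Hypothetical output of three selection runs on different resamples.
runs = [
    ["credit_score", "debt_to_income", "utilization_rate", "channel"],
    ["credit_score", "debt_to_income", "utilization_rate", "age"],
    ["credit_score", "debt_to_income", "delinquencies_12m", "utilization_rate"],
]
stability = mean_pairwise_jaccard(runs)
core = stable_features(runs)
```

A low similarity score suggests the procedure is picking between near-interchangeable features, which is often a sign that correlated variables should be pruned before model-based selection.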

Deliverables

Define a feature selection methodology combining business rules and statistical/model-based methods.
Build a baseline model and a reduced-feature model.
Compare performance using cross-validation and a final holdout test set.
Explain how you handle missing values, correlated variables, and leakage risks.
Provide the final selected feature list and justify tradeoffs between accuracy and simplicity.
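The first two stages of such a methodology can be sketched as a business-rule filter followed by greedy pruning of highly correlated features. This is a minimal illustration, not a prescribed solution: the blocked feature name post_origination_flag is hypothetical (a stand-in for any field that would leak the outcome or be unavailable at scoring time), the 0.9 correlation threshold is an assumption, and the priority order is assumed to come from the underwriting team, for example preferring raw fields over engineered ones. A model-based step such as L1-regularized logistic regression or permutation importance would follow on the surviving features.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy) if vx and vy else 0.0

def apply_business_rules(features, blocked):
    """Drop features ruled out up front (leakage, cost, availability)."""
    return [f for f in features if f not in blocked]

def prune_correlated(columns, priority, threshold=0.9):
    """Walk features in priority order and keep a feature only if its
    absolute correlation with every already-kept feature is below threshold."""
    kept = []
    for name in priority:
        if all(abs(pearson(columns[name], columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept

# Toy columns; the values are made up for illustration only.
columns = {
    "credit_score": [700, 640, 720, 610, 690, 660],
    "utilization_rate": [0.1, 0.3, 0.5, 0.7, 0.9, 0.2],
    "utilization_bucket": [1, 2, 3, 4, 5, 1],
}
candidates = ["credit_score", "utilization_rate", "utilization_bucket",
              "post_origination_flag"]  # hypothetical leaky field
allowed = apply_business_rules(candidates, blocked={"post_origination_flag"})
kept = prune_correlated(columns, priority=allowed)
```

In this toy data utilization_bucket is dropped because it is nearly a relabeling of utilization_rate, which keeps the explanation to underwriters simple: one feature per underlying signal.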
