Select Features for Loan Default

Business Context

LendWise, a mid-size digital lender processing roughly 200K personal loan applications per quarter, wants a default-risk model that is accurate, stable, and explainable to credit analysts. The current model uses nearly every available column, causing leakage risk, unstable coefficients, and poor generalization.

Dataset

You are given an offline training dataset of historical approved loans and repayment outcomes. The task is to design a feature selection strategy and train a binary classifier that predicts whether a borrower will default within 12 months of origination.

Feature Group	Count	Examples
Applicant demographics	8	age, employment_status, education_level, region
Credit bureau variables	14	fico_score, delinquencies_12m, credit_utilization, inquiries_6m
Financial variables	11	annual_income, debt_to_income, monthly_obligations, savings_balance
Loan attributes	7	loan_amount, term_months, interest_rate, purpose
Derived / noisy variables	10	application_channel_score, partner_flag, duplicate bureau aggregates

Rows: 120K loans over 24 months
Target: default_12m (1 = defaulted within 12 months, 0 = did not default)
Class balance: 14% positive, 86% negative
Missing data: 12% missing in savings/income-related fields and 6% missing in bureau variables; some categorical levels are rare

Success Criteria

A strong solution should improve generalization over a model trained on all raw features, maintain interpretable drivers for risk review, and achieve ROC-AUC >= 0.80 with PR-AUC >= 0.42 on a held-out test set.
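
The two thresholds can be checked as in the sketch below. It assumes PR-AUC is estimated with average precision (scikit-learn's average_precision_score), which is a common choice but not stated in the problem. The scores are synthetic, and meets_criteria is a hypothetical helper. With 14% positives, a no-skill model has a PR-AUC of about 0.14, so the 0.42 target is roughly three times that baseline.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(42)
n = 20_000
y_true = (rng.random(n) < 0.14).astype(int)        # ~14% positives, as in the dataset
# Hypothetical model scores: positives are shifted upward relative to negatives.
y_score = rng.normal(loc=1.2 * y_true, scale=1.0)

roc_auc = roc_auc_score(y_true, y_score)
pr_auc = average_precision_score(y_true, y_score)  # average precision as PR-AUC estimate
baseline_pr_auc = y_true.mean()                    # PR-AUC of a no-skill classifier


def meets_criteria(roc_auc, pr_auc, roc_min=0.80, pr_min=0.42):
    """Return True when both success thresholds are satisfied."""
    return roc_auc >= roc_min and pr_auc >= pr_min


print(f"ROC-AUC={roc_auc:.3f}  PR-AUC={pr_auc:.3f}  no-skill PR-AUC={baseline_pr_auc:.3f}")
print("meets criteria:", meets_criteria(roc_auc, pr_auc))
```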

Constraints

Credit analysts need a defensible explanation for why features were kept or removed
Avoid target leakage and highly unstable correlated features
Batch scoring must complete in under 5 minutes for 50K applications
Retraining happens monthly, so the pipeline must be reproducible
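
Two simple screens for the leakage and correlation constraints are sketched below. This is an illustrative approach, not the required one. The columns bureau_util_copy and collections_flag_post are hypothetical, and the 0.90 single-feature AUC and 0.95 Spearman thresholds are assumptions. A screen like this complements, but does not replace, checking whether each field is actually known at origination.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
n = 5_000
y = (rng.random(n) < 0.14).astype(int)
util = rng.random(n) + 0.2 * y
df = pd.DataFrame({
    "credit_utilization": util,
    # Hypothetical near-duplicate bureau aggregate.
    "bureau_util_copy": util + rng.normal(0, 0.01, n),
    "annual_income": rng.lognormal(11, 0.5, n),
    # Hypothetical field recorded after the outcome is known (a leak).
    "collections_flag_post": y * (rng.random(n) < 0.95),
})


def leakage_suspects(X, y, auc_threshold=0.90):
    """Flag features whose single-feature ROC-AUC is implausibly high."""
    suspects = {}
    for col in X.columns:
        auc = roc_auc_score(y, X[col])
        auc = max(auc, 1 - auc)          # ignore the direction of the relationship
        if auc >= auc_threshold:
            suspects[col] = round(auc, 3)
    return suspects


def correlated_to_drop(X, threshold=0.95):
    """For each pair with |Spearman rho| above threshold, mark the later column."""
    corr = X.corr(method="spearman").abs()
    # Keep only the strict upper triangle so each pair is examined once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [c for c in upper.columns if (upper[c] > threshold).any()]


suspects = leakage_suspects(df, y)
to_drop = correlated_to_drop(df.drop(columns=list(suspects)))
print("possible leaks:", suspects)
print("redundant:", to_drop)
```

Each removal produced this way comes with a stated reason (a threshold and a measured value), which can be recorded for analysts.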

Deliverables

Build a feature selection pipeline for mixed numerical and categorical data
Compare at least two approaches (e.g., model-based selection vs. regularized baseline)
Explain how you detect leakage, multicollinearity, and low-value features
Evaluate the final model on a held-out test set with concrete metrics
Provide the final selected feature set and rationale
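
A comparison of two selection approaches might be set up as in the sketch below. It uses synthetic data from make_classification with 50 features and about 14% positives as a stand-in for the real dataset. The specific choices (L1 logistic regression with C=0.05, random forest importances with a median threshold, a logistic final model) are assumptions, not the expected answer. Selection sits inside each pipeline so it is refit on training data only, and the seeds are fixed for reproducibility.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

SEED = 0
# Synthetic stand-in: 50 features, ~14% positives (assumed, not the real data).
X, y = make_classification(n_samples=4000, n_features=50, n_informative=8,
                           n_redundant=10, weights=[0.86, 0.14],
                           random_state=SEED)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=SEED)


def final_clf():
    # Same interpretable final model for both selectors, so only selection differs.
    return LogisticRegression(max_iter=1000, class_weight="balanced")


candidates = {
    # Regularized baseline: L1 penalty zeroes out weak coefficients.
    "l1_logistic": make_pipeline(
        StandardScaler(),
        SelectFromModel(LogisticRegression(penalty="l1", solver="liblinear",
                                           C=0.05, class_weight="balanced",
                                           random_state=SEED)),
        final_clf(),
    ),
    # Model-based selection: keep features at or above median forest importance.
    "forest_importance": make_pipeline(
        StandardScaler(),
        SelectFromModel(RandomForestClassifier(n_estimators=100,
                                               random_state=SEED),
                        threshold="median"),
        final_clf(),
    ),
}

results = {}
for name, pipe in candidates.items():
    pipe.fit(X_tr, y_tr)
    scores = pipe.predict_proba(X_te)[:, 1]
    results[name] = {
        "n_kept": int(pipe[1].get_support().sum()),  # step 1 is the selector
        "roc_auc": roc_auc_score(y_te, scores),
        "pr_auc": average_precision_score(y_te, scores),
    }

for name, r in results.items():
    print(f"{name}: kept={r['n_kept']} ROC-AUC={r['roc_auc']:.3f} PR-AUC={r['pr_auc']:.3f}")
```

The boolean mask from get_support() maps back to column names, which gives the final selected feature set to report alongside the rationale.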
