Select Features for Loan Default

Easy

Machine Learning

Business Context

LendWise, a mid-size digital lender processing about 120K consumer loan applications per month, wants a default-risk model that is accurate, explainable, and stable in production. The challenge is not only to train a classifier but also to select a feature set that improves generalization while remaining interpretable for risk and compliance teams.

Dataset

You are given a historical loan application dataset for binary classification: predict whether an approved loan will default within 12 months.

Feature Group	Count	Examples
Applicant demographics	6	age, region, employment_status
Credit bureau variables	12	credit_score, delinquency_count_12m, utilization_ratio
Financial variables	10	annual_income, debt_to_income, monthly_obligations
Application metadata	7	channel, device_type, application_hour
Engineered behavior features	9	income_to_loan_ratio, recent_inquiry_rate, credit_age_bucket

Size: 240K applications, 44 candidate features
Target: default_12m (1 = defaulted within 12 months, 0 = repaid/no default)
Class balance: 14% positive, 86% negative
Missing data: 18% missing in bureau variables for thin-file applicants, 6% missing in income-related fields
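
To make the class balance and missing-data figures concrete, here is a minimal sketch that derives the class counts from the numbers above, computes "balanced" class weights (n_samples / (n_classes * class_count)), and adds explicit missingness flags. The helper names and the two toy applicants are hypothetical; treating missing bureau data as a signal in its own right is an assumption worth validating, not a given.

```python
# Dataset figures taken from the problem statement.
N_TOTAL = 240_000
POSITIVE_RATE = 0.14

n_pos = round(N_TOTAL * POSITIVE_RATE)  # loans that defaulted within 12 months
n_neg = N_TOTAL - n_pos                 # loans that did not default


def balanced_class_weights(counts):
    """Weight each class by n_samples / (n_classes * class_count)."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


def add_missing_flags(rows, columns):
    """Return copies of rows with a 0/1 `<column>_missing` flag per column.

    A value counts as missing when the key is absent or the value is None.
    """
    flagged = []
    for row in rows:
        new_row = dict(row)
        for col in columns:
            new_row[f"{col}_missing"] = int(row.get(col) is None)
        flagged.append(new_row)
    return flagged


class_weights = balanced_class_weights({0: n_neg, 1: n_pos})

# Two toy applicants (hypothetical values): one thin-file, one without stated income.
sample_rows = [
    {"credit_score": None, "annual_income": 52_000},
    {"credit_score": 712, "annual_income": None},
]
flagged_rows = add_missing_flags(sample_rows, ["credit_score", "annual_income"])
```

With these figures the positive class has 33,600 rows and receives a weight of roughly 3.57, against roughly 0.58 for the negative class. Keeping the flags as separate columns lets a later selection step decide whether missingness itself is predictive.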

Success Criteria

A good solution should improve validation performance over a simple all-features logistic regression baseline while reducing unnecessary or redundant features. The final feature set should remain explainable enough for model risk review.
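
One way to read "reducing unnecessary or redundant features" is a simple correlation filter: rank features by their absolute correlation with the target, then skip any feature that is nearly collinear with one already kept. The sketch below is one possible filter method, not a prescribed solution; the 0.9 threshold, the function names, and the toy values are assumptions.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)


def prune_redundant(features, target, max_pairwise=0.9):
    """Greedy filter: rank by |corr with target|, skip near-duplicates of kept features."""
    ranked = sorted(
        features,
        key=lambda name: abs(pearson(features[name], target)),
        reverse=True,
    )
    kept = []
    for name in ranked:
        if all(abs(pearson(features[name], features[k])) < max_pairwise for k in kept):
            kept.append(name)
    return kept


# Toy data (hypothetical values): monthly_obligations is an exact multiple of debt_to_income.
target = [0, 0, 0, 1, 0, 1, 1, 1]
features = {
    "debt_to_income": [0.10, 0.20, 0.15, 0.50, 0.25, 0.60, 0.55, 0.70],
    "monthly_obligations": [100, 200, 150, 500, 250, 600, 550, 700],
    "application_hour": [9, 14, 22, 11, 3, 18, 7, 15],
}
kept = prune_redundant(features, target)
```

On this toy data the two collinear columns collapse to a single representative while application_hour survives the redundancy check. Pearson correlation only captures linear, numeric relationships, so categorical fields such as channel or region would need a different statistic.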

Constraints

Prefer interpretable models or interpretable feature selection outputs
Batch scoring only; latency is not strict, but retraining should be practical monthly
Avoid leakage from post-application or post-origination variables
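
The leakage constraint can be enforced mechanically if each candidate feature is tagged with the stage at which its value first becomes known. The sketch below assumes such a catalog exists; the stage labels and the two post-decision field names are hypothetical and do not come from the dataset description.

```python
# Stages whose data is known when the credit decision is made (assumed taxonomy).
ALLOWED_STAGES = {"application"}

# Hypothetical catalog: feature name -> stage at which its value is first available.
feature_catalog = {
    "credit_score": "application",
    "debt_to_income": "application",
    "application_hour": "application",
    "manual_review_outcome": "post_application",         # hypothetical field
    "days_past_due_first_payment": "post_origination",   # hypothetical field
}


def split_by_leakage(catalog, allowed_stages=ALLOWED_STAGES):
    """Separate features usable at decision time from those that would leak the outcome."""
    safe = sorted(name for name, stage in catalog.items() if stage in allowed_stages)
    leaky = sorted(name for name, stage in catalog.items() if stage not in allowed_stages)
    return safe, leaky


safe_features, leaky_features = split_by_leakage(feature_catalog)
```

Running the screen before any statistical selection keeps leaky fields from ever being ranked, which matters because they often look like the strongest predictors.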

Deliverables

Propose and implement a feature selection strategy for this dataset.
Compare at least two approaches (for example: filter, embedded, or wrapper methods).
Train a baseline and final model using the selected features.
Report validation and test metrics, and explain why features were kept or removed.
Call out risks such as leakage, multicollinearity, instability across folds, and fairness concerns.
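
Instability across folds can be quantified once a selection method has been run separately on each cross-validation fold. The sketch below assumes the per-fold selections are already available as lists of feature names; the five example folds are invented for illustration, and the 0.8 frequency cutoff is an arbitrary choice.

```python
from collections import Counter
from itertools import combinations


def selection_frequency(fold_selections):
    """Fraction of folds in which each feature was selected."""
    counts = Counter(name for selected in fold_selections for name in set(selected))
    n_folds = len(fold_selections)
    return {name: count / n_folds for name, count in counts.items()}


def mean_jaccard(fold_selections):
    """Average pairwise Jaccard similarity between the per-fold feature sets."""
    sets = [set(selected) for selected in fold_selections]
    pairs = list(combinations(sets, 2))
    if not pairs:
        return 1.0  # a single fold is trivially consistent with itself
    scores = [len(a & b) / len(a | b) if (a | b) else 1.0 for a, b in pairs]
    return sum(scores) / len(scores)


def stable_features(fold_selections, min_frequency=0.8):
    """Features selected in at least `min_frequency` of the folds."""
    freq = selection_frequency(fold_selections)
    return sorted(name for name, f in freq.items() if f >= min_frequency)


# Hypothetical selections from five folds of some selection method.
folds = [
    ["credit_score", "debt_to_income", "utilization_ratio"],
    ["credit_score", "debt_to_income", "utilization_ratio"],
    ["credit_score", "debt_to_income", "recent_inquiry_rate"],
    ["credit_score", "debt_to_income", "utilization_ratio"],
    ["credit_score", "debt_to_income", "utilization_ratio", "application_hour"],
]
frequency = selection_frequency(folds)
stability = mean_jaccard(folds)
final_features = stable_features(folds)
```

A mean Jaccard score near 1 suggests the method picks nearly the same features regardless of the fold, while features that appear in only one fold are candidates for removal. The same summary can be computed for each approach being compared, which gives a concrete basis for explaining why features were kept or dropped.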
