Improve Loan Default Prediction Features

Business Context

LendWise, a digital consumer lending platform processing ~200K loan applications per quarter, wants to improve its default-risk model without increasing approval latency. The credit team wants to understand how feature engineering affects model quality, stability, and interpretability.

Dataset

You are given an offline training dataset of historical loan applications and 12-month repayment outcomes.

Feature Group	Count	Examples
Applicant demographics	6	age, employment_length, residence_type, region
Financial variables	10	annual_income, monthly_debt, credit_utilization, revolving_balance
Credit history	8	fico_band, delinquencies_2y, inquiries_6m, oldest_trade_age_months
Loan attributes	5	loan_amount, term_months, interest_rate, purpose
Behavioral / derived raw fields	7	recent_balance_change, payment_to_income_raw, open_to_buy, utilization_trend_3m

Size: 240K applications, 36 features
Target: default_12m — whether the borrower defaulted within 12 months
Class balance: 14% default, 86% non-default
Missing data: 12% missing in employment_length, 9% in utilization_trend_3m, 4% in annual_income
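
One common way to handle the missingness listed above is to fill gaps with a training-set median and keep a flag recording that the value was absent, since missingness can itself carry signal. The sketch below is illustrative only: it assumes rows are plain dicts with None for missing values and that employment_length is numeric. The helper names fit_medians and impute_with_flags are made up for this example.

```python
from statistics import median

# Columns the dataset description reports as having missing values.
# Assumption: employment_length is numeric (e.g. years); the brief does not say.
MISSING_COLS = ["employment_length", "utilization_trend_3m", "annual_income"]


def fit_medians(train_rows, cols=MISSING_COLS):
    """Compute per-column medians from training rows only, so no test data leaks in."""
    medians = {}
    for col in cols:
        observed = [r[col] for r in train_rows if r.get(col) is not None]
        medians[col] = median(observed) if observed else 0.0
    return medians


def impute_with_flags(row, medians):
    """Return a copy of `row` with gaps filled and a `<col>_missing` 0/1 flag per column."""
    out = dict(row)
    for col, fill in medians.items():
        is_missing = row.get(col) is None
        out[f"{col}_missing"] = int(is_missing)
        if is_missing:
            out[col] = fill
    return out


# Toy rows, invented for illustration.
train = [
    {"employment_length": 2.0, "utilization_trend_3m": 0.10, "annual_income": 40_000},
    {"employment_length": None, "utilization_trend_3m": -0.05, "annual_income": 65_000},
    {"employment_length": 8.0, "utilization_trend_3m": None, "annual_income": None},
]
medians = fit_medians(train)
filled = [impute_with_flags(r, medians) for r in train]
```

With a linear model, the flag lets the coefficient for a missing value differ from the coefficient implied by the median fill, which keeps the treatment easy to explain in risk review.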

Success Criteria

A good solution should improve model performance over a raw-feature baseline through thoughtful feature engineering, while keeping the model explainable enough for risk review. Target at least a 0.03 absolute lift in ROC-AUC or a 0.05 absolute lift in PR-AUC versus a baseline logistic regression on raw inputs.
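
To make the lift target concrete, the sketch below computes both metrics from scratch and checks them against the two thresholds. It is a minimal illustration, not a prescribed evaluation harness: the function names are made up for this example, PR-AUC is approximated by average precision without special handling of tied scores, and both classes are assumed to be present. In practice a library routine such as scikit-learn's roc_auc_score and average_precision_score would normally be used instead.

```python
def roc_auc(y_true, scores):
    """ROC-AUC via the rank-sum (Mann-Whitney U) identity, using average ranks for ties."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tied group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    pos_rank_sum = sum(r for r, y in zip(ranks, y_true) if y == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def average_precision(y_true, scores):
    """Mean of the precision values observed at each positive, ranked by descending score."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, total = 0, 0.0
    for rank, idx in enumerate(order, start=1):
        if y_true[idx] == 1:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def meets_lift_target(base, cand, roc_lift=0.03, pr_lift=0.05):
    """True if the candidate clears either absolute-lift threshold from the brief."""
    return (cand["roc_auc"] - base["roc_auc"] >= roc_lift
            or cand["pr_auc"] - base["pr_auc"] >= pr_lift)


# Toy labels and scores, invented for illustration.
y = [0, 0, 1, 1]
base_scores = [0.1, 0.4, 0.35, 0.8]
cand_scores = [0.1, 0.3, 0.35, 0.8]
baseline = {"roc_auc": roc_auc(y, base_scores), "pr_auc": average_precision(y, base_scores)}
candidate = {"roc_auc": roc_auc(y, cand_scores), "pr_auc": average_precision(y, cand_scores)}
target_met = meets_lift_target(baseline, candidate)
```

With a 14% default rate, a no-skill ranker scores about 0.14 on PR-AUC and 0.5 on ROC-AUC, so PR-AUC figures should be read against that floor rather than against 0.5.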

Constraints

Batch scoring only; inference per application must stay under 50 ms
Model should remain interpretable for credit policy review
No external data sources
Avoid leakage from post-origination information
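
The leakage constraint can be enforced mechanically by restricting model inputs to an allow-list of fields known at application time. In the sketch below, the allow-list reuses the example field names from the dataset table. Treating all of them as pre-origination is an assumption that would need confirming against the data dictionary. The function name select_model_inputs and the field payments_missed_6m are hypothetical and do not come from the dataset.

```python
# Fields assumed to be known at application time. The names come from the
# example columns in the dataset table; treating every one of them as
# pre-origination is an assumption to confirm with the data owners.
APPLICATION_TIME_FIELDS = {
    "age", "employment_length", "residence_type", "region",
    "annual_income", "monthly_debt", "credit_utilization", "revolving_balance",
    "fico_band", "delinquencies_2y", "inquiries_6m", "oldest_trade_age_months",
    "loan_amount", "term_months", "interest_rate", "purpose",
    "recent_balance_change", "payment_to_income_raw", "open_to_buy",
    "utilization_trend_3m",
}
TARGET = "default_12m"


def select_model_inputs(row, allowed=APPLICATION_TIME_FIELDS, target=TARGET):
    """Return only allow-listed fields; fail loudly on anything unrecognized."""
    unexpected = sorted(set(row) - set(allowed) - {target})
    if unexpected:
        raise ValueError(f"Possible post-origination fields: {unexpected}")
    return {k: v for k, v in row.items() if k in allowed}


# The target is dropped from the inputs; allow-listed fields pass through.
clean = select_model_inputs({"age": 41, "loan_amount": 12_000, "default_12m": 0})

try:
    # `payments_missed_6m` is a hypothetical post-origination field.
    select_model_inputs({"age": 41, "payments_missed_6m": 2})
    leak_detected = False
except ValueError:
    leak_detected = True
```

Raising an error instead of silently dropping unknown columns is a deliberate choice here: a new field then has to be reviewed before it can reach the model.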

Deliverables

Build a baseline model using mostly raw features.
Design engineered features and justify why they should help.
Compare baseline vs engineered-feature performance using cross-validation and a held-out test set.
Explain which engineered features help most and why.
Discuss tradeoffs between predictive lift, complexity, and maintainability.
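
As a starting point for the engineered-feature deliverable, the sketch below derives a few affordability ratios from raw fields named in the dataset table. It is illustrative, not a recommended feature set: it assumes interest_rate is an annual percentage and that rows are plain dicts, and the helper and feature names (safe_ratio, monthly_payment, engineer, dti_pre_loan and so on) are made up for this example.

```python
def safe_ratio(numerator, denominator):
    """Divide, returning None when the denominator is missing or zero."""
    if numerator is None or not denominator:
        return None
    return numerator / denominator


def monthly_payment(loan_amount, annual_rate_pct, term_months):
    """Fixed-rate amortization payment. Assumes the rate is an annual percentage."""
    r = annual_rate_pct / 100 / 12  # monthly rate as a fraction
    if r == 0:
        return loan_amount / term_months
    return loan_amount * r / (1 - (1 + r) ** -term_months)


def engineer(row):
    """Build affordability ratios that scale debt and payment by income."""
    monthly_income = row["annual_income"] / 12 if row["annual_income"] else None
    payment = monthly_payment(row["loan_amount"], row["interest_rate"], row["term_months"])
    return {
        # Debt burden before the new loan is added.
        "dti_pre_loan": safe_ratio(row["monthly_debt"], monthly_income),
        # Debt burden once the new loan's payment is included.
        "dti_post_loan": safe_ratio(row["monthly_debt"] + payment, monthly_income),
        # Size of the loan relative to a year of income.
        "loan_to_income": safe_ratio(row["loan_amount"], row["annual_income"]),
        # Share of monthly income consumed by the new payment alone.
        "payment_to_income": safe_ratio(payment, monthly_income),
    }


# Toy application, invented for illustration.
features = engineer({
    "annual_income": 60_000, "monthly_debt": 1_000,
    "loan_amount": 12_000, "interest_rate": 12.0, "term_months": 36,
})
```

Ratios like these express debt relative to capacity to repay, which a logistic regression cannot build from the raw inputs on its own. Each one maps to a single readable coefficient for credit policy review, and the arithmetic adds negligible scoring time.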
