Classify Loan Default vs Predict Loss

Business Context

FinSure, a consumer lending platform processing ~250K loan applications per quarter, wants to standardize how its risk team chooses between classification and regression models. For each funded loan, the team tracks both whether the borrower defaults within 12 months and the total dollar loss if default occurs.

Dataset

You are given a historical loan dataset and must build two separate supervised learning models on the same feature set:

- a classification model to predict whether a loan will default, and
- a regression model to predict the expected loss amount in dollars.
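The two models differ first in what they output and in the loss they are trained on. The sketch below contrasts the two, assuming log loss for the classifier and squared error for the regressor; the problem statement does not prescribe either objective, and the toy values are hypothetical.

```python
import math

def binary_log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood for 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip so log(0) can never occur
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

def mean_squared_error(y_true, y_pred):
    """Mean squared difference between actual and predicted dollar amounts."""
    return sum((y - yhat) ** 2 for y, yhat in zip(y_true, y_pred)) / len(y_true)

# Hypothetical toy rows; the column names follow the two targets in the dataset.
default_12m = [0, 0, 1, 0, 1]
p_default = [0.05, 0.10, 0.70, 0.20, 0.40]           # classifier output: probabilities in [0, 1]
loss_amount_usd = [0.0, 0.0, 4200.0, 0.0, 9100.0]
predicted_loss = [150.0, 300.0, 3900.0, 500.0, 6000.0]  # regressor output: dollars

clf_loss = binary_log_loss(default_12m, p_default)
reg_loss = mean_squared_error(loss_amount_usd, predicted_loss)
print(f"log loss: {clf_loss:.4f}")
print(f"squared error: {reg_loss:.1f}")
```

The classifier's loss is unitless and depends only on how much probability it puts on the true class; the regressor's loss is in squared dollars and is dominated by the few large defaults.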

Feature Group	Count	Examples
Applicant demographics	6	age, employment_length, home_ownership
Credit history	8	fico_score, delinquencies_2y, revolving_utilization
Loan attributes	7	loan_amount, interest_rate, term_months, purpose
Income & affordability	5	annual_income, dti_ratio, verified_income
Behavioral / bureau flags	4	recent_inquiries, prior_defaults, bankruptcies

Size: 180K funded loans, 30 features
Targets:
- default_12m: binary target (1 = default within 12 months, 0 = no default)
- loss_amount_usd: continuous target, highly right-skewed with many zeros for non-defaulted loans
Missing data: 3-12% of values are missing in the income-verification and bureau fields
Target distribution:
- Default rate: 11.5%
- Loss amount: median $0, 95th percentile ~$8,700
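These distributions set the bar for the simple baselines. With an 11.5% default rate, always predicting "no default" is right about 88.5% of the time, so accuracy alone is uninformative. With a median loss of $0, always predicting $0 is the best constant under MAE, because the median minimizes absolute error. A minimal sketch, using hypothetical toy losses rather than the real data:

```python
from statistics import mean, median

def majority_class_accuracy(positive_rate):
    """Accuracy of always predicting the more common class."""
    return max(positive_rate, 1.0 - positive_rate)

def constant_mae(y_true, constant):
    """MAE of predicting the same dollar amount for every loan."""
    return sum(abs(y - constant) for y in y_true) / len(y_true)

# 11.5% default rate taken from the dataset description.
naive_accuracy = majority_class_accuracy(0.115)

# Hypothetical toy target: mostly zeros with a heavy right tail.
toy_losses = [0.0] * 8 + [3000.0, 9000.0]
median_baseline = constant_mae(toy_losses, median(toy_losses))
mean_baseline = constant_mae(toy_losses, mean(toy_losses))

print(f"majority-class accuracy: {naive_accuracy:.3f}")
print(median_baseline)  # 1200.0
print(mean_baseline)    # 1920.0
```

On a zero-inflated target the predict-zero baseline has an MAE equal to the mean loss, so the regression model has to beat that figure, not just the stated MAE target, to show that it has learned anything.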

Success Criteria

A strong solution should clearly explain the difference between classification and regression through model choice, outputs, loss functions, and evaluation metrics. The classification model should achieve ROC-AUC > 0.78 and the regression model should achieve MAE < $1,150 on the holdout set.
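The two thresholds measure different things: ROC-AUC scores how well the classifier ranks defaulters above non-defaulters, while MAE scores the average dollar error of the regressor. The sketch below computes ROC-AUC by direct pairwise comparison, which is fine for illustration but too slow for 180K rows, and checks both holdout criteria. The function names and toy scores are hypothetical.

```python
def roc_auc(y_true, scores):
    """Probability that a random defaulter outscores a random non-defaulter.

    Ties count as half a win. Pairwise version for clarity only.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one example of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def mean_absolute_error(y_true, y_pred):
    """Average absolute dollar error."""
    return sum(abs(y - yhat) for y, yhat in zip(y_true, y_pred)) / len(y_true)

def meets_success_criteria(auc, mae, auc_floor=0.78, mae_ceiling=1150.0):
    """Both thresholds from the success criteria are strict inequalities."""
    return auc > auc_floor and mae < mae_ceiling

# Hypothetical holdout rows.
y_default = [0, 0, 1, 0, 1, 0]
p_default = [0.10, 0.40, 0.35, 0.20, 0.80, 0.05]
y_loss = [0.0, 0.0, 4200.0, 0.0, 9100.0, 0.0]
pred_loss = [150.0, 300.0, 3900.0, 500.0, 6000.0, 0.0]

auc = roc_auc(y_default, p_default)
mae = mean_absolute_error(y_loss, pred_loss)
print(auc)  # 0.875
print(mae)  # 725.0
print(meets_success_criteria(auc, mae))  # True
```

A constant score gives a ROC-AUC of 0.5 under this definition, which is the natural baseline for the 0.78 target.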

Constraints

- Risk analysts need interpretable outputs and stable feature effects
- Batch scoring must finish in under 10 minutes for 250K rows
- The approach should be simple enough to retrain monthly
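The batch-scoring constraint translates into a minimum throughput of roughly 417 rows per second (250,000 rows in 600 seconds). One way to check a candidate model against it is to time scoring on a sample and extrapolate linearly, as sketched below. The helper names and the stand-in scoring function are hypothetical, and linear scaling is an assumption.

```python
import time

def required_rows_per_second(n_rows, budget_seconds):
    """Minimum sustained throughput needed to finish within the budget."""
    return n_rows / budget_seconds

def projected_seconds(score_fn, sample_rows, n_rows):
    """Time score_fn on a sample and extrapolate linearly to n_rows."""
    start = time.perf_counter()
    score_fn(sample_rows)
    elapsed = time.perf_counter() - start
    return elapsed * n_rows / len(sample_rows)

def dummy_score(rows):
    """Stand-in for a real model's batch predict call."""
    return [0.0 for _ in rows]

needed = required_rows_per_second(250_000, 10 * 60)
estimate = projected_seconds(dummy_score, list(range(1_000)), 250_000)
print(f"need at least {needed:.0f} rows/s")
print(f"within budget: {estimate < 10 * 60}")
```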

Deliverables

- Train one classification model for default_12m and one regression model for loss_amount_usd
- Explain why each target requires a different modeling setup and metric suite
- Build preprocessing for mixed numeric/categorical data with missing values
- Evaluate both models on a holdout test set and compare against simple baselines
- Summarize when the business should use the classification output vs the regression output
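For the preprocessing deliverable, one simple and interpretable option is median imputation with a missing-value indicator for numeric fields and one-hot encoding for categorical fields, with all statistics learned from the training split only. The sketch below is a plain-Python illustration under those assumptions; the function names and toy rows are hypothetical, and in practice a library pipeline would normally replace it.

```python
from statistics import median

def fit_preprocessor(rows, numeric_cols, categorical_cols):
    """Learn medians and category levels from training rows only.

    Assumes every numeric column has at least one observed value.
    """
    medians = {}
    for col in numeric_cols:
        observed = [r[col] for r in rows if r.get(col) is not None]
        medians[col] = median(observed)
    levels = {}
    for col in categorical_cols:
        levels[col] = sorted({r[col] for r in rows if r.get(col) is not None})
    return {"medians": medians, "levels": levels}

def transform(row, prep):
    """Impute, flag missingness, and one-hot encode a single row."""
    out = {}
    for col, med in prep["medians"].items():
        value = row.get(col)
        out[col] = med if value is None else value
        out[col + "_missing"] = 1 if value is None else 0
    for col, lvls in prep["levels"].items():
        value = row.get(col)
        for lvl in lvls:
            # Unseen or missing categories encode as all zeros.
            out[f"{col}={lvl}"] = 1 if value == lvl else 0
    return out

# Hypothetical training rows using feature names from the dataset table.
train_rows = [
    {"annual_income": 52000, "fico_score": 690, "home_ownership": "RENT"},
    {"annual_income": None, "fico_score": 720, "home_ownership": "OWN"},
    {"annual_income": 88000, "fico_score": 655, "home_ownership": "MORTGAGE"},
]
prep = fit_preprocessor(train_rows, ["annual_income", "fico_score"], ["home_ownership"])
encoded = transform(train_rows[1], prep)
print(encoded)
```

The same fitted `prep` object feeds both the classifier and the regressor, which keeps the feature set identical across the two models as the task requires.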
