LendWise, a mid-size digital lender processing ~120K personal loan applications per month, wants a default-risk model for pre-approval decisions. The current rule-based system drops rows with null values, reducing coverage and potentially biasing decisions.
You are given a historical application dataset for binary classification: predict whether an applicant will default within 12 months of origination.
| Feature Group | Count | Examples |
|---|---|---|
| Applicant demographics | 6 | age, employment_status, education_level, region |
| Financial profile | 9 | annual_income, debt_to_income, revolving_utilization, existing_loans |
| Credit bureau | 7 | credit_score, delinquency_count_12m, inquiries_6m, oldest_trade_age |
| Application metadata | 5 | channel, requested_amount, loan_term, application_hour |
| Derived flags | 4 | income_missing_flag, bureau_missing_flag, self_employed_flag |
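The derived flags in the last group can be built directly from nullness of the source fields. A minimal sketch, assuming the applications are loaded into a pandas DataFrame `df` (the toy frame below is a hypothetical stand-in; column names follow the feature table above):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the application data.
df = pd.DataFrame({
    "annual_income": [52_000.0, np.nan, 71_000.0, np.nan],
    "credit_score": [680.0, 712.0, np.nan, 655.0],
})

# Derived flags: 1 where the source field is null, 0 otherwise.
df["income_missing_flag"] = df["annual_income"].isna().astype(int)
df["bureau_missing_flag"] = df["credit_score"].isna().astype(int)

print(df[["income_missing_flag", "bureau_missing_flag"]].to_dict("list"))
```

Keeping these flags as model inputs lets the classifier learn whether missingness itself is predictive of default, which row-dropping discards entirely.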
**Target:** `default_12m` (1 = default, 0 = non-default)

**Missingness:**

- `annual_income`: 18% missing
- `credit_score`: 11% missing
- `employment_status`: 6% missing
- `debt_to_income`: 9% missing

**Success criteria:** A good solution should improve over both the row-dropping and simple mean-imputation baselines, achieve ROC-AUC >= 0.78, and maintain stable performance across applicants with and without missing bureau data.