Select Credit Risk Features

Business Context

NorthStar Bank wants to improve a loan default model used for pre-approval decisions on ~250K consumer loan applications per month. The current model has acceptable AUC but is difficult to maintain because it uses too many weak, redundant, and unstable features.

Dataset

You are given a historical tabular dataset of loan applications and 12-month repayment outcomes. The goal is to design a feature selection strategy that improves generalization, preserves interpretability for risk analysts, and avoids leakage.

Feature Group	Count	Examples
Applicant demographics	8	age, employment_status, residential_status
Financial attributes	14	annual_income, debt_to_income, revolving_utilization
Credit bureau variables	19	delinquency_count_12m, inquiries_6m, oldest_trade_age
Application metadata	7	channel, product_type, requested_amount
Engineered aggregates	12	income_per_open_trade, utilization_trend_3m

Size: 420K applications, 60 candidate features
Target: Binary (default within 12 months)
Class balance: 11.5% default, 88.5% non-default
Missing data: 18% missing in bureau variables for thin-file applicants; 6% missing in income-related fields
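
Because missingness in bureau variables is concentrated in thin-file applicants, it may itself carry signal. One common option is to keep an explicit missing flag next to a filled value. The following is a minimal sketch, assuming missing values are represented as None and that a constant fill is acceptable. The helper names, the fill value, and the toy rows are illustrative and are not part of the dataset.

```python
from typing import Optional

Row = dict[str, Optional[float]]


def missing_rate(rows: list[Row], feature: str) -> float:
    """Fraction of rows where `feature` is None (treated as missing)."""
    if not rows:
        return 0.0
    return sum(r.get(feature) is None for r in rows) / len(rows)


def add_missing_flag(rows: list[Row], feature: str, fill: float = 0.0) -> list[Row]:
    """Return copies of the rows with a `<feature>_missing` flag and a filled value."""
    out = []
    for r in rows:
        is_missing = r.get(feature) is None
        new = dict(r)  # copy so the input rows are left untouched
        new[f"{feature}_missing"] = 1.0 if is_missing else 0.0
        new[feature] = fill if is_missing else r[feature]
        out.append(new)
    return out


# Toy rows for illustration only (not the bank's data).
rows = [
    {"inquiries_6m": 2.0, "annual_income": 54000.0},
    {"inquiries_6m": None, "annual_income": 61000.0},
    {"inquiries_6m": 0.0, "annual_income": None},
    {"inquiries_6m": 1.0, "annual_income": 47000.0},
]
flagged = add_missing_flag(rows, "inquiries_6m")
```

If the flag is used, it counts as a candidate feature in its own right and should go through the same selection and review steps as the others.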

Success Criteria

A strong solution should:

Improve validation performance over a regularized logistic regression baseline
Reduce the feature set to a smaller, defensible subset without materially hurting recall
Provide a repeatable selection process that risk and compliance teams can review
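
The recall criterion can be turned into an explicit, reviewable check. The sketch below compares the recall of a reduced-feature model with the baseline on the same labels. The tolerance `max_drop` is a hypothetical placeholder, since the problem statement does not define what counts as a material drop, and the label and prediction vectors are toy values.

```python
def recall(y_true: list[int], y_pred: list[int]) -> float:
    """Share of actual defaults (label 1) that the model flags as default."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn) if (tp + fn) else 0.0


def recall_preserved(
    y_true: list[int],
    baseline_pred: list[int],
    reduced_pred: list[int],
    max_drop: float = 0.01,  # hypothetical tolerance, to be agreed with risk teams
) -> bool:
    """True if the reduced model loses at most `max_drop` recall versus baseline."""
    drop = recall(y_true, baseline_pred) - recall(y_true, reduced_pred)
    return drop <= max_drop


# Toy labels and predictions for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
baseline_pred = [1, 1, 1, 0, 0, 1, 0, 0]
reduced_ok = [1, 1, 1, 0, 0, 0, 0, 0]
reduced_bad = [1, 1, 0, 0, 0, 0, 0, 0]

ok = recall_preserved(y_true, baseline_pred, reduced_ok)
bad = recall_preserved(y_true, baseline_pred, reduced_bad)
```

In practice the same check would be run at the decision threshold used for pre-approval and on held-out data, not on the data used to select features.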

Constraints

Final model must remain interpretable enough for adverse action reasoning
Batch scoring latency must stay under 50 ms per application
Features unavailable at application time cannot be used
The bank prefers a stable feature set retrained quarterly, not weekly
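
The availability constraint lends itself to a mechanical leakage screen, run before any statistical selection. The sketch below assumes a data dictionary that records the pipeline stage at which each field is first populated. The stage names and the two servicing-stage fields are hypothetical and are not taken from the dataset description. Fields with an unrecognized stage are rejected so that the check fails closed.

```python
# Hypothetical stages whose fields are known when the application is scored.
STAGES_KNOWN_AT_APPLICATION = {"application_form", "bureau_pull"}


def split_by_availability(data_dictionary: dict[str, str]) -> tuple[list[str], list[str]]:
    """Split feature names into (allowed, rejected) by the stage that populates them.

    Any stage outside STAGES_KNOWN_AT_APPLICATION, including an unknown one,
    sends the feature to the rejected list.
    """
    allowed, rejected = [], []
    for name, stage in sorted(data_dictionary.items()):
        if stage in STAGES_KNOWN_AT_APPLICATION:
            allowed.append(name)
        else:
            rejected.append(name)
    return allowed, rejected


# Illustrative data dictionary; the stage labels are assumptions.
data_dictionary = {
    "annual_income": "application_form",
    "requested_amount": "application_form",
    "inquiries_6m": "bureau_pull",
    "days_past_due_max": "servicing",  # hypothetical post-origination field
    "collections_flag": "servicing",   # hypothetical post-origination field
}
allowed, rejected = split_by_availability(data_dictionary)
```

Keeping the rejected list, with the reason for each rejection, gives compliance reviewers a record of what was excluded and why.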

Deliverables

Define a feature selection framework, including leakage checks and handling of correlated variables.
Build a baseline and at least one selected-feature model.
Compare filter, embedded, or wrapper-style selection methods and justify the final choice.
Report evaluation metrics on a held-out test set.
Summarize the final selected features and explain why they were retained or removed.
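
As one possible starting point for the handling of correlated variables, the sketch below implements a simple filter-style step: rank features by a univariate relevance score, then greedily keep a feature only if it is not highly correlated with one already kept. Absolute Pearson correlation with the target stands in for relevance here purely for brevity; information value or univariate AUC would be equally reasonable choices. The 0.8 threshold and the toy columns are assumptions.

```python
import math


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def prune_correlated(
    columns: dict[str, list[float]],
    target: list[float],
    threshold: float = 0.8,  # assumed cut-off for "highly correlated"
) -> tuple[list[str], dict[str, str]]:
    """Greedy pruning: visit features from most to least relevant and drop any
    feature whose |correlation| with an already-kept feature is >= threshold.

    Returns (kept, dropped), where dropped maps each removed feature to the
    kept feature that displaced it.
    """
    ranked = sorted(
        columns, key=lambda f: abs(pearson(columns[f], target)), reverse=True
    )
    kept: list[str] = []
    dropped: dict[str, str] = {}
    for f in ranked:
        clash = next(
            (k for k in kept if abs(pearson(columns[f], columns[k])) >= threshold),
            None,
        )
        if clash is None:
            kept.append(f)
        else:
            dropped[f] = clash
    return kept, dropped


# Toy columns for illustration only (not the bank's data).
target = [0, 0, 0, 0, 1, 1, 1, 1]
columns = {
    "debt_to_income": [0.10, 0.20, 0.15, 0.30, 0.50, 0.60, 0.55, 0.70],
    "revolving_utilization": [0.12, 0.22, 0.18, 0.28, 0.52, 0.63, 0.50, 0.72],
    "inquiries_6m": [1, 3, 0, 2, 2, 0, 3, 1],
}
kept, dropped = prune_correlated(columns, target)
```

The `dropped` mapping records which retained feature displaced each removed one, which can feed directly into the summary of retained and removed features. To avoid optimistic bias, a step like this would be fitted on training data only, and its output compared against embedded or wrapper-style alternatives on the validation set.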

