Business Context
You’re on the Risk Modeling team at LendFlow, a fintech lender offering instant point-of-sale loans for e-commerce checkouts. The model’s prediction is used to (a) approve/decline applications and (b) set credit limits. LendFlow processes ~2.5M applications/month across the US and EU, and a 10 bps degradation in default rate translates to ~$4–6M/year in charge-offs. Regulators and internal audit require that decisions are explainable and that the model is monitored for drift and bias.
Your current baseline is a tuned logistic regression. Leadership wants you to evaluate whether a Deep Neural Network (DNN) can outperform XGBoost on this tabular dataset, and if so, whether the operational and compliance trade-offs are worth it.
Dataset
You have 12 months of historical applications with outcomes observed over a 90-day window.
| Feature Group | Count | Examples | Notes |
|---|---|---|---|
| Applicant demographics | 8 | age_bucket, region, employment_status | Some features restricted in EU; must support feature gating |
| Credit bureau aggregates | 22 | tradelines_count, utilization_pct, delinquencies_12m | Strong predictors; missing for ~6% (thin-file) |
| Transaction & bank-link | 18 | avg_balance_30d, inflow_std_90d, nsf_events_90d | Missing for ~35% (user didn’t link bank) |
| Merchant & product | 10 | merchant_category, cart_amount, sku_risk_score | High-cardinality categorical |
| Device & fraud signals | 12 | device_age_days, ip_risk_score, velocity_1h | Noisy; distribution shifts during promos |
- Size: ~30M rows, 70 features (after basic cleaning)
- Target: Binary — default_90d (1 if charged-off or 90+ DPD within 90 days, else 0)
- Class balance: Imbalanced — ~2.2% positive
- Missing data: Structured missingness (bank-link missing not at random), plus sporadic nulls in bureau fields
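Because bank-link missingness is not at random (not linking is itself a risk signal), both model families benefit from explicit missing-indicator flags rather than silent imputation. A minimal sketch using pandas and column names from the feature table above; the `<col>_missing` naming convention is an assumption:

```python
import numpy as np
import pandas as pd

def add_missing_indicators(df: pd.DataFrame, cols: list[str]) -> pd.DataFrame:
    """Add a <col>_missing flag per column, preserving the fact of
    non-linkage as a feature in its own right (MNAR bank-link fields)."""
    out = df.copy()
    for col in cols:
        out[f"{col}_missing"] = out[col].isna().astype(np.int8)
    return out

# Illustrative rows: the second applicant did not link a bank account.
df = pd.DataFrame({
    "avg_balance_30d": [1200.0, np.nan, 310.5],
    "nsf_events_90d": [0.0, np.nan, 2.0],
})
df = add_missing_indicators(df, ["avg_balance_30d", "nsf_events_90d"])
```

After flagging, the remaining NaNs can be imputed (or left as-is for XGBoost, which handles missing values natively) without losing the linkage signal.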
Success Criteria
- Risk performance: Improve ranking quality, measured by AUC-PR and recall at a fixed FPR.
- Operating point: At FPR = 3%, achieve Recall ≥ 55% on a held-out month.
- Calibration: Predicted probabilities must be usable for limit setting; ECE ≤ 0.02 after calibration.
- Latency: p95 < 40 ms per application for online scoring (model-only; features are precomputed).
- Explainability: Provide reason codes for adverse action (top contributing features) and a defensible approach for audit.
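The operating-point and calibration targets can be checked with two small metric helpers. A minimal numpy sketch; the strict-threshold choice (which guarantees FPR stays at or below the cap, at the cost of slightly conservative recall under score ties) and the 10-bin ECE are illustrative assumptions, not official definitions:

```python
import numpy as np

def recall_at_fpr(y_true, scores, max_fpr=0.03):
    """Recall (TPR) at the highest threshold whose FPR stays <= max_fpr.
    Assumes higher score = higher predicted default risk."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    neg = np.sort(scores[y_true == 0])[::-1]   # negative scores, descending
    k = int(np.floor(max_fpr * neg.size))      # false positives allowed
    thresh = neg[k] if k < neg.size else -np.inf
    flagged = scores > thresh                  # strict > keeps FPR <= max_fpr
    positives = y_true == 1
    return flagged[positives].mean() if positives.any() else 0.0

def expected_calibration_error(y_true, probs, n_bins=10):
    """Equal-width-bin ECE: |mean predicted prob - observed default rate|
    per bin, weighted by the fraction of applications in that bin."""
    y_true = np.asarray(y_true, dtype=float)
    probs = np.asarray(probs, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bins = np.digitize(probs, edges[1:-1])     # bin index 0..n_bins-1
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            ece += mask.mean() * abs(probs[mask].mean() - y_true[mask].mean())
    return ece
```

Both metrics should be computed on the held-out month only, after any post-hoc calibration step, since calibration changes the probability scale but not the ranking.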
Constraints
- Temporal leakage risk: Outcomes occur after application; features must be strictly available at decision time.
- Non-stationarity: Merchant mix and fraud signals drift during seasonal events.
- Compute budget: Training must finish in < 6 hours on a single GPU box or a 32-core CPU machine.
- Deployment: Model served from a Python microservice; ONNX/TensorRT export is preferred but not required.
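The temporal-leakage constraint rules out random splits: training, validation, and test sets must be separated by application date, and rows whose 90-day outcome window has not closed by the label-extraction date must be dropped. A pandas sketch; the cut dates, column name, and `label_cutoff` are illustrative assumptions:

```python
import pandas as pd

def temporal_split(df, date_col="application_date",
                   train_end="2024-09-30", valid_end="2024-10-31",
                   label_cutoff="2025-03-31"):
    """Strict time-based split: train up to train_end, validate on
    (train_end, valid_end], test on everything after. Rows whose 90-day
    outcome window has not closed by label_cutoff are dropped entirely,
    since their default_90d label is not yet observable."""
    d = pd.to_datetime(df[date_col])
    observable = d + pd.Timedelta(days=90) <= pd.Timestamp(label_cutoff)
    d, df = d[observable], df[observable]
    train = df[d <= pd.Timestamp(train_end)]
    valid = df[(d > pd.Timestamp(train_end)) & (d <= pd.Timestamp(valid_end))]
    test = df[d > pd.Timestamp(valid_end)]
    return train, valid, test
```

Evaluating on a whole held-out month (rather than a random holdout) also surfaces the merchant-mix and seasonal drift named above, which a shuffled split would hide.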
Deliverables (what you must produce)
- A clear recommendation: XGBoost vs DNN (or a hybrid), tied to the constraints above.
- A training/evaluation plan that avoids leakage (time-based split) and handles imbalance.
- Feature preprocessing strategy for both models (categoricals, missingness, scaling).
- A thresholding and calibration approach to hit the operating point.
- A production plan: inference architecture, monitoring, and retraining cadence.
- A brief discussion of interpretability: SHAP vs integrated gradients vs surrogate models, and how you’d generate reason codes.
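For the reason-code deliverable, whichever attribution method is chosen (SHAP values for trees, integrated gradients for a DNN), the last step is the same: rank per-application contributions and keep the features that pushed the score toward decline. A minimal sketch over precomputed attributions; the sign convention (positive = toward default) and the top-4 cutoff are assumptions:

```python
import numpy as np

def reason_codes(feature_names, contributions, top_k=4):
    """Map per-feature attribution scores for one application to adverse-action
    reason codes. Assumed convention: a positive contribution pushes the
    predicted default probability up, i.e. toward decline; only those qualify."""
    contributions = np.asarray(contributions, dtype=float)
    order = np.argsort(-contributions)  # largest contribution first
    return [feature_names[i] for i in order[:top_k] if contributions[i] > 0]

# Illustrative attributions for one declined application.
names = ["utilization_pct", "nsf_events_90d", "avg_balance_30d", "device_age_days"]
phi = [0.8, 0.3, -0.2, 0.05]
codes = reason_codes(names, phi)
```

In production, each raw feature name would then be mapped to an approved adverse-action phrase before it reaches the applicant.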