Business Context
HomePricePredictor, a real estate analytics firm, aims to improve its predictive model for housing prices in urban areas. The current model shows high variance, resulting in poor generalization to unseen data. The goal is to implement strategies to reduce overfitting and enhance model performance.
Dataset
| Feature Group | Count | Examples |
|---|
| Numerical | 10 | square_footage, num_bedrooms, num_bathrooms, age_of_house |
| Categorical | 5 | neighborhood, property_type, school_rating, has_pool |
| Temporal | 1 | date_sold |
- Size: 20,000 records, 16 features
- Target: Continuous — sale price of houses
- Class balance: N/A (regression problem)
- Missing data: 10% missing values in
school_rating, 5% in has_pool
Requirements
- Implement a regression model to predict housing prices while mitigating overfitting.
- Use LASSO regression to encourage sparsity and reduce model complexity.
- Apply k-fold cross-validation to evaluate the model's performance.
- Analyze and report the impact of regularization on feature selection.
- Provide a clear explanation of hyperparameter tuning for LASSO.
Constraints
- The model must achieve an RMSE of less than $15,000 on the validation set.
- The solution should be scalable to handle larger datasets in the future.
- The implementation should be well-documented for future developers.