Business Context
HomeValue Analytics builds automated valuation models for regional real estate platforms. The pricing team wants a regression model that generalizes well to new listings and avoids overfitting on sparse, high-dimensional property features.
Dataset
You are given a tabular dataset of residential home sales from a mid-sized U.S. metro area.
| Feature Group | Count | Examples |
|---|---|---|
| Numerical property features | 18 | square_feet, lot_size, year_built, bedrooms, bathrooms |
| Categorical location/structure features | 9 | neighborhood, exterior_type, heating_type, condition_grade |
| Engineered listing attributes | 11 | age_of_home, price_per_sqft_neighborhood_avg, renovation_flag |
| Sparse binary amenities | 22 | pool, garage, basement, solar, waterfront |
- Size: 24K home sales, 60 input features after basic cleaning
- Target: Continuous — final sale price in USD
- Missing data: 8% missing in lot_size, 12% in renovation_year, and 3–5% in several categorical fields
- Data characteristics: Moderate multicollinearity across size/location features and a long-tailed price distribution
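Because the sale-price target is long-tailed, a common approach is to fit the model on log-transformed prices and invert the transform at prediction time. A minimal sketch using scikit-learn's `TransformedTargetRegressor` on synthetic data (the data-generating line is illustrative, not the real sales table):

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
# Synthetic long-tailed prices: exponentiating a linear signal mimics
# the skewed dollar amounts in the real data (hypothetical coefficients).
y = np.expm1(0.5 * X[:, 0] + 0.2 * X[:, 1] + 12 + rng.normal(scale=0.1, size=200))

model = TransformedTargetRegressor(
    regressor=Ridge(alpha=1.0),
    func=np.log1p,          # fit on log(1 + price)
    inverse_func=np.expm1,  # predictions come back in dollars
)
model.fit(X, y)
preds = model.predict(X)
```

The wrapper keeps the log transform inside the estimator, so cross-validation and holdout RMSE are still computed on the original dollar scale if you invert before scoring.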
Success Criteria
A good solution should:
- Beat an unregularized linear regression baseline on holdout RMSE by at least 8%
- Achieve stable cross-validation performance with low train/validation gap
- Explain when L1, L2, and Elastic Net regularization are useful
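For reference, the three penalized objectives, written here in scikit-learn's parameterization (α is the regularization strength, ρ the `l1_ratio`):

```latex
% Ridge (L2): shrinks all coefficients smoothly toward zero
\min_w \; \|y - Xw\|_2^2 + \alpha \|w\|_2^2

% Lasso (L1): can drive some coefficients exactly to zero (feature selection)
\min_w \; \tfrac{1}{2n} \|y - Xw\|_2^2 + \alpha \|w\|_1

% Elastic Net: convex mix of both penalties
\min_w \; \tfrac{1}{2n} \|y - Xw\|_2^2
  + \alpha \left( \rho \|w\|_1 + \tfrac{1-\rho}{2} \|w\|_2^2 \right)
```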
Constraints
- Model must remain interpretable enough for pricing analysts
- Batch inference only; latency is not critical
- Training should run on a standard laptop in under 10 minutes
Deliverables
- Explain what regularization is and why it is useful in regression.
- Build and compare Linear Regression, Ridge, Lasso, and Elastic Net models.
- Use a leakage-safe preprocessing pipeline for missing values, scaling, and encoding.
- Tune regularization strength with cross-validation and report holdout performance.
- Interpret coefficient behavior and discuss the bias-variance tradeoff for each model.