Business Context
HomeValue, a residential real-estate marketplace, wants a simple and interpretable model to estimate apartment sale prices for listing guidance in one major city. The pricing team needs a baseline regression model that can be retrained weekly and explained to non-technical stakeholders.
Dataset
You are given a historical dataset of apartment sales from the last 24 months.
| Feature Group | Count | Examples |
|---|
| Property attributes | 8 | square_feet, bedrooms, bathrooms, floor_number, building_age_years |
| Location | 5 | neighborhood, distance_to_metro_km, school_rating, zip_code, latitude_bucket |
| Listing metadata | 4 | days_on_market, renovated_flag, parking_spaces, hoa_fee |
| Market context | 3 | month_of_sale, mortgage_rate, local_price_index |
- Size: 52K sold apartments, 20 input features
- Target: Final sale price in USD
- Feature types: Numerical and categorical
- Missing data: ~6% missing in
hoa_fee, ~4% in school_rating, ~2% in parking_spaces
- Outliers: Luxury penthouses create a long right tail in price
Success Criteria
A solution is considered good enough if it:
- Achieves RMSE below $45,000 on a held-out test set
- Achieves MAE below $28,000
- Produces interpretable coefficients or feature effects for the pricing team
Constraints
- Prefer a linear-model-based approach for interpretability
- Training should complete in minutes on a laptop
- Batch inference for 10K listings should finish in under 1 second
- The solution must avoid target leakage from post-sale fields
Deliverables
- Build a linear regression pipeline in Python for price prediction.
- Preprocess missing values and categorical variables correctly.
- Evaluate the model using regression metrics on train/validation/test splits.
- Explain how you would interpret coefficients and diagnose underfitting or multicollinearity.
- Suggest at least one improvement beyond plain linear regression, such as Ridge or Lasso regularization.