Business Context
RedfinNow wants a regression model to estimate U.S. residential sale prices for seller offers. The pricing team needs a tuned Gradient Boosting model that improves valuation accuracy while remaining fast enough for daily batch scoring across new listings.
Dataset
You are given a historical property sales dataset covering 24 months of transactions from 12 metro areas.
| Feature Group | Count | Examples |
|---|---|---|
| Property attributes | 14 | square_feet, lot_size, bedrooms, bathrooms, year_built |
| Location | 9 | zip_code, school_rating, distance_to_downtown, crime_index |
| Listing context | 8 | days_on_market, listing_month, seller_type, renovation_flag |
| Market signals | 7 | neighborhood_median_price_30d, inventory_level, mortgage_rate |
| Engineered fields | 6 | price_per_sqft_area_avg, home_age, sqft_per_bedroom |
- Size: 148K home sales, 44 input features
- Target: Final sale price in USD
- Target distribution: Right-skewed, with luxury outliers above $2.5M
- Missing data: 12% missing in renovation_flag and lot_size, 4% missing in school_rating, sparse missingness elsewhere
Success Criteria
A good solution should:
- Achieve RMSE below $42K on the held-out test set
- Reduce RMSE by at least 12% relative to a regularized linear baseline
- Provide a clear, leakage-safe tuning process for Gradient Boosting regression
- Explain which hyperparameters matter most and how overfitting is controlled
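The two numeric criteria above can be checked mechanically. The following is a minimal sketch, assuming the 12% figure means a relative reduction, (baseline RMSE - model RMSE) / baseline RMSE. The function names and the example RMSE values are hypothetical and not part of the brief.

```python
import math

RMSE_TARGET_USD = 42_000        # absolute threshold from the success criteria
MIN_RELATIVE_REDUCTION = 0.12   # at least 12% better than the linear baseline


def rmse(y_true, y_pred):
    """Root mean squared error over two equal-length sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and the same length")
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def meets_criteria(model_rmse, baseline_rmse):
    """Return (passes, relative_reduction) for the two RMSE criteria."""
    reduction = (baseline_rmse - model_rmse) / baseline_rmse
    passes = model_rmse < RMSE_TARGET_USD and reduction >= MIN_RELATIVE_REDUCTION
    return passes, reduction


# Hypothetical RMSE values, for illustration only.
passes, reduction = meets_criteria(model_rmse=41_000, baseline_rmse=47_000)
print(passes, round(reduction, 4))
```

Note that the two criteria are independent: a model at $41K RMSE passes the absolute threshold but would fail the relative one if the baseline were already at $45K.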
Constraints
- Batch inference for 80K listings must finish in under 10 minutes
- The business wants partial interpretability through feature importances, partial dependence plots (PDP), or SHAP summaries
- Retraining will happen monthly, so tuning cannot require excessive compute
- No target leakage from post-listing or post-sale fields
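The batch-inference constraint translates into a throughput floor: 80,000 listings in 600 seconds is roughly 133 rows per second. The sketch below shows one way to project a timed sample onto the full batch. It assumes scoring time scales linearly with row count, and the helper names and the 2x safety factor are illustrative choices rather than requirements from the brief.

```python
BATCH_ROWS = 80_000            # listings per batch, from the constraints
TIME_LIMIT_SECONDS = 10 * 60   # 10-minute budget, from the constraints


def required_rows_per_second(rows=BATCH_ROWS, limit_s=TIME_LIMIT_SECONDS):
    """Minimum sustained throughput needed to finish inside the budget."""
    return rows / limit_s


def projected_batch_seconds(sample_rows, sample_seconds, rows=BATCH_ROWS):
    """Linearly extrapolate a timed sample to the full batch size."""
    if sample_rows <= 0 or sample_seconds < 0:
        raise ValueError("sample_rows must be positive and sample_seconds non-negative")
    return sample_seconds * rows / sample_rows


def within_budget(sample_rows, sample_seconds, safety_factor=2.0):
    """True if the projected time, padded by safety_factor, fits the budget."""
    projected = projected_batch_seconds(sample_rows, sample_seconds)
    return projected * safety_factor <= TIME_LIMIT_SECONDS


print(round(required_rows_per_second(), 1))   # rows per second needed
print(within_budget(sample_rows=5_000, sample_seconds=2.0))
```

The timed sample should include preprocessing as well as model prediction, since the 10-minute limit applies to the whole scoring job.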
Deliverables
- Build a preprocessing and training pipeline for a Gradient Boosting regressor.
- Define a train/validation/test strategy and justify it.
- Tune key hyperparameters and explain the search space.
- Evaluate with regression metrics and compare against a baseline.
- Summarize feature importance, error patterns, and production tradeoffs.