Tune Gradient Boosting for Home Prices

Business Context

RedfinNow wants a regression model to estimate U.S. residential sale prices for seller offers. The pricing team needs a tuned Gradient Boosting model that improves valuation accuracy while remaining fast enough for daily batch scoring across new listings.

Dataset

You are given a historical property sales dataset covering 24 months of transactions from 12 metro areas.

Feature Group	Count	Examples
Property attributes	14	square_feet, lot_size, bedrooms, bathrooms, year_built
Location	9	zip_code, school_rating, distance_to_downtown, crime_index
Listing context	8	days_on_market, listing_month, seller_type, renovation_flag
Market signals	7	neighborhood_median_price_30d, inventory_level, mortgage_rate
Engineered fields	6	price_per_sqft_area_avg, home_age, sqft_per_bedroom

Size: 148K home sales, 44 input features
Target: Final sale price in USD
Target distribution: Right-skewed, with luxury outliers above $2.5M
Missing data: 12% missing in renovation_flag and lot_size, 4% missing in school_rating, sparse missingness elsewhere
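The skew and missingness notes above suggest two quick checks before modeling: measuring per-column missing rates and training on a log-transformed price. The following is a minimal sketch using only the standard library; the sample rows and the `sale_price` key are hypothetical stand-ins, not values from the dataset, and a log target is one common option rather than a requirement of the problem.

```python
import math

def missing_rate(rows, column):
    """Fraction of rows where `column` is absent or None."""
    missing = sum(1 for row in rows if row.get(column) is None)
    return missing / len(rows)

def to_log_price(price_usd):
    """Compress the right tail: log(1 + price)."""
    return math.log1p(price_usd)

def from_log_price(log_price):
    """Invert to_log_price so errors can be reported in USD."""
    return math.expm1(log_price)

# Hypothetical rows for illustration only.
sample_rows = [
    {"lot_size": 5200, "renovation_flag": 1, "sale_price": 410_000},
    {"lot_size": None, "renovation_flag": 0, "sale_price": 365_000},
    {"lot_size": 9800, "renovation_flag": None, "sale_price": 2_750_000},
    {"lot_size": 4100, "renovation_flag": 0, "sale_price": 298_000},
]

lot_size_missing = missing_rate(sample_rows, "lot_size")
log_prices = [to_log_price(row["sale_price"]) for row in sample_rows]
```

If the model is trained on log prices, predictions should be mapped back with `from_log_price` before computing RMSE, since the success criteria are stated in dollars.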

Success Criteria

A good solution should:

Achieve RMSE below $42K on the held-out test set
Reduce RMSE by at least 12% relative to a regularized linear baseline
Provide a clear, leakage-safe tuning process for Gradient Boosting regression
Explain which hyperparameters matter most and how overfitting is controlled
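The two numeric criteria above can be checked mechanically. This sketch assumes the 12% figure is a relative reduction, (baseline RMSE - model RMSE) / baseline RMSE; the function names are illustrative, and the RMSE values in the usage lines are made up.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error in the units of the target (USD here)."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def meets_criteria(model_rmse, baseline_rmse, rmse_cap=42_000.0, min_reduction=0.12):
    """Return (passed, relative_reduction) for the two numeric criteria."""
    reduction = (baseline_rmse - model_rmse) / baseline_rmse
    passed = model_rmse < rmse_cap and reduction >= min_reduction
    return passed, reduction

# Made-up numbers: a $40K model against a $48K baseline.
passed, reduction = meets_criteria(40_000.0, 48_000.0)
```

Both conditions must hold: a model under the $42K cap can still fail if the linear baseline is already close to it.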

Constraints

Batch inference for 80K listings must finish in under 10 minutes
The business wants partial interpretability via feature importance, partial dependence plots (PDP), or SHAP summaries
Retraining will happen monthly, so tuning cannot require excessive compute
No target leakage from post-listing or post-sale fields
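The latency and leakage constraints above translate into simple guards. The sketch below works out the minimum throughput implied by 80K listings in 10 minutes, extrapolates a timed sample to the full batch, and filters out columns on a deny list. The deny-list names (`closing_date`, `sale_to_list_ratio`, `final_sale_price`) are hypothetical examples of post-sale fields, not columns confirmed to exist in the dataset, and linear extrapolation from a sample is an assumption.

```python
import time

def required_rows_per_second(n_rows=80_000, budget_seconds=600.0):
    """Minimum sustained throughput needed to meet the batch budget."""
    return n_rows / budget_seconds

def estimate_batch_seconds(predict_fn, sample_rows, n_rows=80_000):
    """Time predict_fn on a sample and scale linearly to the full batch."""
    start = time.perf_counter()
    predict_fn(sample_rows)
    elapsed = time.perf_counter() - start
    return elapsed * n_rows / len(sample_rows)

# Hypothetical post-sale fields that must never reach the model.
LEAKY_COLUMNS = {"closing_date", "sale_to_list_ratio", "final_sale_price"}

def drop_leaky_columns(columns, leaky=LEAKY_COLUMNS):
    """Keep only columns that are not on the leakage deny list."""
    return [name for name in columns if name not in leaky]

min_throughput = required_rows_per_second()
safe_columns = drop_leaky_columns(["square_feet", "closing_date", "zip_code"])
```

A deny list only catches known fields; reviewing when each feature becomes available relative to the offer date is still needed.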

Deliverables

Build a preprocessing and training pipeline for a Gradient Boosting regressor.
Define a train/validation/test strategy and justify it.
Tune key hyperparameters and explain the search space.
Evaluate with regression metrics and compare against a baseline.
Summarize feature importance, error patterns, and production tradeoffs.
