Select Features for Ads Quality

Business Context

Google Ads wants to predict whether a newly created search ad will receive a low quality rating within 7 days, so policy and optimization systems can intervene early. You are given a supervised learning dataset and asked to design a feature selection strategy that improves model quality without introducing leakage or making the model too expensive to serve.

Dataset

The training data contains ad-level snapshots: attributes captured at ad creation time, plus performance aggregated over the first 24 hours only.

Feature Group	Count	Examples
Ad text and metadata	18	headline_length, description_length, keyword_match_type, language, device_targeting
Advertiser/account signals	12	account_age_days, prior_policy_strikes, campaign_budget, vertical
Landing page features	10	page_load_ms, mobile_friendly_score, content_length, https_enabled
Early performance aggregates	14	impressions_24h, ctr_24h, avg_cpc_24h, bounce_rate_24h
Geography/time	6	country, hour_created, day_of_week, market_tier

Rows: 420K ads from the last 9 months
Target: low_quality_7d = 1 if the ad is rated low quality within 7 days, else 0
Class balance: 11.5% positive, 88.5% negative
Missing data: ~9% missing in landing page features, ~4% missing in early performance for low-traffic ads

Success Criteria

A good solution should improve validation PR-AUC by at least 10% over a simple all-features logistic regression baseline. Offline feature generation must stay simple enough for daily retraining, and online scoring must take under 20 ms per ad.
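The PR-AUC criterion can be checked with a few lines of code. The sketch below estimates PR-AUC with step-wise average precision (a common estimator, with no special handling of tied scores) and reads the 10% target as a relative gain over the baseline. That reading is an assumption, since the statement does not say whether the gain is relative or absolute. The function names are illustrative only. For reference, a ranker with no signal scores roughly the positive rate, about 0.115 on this dataset.

```python
def average_precision(labels, scores):
    """Step-wise average precision: mean of precision@k over the ranks k
    at which a positive example appears (scores sorted descending)."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    total = 0.0
    for k, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / k
    return total / hits if hits else 0.0


def meets_pr_auc_target(baseline_ap, candidate_ap, min_relative_gain=0.10):
    """True if the candidate beats the baseline by the required relative margin.
    Assumption: the 10% target is relative, not absolute."""
    return candidate_ap >= baseline_ap * (1.0 + min_relative_gain)


if __name__ == "__main__":
    toy_labels = [1, 0, 1, 0]
    toy_scores = [0.9, 0.8, 0.7, 0.1]
    print(average_precision(toy_labels, toy_scores))
    print(meets_pr_auc_target(baseline_ap=0.30, candidate_ap=0.34))
```

In practice both models would be scored on the same held-out validation set, so the comparison reflects the feature set rather than a difference in data.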

Constraints

Avoid target leakage from post-label information
Prefer features that can be computed consistently in Google Ads production pipelines
The final model must remain interpretable enough for QA and policy review
Feature computation cost matters more than squeezing out tiny offline gains
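The leakage constraint can be made mechanical in two places: which features are admitted, and how rows are split. The sketch below is a minimal illustration under stated assumptions. The availability registry, its hour values, the ctr_7d feature, and the created_on field are all hypothetical and not part of the dataset description.

```python
from datetime import date

# Hypothetical registry: hours after ad creation at which each feature is final.
# The values are illustrative assumptions, not facts about the real pipeline.
FEATURE_AVAILABLE_AFTER_HOURS = {
    "headline_length": 0,
    "account_age_days": 0,
    "ctr_24h": 24,
    "bounce_rate_24h": 24,
    "ctr_7d": 168,  # hypothetical aggregate that overlaps the label window
}


def leakage_safe_features(registry, cutoff_hours=24):
    """Keep only features that are fully known by the scoring time."""
    return sorted(name for name, hours in registry.items() if hours <= cutoff_hours)


def out_of_time_split(rows, validation_start):
    """Train on ads created before validation_start, validate on the rest,
    so validation never precedes training in time."""
    train = [r for r in rows if r["created_on"] < validation_start]
    valid = [r for r in rows if r["created_on"] >= validation_start]
    return train, valid


if __name__ == "__main__":
    rows = [
        {"ad_id": 1, "created_on": date(2024, 1, 5)},
        {"ad_id": 2, "created_on": date(2024, 2, 10)},
        {"ad_id": 3, "created_on": date(2024, 3, 15)},
    ]
    train, valid = out_of_time_split(rows, validation_start=date(2024, 3, 1))
    print(leakage_safe_features(FEATURE_AVAILABLE_AFTER_HOURS))
    print(len(train), len(valid))
```

Any selection step (filters, importance rankings, regularization paths) would then be fit on the training portion only, so the validation rows never influence which features survive.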

Deliverables

Propose a feature selection framework for this supervised classification problem.
Build a baseline model and a reduced-feature model.
Justify which features to keep, transform, or remove.
Show how you would validate feature usefulness without leakage.
Report evaluation metrics and explain tradeoffs between performance, interpretability, and serving cost.
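One possible shape for the baseline and reduced-feature deliverables is sketched below with scikit-learn. It runs on synthetic data from make_classification, so the scores it prints say nothing about the real Ads dataset. Using L1-regularized logistic regression as the selector, and the value C=0.05, are choices made here for illustration and are not prescribed by the problem.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 60 features (the total in the table), about 11.5% positives.
X, y = make_classification(
    n_samples=4000, n_features=60, n_informative=10, n_redundant=10,
    weights=[0.885], random_state=0,
)
# The last 1000 rows play the role of a held-out (ideally out-of-time) validation set.
X_tr, y_tr, X_va, y_va = X[:3000], y[:3000], X[3000:], y[3000:]

# Baseline: all features, plain logistic regression.
baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
baseline.fit(X_tr, y_tr)
ap_base = average_precision_score(y_va, baseline.predict_proba(X_va)[:, 1])

# Selector: L1 penalty drives weak coefficients to exactly zero (training data only).
selector = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="l1", solver="liblinear", C=0.05),
)
selector.fit(X_tr, y_tr)
keep = np.flatnonzero(selector[-1].coef_.ravel() != 0)

# Reduced model: refit an ordinary logistic regression on the surviving columns.
reduced = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
reduced.fit(X_tr[:, keep], y_tr)
ap_reduced = average_precision_score(y_va, reduced.predict_proba(X_va[:, keep])[:, 1])

print(f"features kept: {keep.size} of {X.shape[1]}")
print(f"validation AP, baseline: {ap_base:.3f}  reduced: {ap_reduced:.3f}")
```

Refitting a plain model on the kept columns keeps the final coefficients readable for QA and policy review, and the kept-feature count gives a rough handle on serving cost.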
