Business Context
Meta wants to predict whether a Facebook Feed impression will receive a click so ranking teams can compare simple and complex models before deployment. You are asked to explain the bias-variance trade-off with a concrete modeling exercise rather than a purely theoretical answer.
Dataset
You are given an offline training dataset built from historical Facebook Feed impressions.
| Feature Group | Count | Examples |
|---|---|---|
| User features | 8 | account_age_days, prior_ctr_7d, sessions_7d, follows_count |
| Content features | 7 | post_type, media_count, text_length, page_category |
| Context features | 6 | hour_of_day, device_type, network_type, country_tier |
| Interaction features | 5 | user_page_affinity, recent_video_watch_rate, prior_page_ctr |
| Label | 1 | clicked (1) / not clicked (0) |
- Size: 1.2M impressions, 26 input features
- Target: Binary classification — whether the impression was clicked
- Class balance: 11.4% positive, 88.6% negative
- Missing data: ~6% missing in prior engagement features for new or low-activity users; ~2% missing in content metadata
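One possible way to handle the missing values described above is median imputation with explicit missing-value indicators for numeric features, and a dedicated "missing" category for categorical ones. The sketch below assumes scikit-learn and pandas; the four-row frame is made-up illustration data, and the column split is a guess at how a few of the listed features might be typed.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy rows (illustrative only); NaN marks new or low-activity users
# and incomplete content metadata.
df = pd.DataFrame({
    "prior_ctr_7d": [0.12, np.nan, 0.03, 0.40],
    "sessions_7d": [14, 2, np.nan, 30],
    "post_type": ["video", "photo", np.nan, "link"],
    "device_type": ["ios", "android", "android", "web"],
})
numeric_cols = ["prior_ctr_7d", "sessions_7d"]
categorical_cols = ["post_type", "device_type"]

# Numeric: fill with the median and append one 0/1 "was missing" column
# per feature that had gaps, so the model can learn from missingness itself.
numeric = SimpleImputer(strategy="median", add_indicator=True)

# Categorical: treat missing as its own level, then one-hot encode.
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="constant", fill_value="missing")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer(
    [("num", numeric, numeric_cols), ("cat", categorical, categorical_cols)],
    sparse_threshold=0.0,  # always return a dense array
)

features = preprocess.fit_transform(df)
print(features.shape)
```

Fitting the transformer on the training split only, and reusing it for validation and test, keeps imputation statistics from leaking across splits.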
Success Criteria
A strong solution should demonstrate, with metrics, how underfit and overfit models behave, and identify a level of model complexity that generalizes well. At minimum, the chosen model should beat a naive baseline on validation and test log loss, keep the train/validation gap small, and come with an explanation of the observed bias-variance trade-off.
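To make the naive baseline concrete, one reasonable choice (an assumption, since the baseline is not specified) is a model that predicts the overall click rate for every impression. Its log loss depends only on the class balance, so it can be computed directly from the stated 11.4% positive rate. The helper names below are hypothetical.

```python
import math

def constant_log_loss(base_rate: float) -> float:
    """Log loss of always predicting `base_rate` when that is the true positive rate."""
    return -(base_rate * math.log(base_rate)
             + (1.0 - base_rate) * math.log(1.0 - base_rate))

def generalization_gap(train_loss: float, val_loss: float) -> float:
    """Validation loss minus train loss; a large positive value suggests overfitting."""
    return val_loss - train_loss

baseline = constant_log_loss(0.114)
print(f"baseline log loss: {baseline:.4f}")
```

With an 11.4% positive rate this comes to roughly 0.355, so any candidate model would need a validation log loss below that figure to count as an improvement.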
Constraints
- Inference should stay under 10 ms per impression in an online ranking service
- The solution should be interpretable enough to explain why a simpler model may outperform a more complex one on unseen data
- Training should be feasible on a single machine for experimentation
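The latency constraint can be checked offline with a rough micro-benchmark. The sketch below times a hypothetical single-impression scorer (a hand-written logistic regression over 26 features) and reports median and 99th-percentile latency. The scorer, weights, and call count are illustrative assumptions, and an offline timing like this ignores feature fetching and network overhead in a real ranking service.

```python
import math
import random
import statistics
import time

def score_impression(features, weights, bias):
    """Hypothetical scorer: logistic regression evaluated for one impression."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def latency_percentiles_ms(predict, n_calls=2000):
    """Call `predict` repeatedly and return p50/p99 wall-clock latency in ms."""
    samples = []
    for _ in range(n_calls):
        start = time.perf_counter()
        predict()
        samples.append((time.perf_counter() - start) * 1000.0)
    cuts = statistics.quantiles(samples, n=100)  # 99 percentile cut points
    return {"p50": cuts[49], "p99": cuts[98]}

random.seed(0)
weights = [random.uniform(-1, 1) for _ in range(26)]
features = [random.gauss(0, 1) for _ in range(26)]

stats = latency_percentiles_ms(lambda: score_impression(features, weights, -2.0))
print(f"p50={stats['p50']:.4f} ms, p99={stats['p99']:.4f} ms (budget: 10 ms)")
```

Reporting a tail percentile rather than the mean is a deliberate choice here: a per-impression budget is usually violated by slow outliers, not by the typical call.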
Deliverables
- Train at least three models of increasing complexity (for example, a regularized logistic regression, a shallow decision tree, and either a deep decision tree or a high-degree polynomial model).
- Compare train, validation, and test performance using appropriate classification metrics.
- Explain which model shows high bias, which shows high variance, and how regularization or pruning changes the trade-off.
- Recommend a production-ready model for Facebook Feed CTR prediction under the stated latency constraint.
- Provide concise Python code that reproduces preprocessing, training, and evaluation.
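The deliverables above could be approached along the lines of the following minimal sketch. It assumes scikit-learn and uses synthetic data as a stand-in for the Feed impressions, so the numbers it prints say nothing about real CTR. The model names, hyperparameters, and the 60/20/20 split are illustrative choices, and depth and leaf-size limits stand in for pruning.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the Feed dataset (assumption, not real data):
# 26 numeric features with a partly nonlinear click signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(12_000, 26))
logit = -2.5 + 1.0 * X[:, 0] + 0.8 * X[:, 1] * X[:, 2] - 0.4 * X[:, 3] ** 2
y = (rng.random(len(X)) < 1.0 / (1.0 + np.exp(-logit))).astype(int)

# 60/20/20 stratified split into train, validation, and test.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=0)
splits = {"train": (X_train, y_train),
          "val": (X_val, y_val),
          "test": (X_test, y_test)}

# Models ordered roughly from least to most flexible.
models = {
    "stump": DecisionTreeClassifier(max_depth=1, random_state=0),
    "logreg": LogisticRegression(C=1.0, max_iter=1000),
    "limited_tree": DecisionTreeClassifier(
        max_depth=6, min_samples_leaf=200, random_state=0),
    "deep_tree": DecisionTreeClassifier(random_state=0),  # no depth limit
}

# Naive baseline: predict the training click rate for every impression.
base_rate = y_train.mean()
results = {"baseline": {
    name: log_loss(y_part, np.full(len(y_part), base_rate), labels=[0, 1])
    for name, (_, y_part) in splits.items()
}}

for model_name, model in models.items():
    model.fit(X_train, y_train)
    results[model_name] = {
        name: log_loss(y_part, model.predict_proba(X_part)[:, 1], labels=[0, 1])
        for name, (X_part, y_part) in splits.items()
    }

print(f"{'model':<14}{'train':>9}{'val':>9}{'test':>9}{'gap':>9}")
for model_name, scores in results.items():
    gap = scores["val"] - scores["train"]
    print(f"{model_name:<14}{scores['train']:>9.4f}{scores['val']:>9.4f}"
          f"{scores['test']:>9.4f}{gap:>9.4f}")
```

When reading the table, a model whose train and validation losses are both close to the baseline is a candidate for high bias, while a near-zero train loss paired with a much larger validation loss (typical of the unrestricted tree) points to high variance. Tightening `max_depth` or `min_samples_leaf`, or lowering `C` in the logistic regression, moves a model toward the high-bias end of that spectrum.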