Bagging vs Boosting for Integrity Risk

Business Context

Meta's Integrity team wants a model to predict whether a newly created Facebook account will be actioned for coordinated spam or fake engagement within 7 days. You need to compare bagging and boosting in a realistic classification setting and recommend which ensemble approach should be deployed.

Dataset

You are given a training table built from account creation, graph, and early activity signals.

Feature Group	Count	Examples
Account metadata	8	account_age_hours, signup_surface, country, device_os
Activity features	14	posts_first_24h, friend_requests_sent, groups_joined, outbound_message_count
Graph features	9	accepted_request_rate, clustering_coefficient, mutual_friends_p50
Integrity heuristics	6	prior_device_risk, IP_reputation_score, velocity_bucket
Temporal features	5	hour_of_day_created, weekend_signup, session_gap_minutes

Size: 420K accounts, 42 features
Target: enforced_7d — 1 if the account is actioned within 7 days, else 0
Class balance: 6.4% positive, 93.6% negative
Missing data: ~12% missing in graph features for cold-start accounts; ~4% missing in device-level fields
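
As a quick sanity check on the numbers above, the class balance can be turned into absolute counts and a no-skill reference point. The sketch below is illustrative arithmetic only: the variable names are hypothetical, and the 0.42 figure is the PR-AUC target quoted under Success Criteria. It relies on the fact that a scorer with no skill has an expected PR-AUC roughly equal to the positive rate.

```python
# Counts implied by the stated dataset size and class balance.
n_accounts = 420_000
pos_rate = 0.064

n_pos = round(n_accounts * pos_rate)   # accounts with enforced_7d == 1
n_neg = n_accounts - n_pos             # accounts with enforced_7d == 0

# Negative-to-positive ratio, often used as a starting point for class weights.
neg_to_pos = n_neg / n_pos

# A no-skill scorer has expected PR-AUC close to the positive rate,
# so the PR-AUC target can be read as a multiple of that baseline.
pr_auc_target = 0.42
lift_over_no_skill = pr_auc_target / pos_rate

print(n_pos, n_neg, neg_to_pos, round(lift_over_no_skill, 2))
```

Under these assumptions there are about 26.9K positives, roughly 14.6 negatives per positive, and the PR-AUC target sits at about 6.6 times the no-skill level.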

Success Criteria

A good solution should:

Beat a single decision tree baseline by a meaningful margin
Show a clear comparison between a bagging method and a boosting method
Achieve PR-AUC >= 0.42 and recall >= 0.75 at precision >= 0.30 on the test set
Explain which method is preferable for Meta's production constraints
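
The "recall at precision >= 0.30" criterion can be read as: over all score thresholds whose precision is at least 0.30, report the highest recall. A minimal pure-Python sketch of that reading follows. The function name `recall_at_precision` and the toy labels and scores are assumptions for illustration, not part of the problem statement.

```python
def recall_at_precision(y_true, scores, min_precision=0.30):
    """Highest recall over all thresholds whose precision >= min_precision."""
    total_pos = sum(y_true)
    if total_pos == 0:
        raise ValueError("y_true contains no positive labels")

    # Walk accounts from highest to lowest score; each distinct score is a threshold.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    best_recall = 0.0
    i = 0
    while i < len(order):
        threshold = scores[order[i]]
        # Accounts with tied scores are flagged together.
        while i < len(order) and scores[order[i]] == threshold:
            if y_true[order[i]] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        precision = tp / (tp + fp)
        recall = tp / total_pos
        if precision >= min_precision:
            best_recall = max(best_recall, recall)
    return best_recall


# Toy example (hypothetical data): 3 positives among 10 accounts.
toy_labels = [1, 0, 1, 0, 0, 0, 0, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
print(recall_at_precision(toy_labels, toy_scores, min_precision=0.30))
print(recall_at_precision(toy_labels, toy_scores, min_precision=0.60))
```

In the toy example, a stricter precision floor lowers the achievable recall, which is the tradeoff the criterion is meant to pin down.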

Constraints

Batch scoring runs every 15 minutes; p95 inference latency should stay under 50 ms per 1K accounts
The model must support feature importance analysis for Integrity analysts
Retraining happens weekly; the pipeline should tolerate moderate feature drift
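
One common way to quantify the feature drift mentioned above is the Population Stability Index (PSI), computed per feature between a reference sample (for example, the training week) and a recent scoring window. The sketch below is one possible implementation, not a prescribed method. The function name `psi`, the bin count, and the smoothing constant `eps` are all assumptions.

```python
import math


def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population Stability Index between a reference and a recent sample."""
    ref = sorted(expected)
    # Bin edges are quantiles of the reference sample.
    edges = [ref[int(len(ref) * k / n_bins)] for k in range(1, n_bins)]

    def bin_fractions(values):
        counts = [0] * n_bins
        for v in values:
            # Bin index = number of edges strictly below the value.
            counts[sum(v > e for e in edges)] += 1
        # Floor each fraction at eps so the log term stays finite.
        return [max(c / len(values), eps) for c in counts]

    p = bin_fractions(expected)
    q = bin_fractions(actual)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


# Hypothetical feature values: identical samples vs. a shifted sample.
reference = list(range(1000))
shifted = [v + 300 for v in reference]
print(psi(reference, reference))
print(psi(reference, shifted))
```

Rule-of-thumb cutoffs such as 0.1 and 0.25 are often quoted for PSI, but the right alert thresholds for this pipeline would have to be calibrated on historical weeks.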

Deliverables

Train a single decision tree baseline, a bagging model, and a boosting model.
Explain the conceptual difference between bagging and boosting in the context of this dataset.
Compare performance using PR-AUC, ROC-AUC, recall at fixed precision, and calibration.
Recommend one approach for deployment and justify the tradeoffs.
Describe how you would monitor degradation after launch.
