## Business Context
Meta is building a binary classifier to detect policy-violating Marketplace listings before they are shown broadly. Violations are rare, but missing them is costly, so the ranking and integrity teams need the right offline metric to compare models and set thresholds.
## Dataset
You are given a labeled offline training set of Marketplace listings with model scores from a baseline logistic regression and a candidate gradient-boosted tree model.
| Feature Group | Count | Examples |
|---|---|---|
| Listing metadata | 12 | price, category, seller_age_days, image_count |
| Seller behavior | 9 | prior_reports_30d, refund_rate, listing_velocity |
| Text-derived signals | 15 | keyword_risk_score, title_length, embedding_cluster_id |
| Image/model scores | 6 | vision_risk_score, OCR_risk_score, baseline_score, candidate_score |
- Rows: 2.4M listings from the last 90 days
- Target: is_violation (1 if the listing was confirmed violating, else 0)
- Class balance: 0.9% positive, 99.1% negative
- Missing data: ~8% missing in seller history for new sellers; ~3% missing OCR/image features
## Success Criteria
A strong solution should:
- Explain ROC-AUC and PR-AUC clearly and mathematically
- Show why ROC-AUC can look strong even when positive-class performance is weak under heavy imbalance
- Use the provided scores to compute both metrics and recommend which one should drive model selection
- Propose an operating threshold aligned with an integrity-review queue
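The imbalance point above can be demonstrated concretely. The sketch below uses simulated labels and scores (the real Marketplace data is not shown here), with the positive rate set near the stated ~0.9% prevalence, to show ROC-AUC looking strong while PR-AUC stays far lower.

```python
# Sketch: ROC-AUC vs PR-AUC under heavy class imbalance.
# All data is synthetic; the score distribution is a hypothetical stand-in
# for a model with moderate separation between classes.
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)
n = 200_000
y = (rng.random(n) < 0.009).astype(int)     # ~0.9% positives, as in the dataset
scores = rng.normal(0.0, 1.0, n) + 2.0 * y  # positives shifted higher by 2 std devs

roc = roc_auc_score(y, scores)
pr = average_precision_score(y, scores)      # PR-AUC (average precision)
print(f"ROC-AUC: {roc:.3f}")  # looks strong
print(f"PR-AUC : {pr:.3f}")   # much lower: precision suffers at 110:1 imbalance
```

Note that the PR-AUC baseline for a random classifier equals the prevalence (~0.009), while the ROC-AUC baseline is 0.5 regardless of imbalance, which is why ROC-AUC alone can hide weak positive-class performance.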
## Constraints
- Review capacity is fixed: only the top 2% highest-risk listings can be sent to human review
- The final explanation must be understandable to both ML engineers and policy operations partners
- Inference is near-real-time, so the chosen approach should not require expensive post-processing
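With review capacity fixed at the top 2%, the operating threshold is not a free parameter: it is the 98th percentile of the score distribution. A minimal sketch, using hypothetical scores in place of the real candidate model's output:

```python
# Sketch: map a fixed review-queue capacity to a score threshold.
# Scores here are simulated placeholders for the candidate model's output.
import numpy as np

rng = np.random.default_rng(1)
scores = rng.random(100_000)                 # hypothetical model scores in [0, 1]

capacity = 0.02                              # top 2% goes to human review
threshold = np.quantile(scores, 1 - capacity)
flagged = scores >= threshold
print(f"threshold: {threshold:.4f}, flagged fraction: {flagged.mean():.4f}")
```

In production this quantile would be estimated on a recent traffic window and refreshed periodically, since score drift would otherwise push the flagged volume above or below review capacity.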
## Deliverables
- Compute ROC-AUC and PR-AUC for both models on a held-out set.
- Explain when each metric is appropriate and why class imbalance matters.
- Recommend which model Meta should ship and justify the decision.
- Select a threshold for the review queue and report precision, recall, and confusion-matrix counts at that threshold.
- Briefly discuss how calibration and prevalence shifts would affect interpretation of these metrics.
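The threshold-level deliverables can be sketched end to end on simulated data: pick the top-2% threshold, then report precision, recall, and confusion-matrix counts at that operating point. The labels and scores below are synthetic assumptions, not the actual evaluation set.

```python
# Sketch: precision, recall, and confusion counts at the top-2% threshold.
# Synthetic data; prevalence mirrors the stated ~0.9% positive rate.
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

rng = np.random.default_rng(2)
n = 100_000
y = (rng.random(n) < 0.009).astype(int)
scores = rng.normal(0.0, 1.0, n) + 2.0 * y   # hypothetical candidate-model scores

thr = np.quantile(scores, 0.98)              # top 2% = fixed review capacity
pred = (scores >= thr).astype(int)

tn, fp, fn, tp = confusion_matrix(y, pred).ravel()
prec = precision_score(y, pred)
rec = recall_score(y, pred)
print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"precision={prec:.3f} recall={rec:.3f}")
```

At a fixed threshold, precision moves with prevalence: if violation prevalence halves (e.g. after a policy change upstream), the same threshold flags roughly the same volume but catches fewer true positives, so precision drops even though the model is unchanged. This is why the final write-up should report metrics alongside the prevalence they were measured at, and why calibrated scores make threshold choices easier to reason about across time.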