## Business Context
Meta is building a binary classifier to detect policy-violating Marketplace listings before they are shown broadly. Violations are rare, but missing them is costly, so the ranking and integrity teams need the right offline metric to compare models and set thresholds.
## Dataset
You are given a labeled offline training set of Marketplace listings with model scores from a baseline logistic regression and a candidate gradient-boosted tree model.
| Feature Group | Count | Examples |
|---|---|---|
| Listing metadata | 12 | price, category, seller_age_days, image_count |
| Seller behavior | 9 | prior_reports_30d, refund_rate, listing_velocity |
| Text-derived signals | 15 | keyword_risk_score, title_length, embedding_cluster_id |
| Image/model scores | 6 | vision_risk_score, OCR_risk_score, baseline_score, candidate_score |
- Rows: 2.4M listings from the last 90 days
- Target: is_violation (1 if the listing was confirmed violating, else 0)
- Class balance: 0.9% positive, 99.1% negative
- Missing data: ~8% missing in seller history for new sellers; ~3% missing OCR/image features
## Success Criteria
A strong solution should:
- Explain ROC-AUC and PR-AUC clearly and mathematically
- Show why ROC-AUC can look strong even when positive-class performance is weak under heavy imbalance
- Use the provided scores to compute both metrics and recommend which one should drive model selection
- Propose an operating threshold aligned with an integrity-review queue
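The imbalance point above can be demonstrated concretely. The sketch below uses simulated labels and scores (the real Marketplace data is not shown here), with the positive rate set near the stated ~0.9% prevalence, to show ROC-AUC looking strong while PR-AUC stays far lower.

```python
# Sketch: ROC-AUC vs PR-AUC under heavy class imbalance.
# All data is synthetic; the score distribution is a hypothetical stand-in
# for a model with moderate separation between classes.
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)
n = 200_000
y = (rng.random(n) < 0.009).astype(int)     # ~0.9% positives, as in the dataset
scores = rng.normal(0.0, 1.0, n) + 2.0 * y  # positives shifted higher by 2 std devs

roc = roc_auc_score(y, scores)
pr = average_precision_score(y, scores)      # PR-AUC (average precision)
print(f"ROC-AUC: {roc:.3f}")  # looks strong
print(f"PR-AUC : {pr:.3f}")   # much lower: precision suffers at 110:1 imbalance
```

Note that the PR-AUC baseline for a random classifier equals the prevalence (~0.009), while the ROC-AUC baseline is 0.5 regardless of imbalance, which is why ROC-AUC alone can hide weak positive-class performance.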
## Constraints
- Review capacity is fixed: only the top 2% highest-risk listings can be sent to human review
- The final explanation must be understandable to both ML engineers and policy operations partners
- Inference is near-real-time, so the chosen approach should not require expensive post-processing
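With review capacity fixed at the top 2%, the operating threshold is not a free parameter: it is the 98th percentile of the score distribution. A minimal sketch, using hypothetical scores in place of the real candidate model's output:

```python
# Sketch: map a fixed review-queue capacity to a score threshold.
# Scores here are simulated placeholders for the candidate model's output.
import numpy as np

rng = np.random.default_rng(1)
scores = rng.random(100_000)                 # hypothetical model scores in [0, 1]

capacity = 0.02                              # top 2% goes to human review
threshold = np.quantile(scores, 1 - capacity)
flagged = scores >= threshold
print(f"threshold: {threshold:.4f}, flagged fraction: {flagged.mean():.4f}")
```

In production this quantile would be estimated on a recent traffic window and refreshed periodically, since score drift would otherwise push the flagged volume above or below review capacity.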
## Deliverables
- Compute ROC-AUC and PR-AUC for both models on a held-out set.
- Explain when each metric is appropriate and why class imbalance matters.
- Recommend which model Meta should ship and justify the decision.
- Select a threshold for the review queue and report precision, recall, and confusion-matrix counts at that threshold.
- Briefly discuss how calibration and prevalence shifts would affect interpretation of these metrics.
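The threshold-level deliverables can be sketched end to end on simulated data: pick the top-2% threshold, then report precision, recall, and confusion-matrix counts at that operating point. The labels and scores below are synthetic assumptions, not the actual evaluation set.

```python
# Sketch: precision, recall, and confusion counts at the top-2% threshold.
# Synthetic data; prevalence mirrors the stated ~0.9% positive rate.
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

rng = np.random.default_rng(2)
n = 100_000
y = (rng.random(n) < 0.009).astype(int)
scores = rng.normal(0.0, 1.0, n) + 2.0 * y   # hypothetical candidate-model scores

thr = np.quantile(scores, 0.98)              # top 2% = fixed review capacity
pred = (scores >= thr).astype(int)

tn, fp, fn, tp = confusion_matrix(y, pred).ravel()
prec = precision_score(y, pred)
rec = recall_score(y, pred)
print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"precision={prec:.3f} recall={rec:.3f}")
```

At a fixed threshold, precision moves with prevalence: if violation prevalence halves (e.g. after a policy change upstream), the same threshold flags roughly the same volume but catches fewer true positives, so precision drops even though the model is unchanged. This is why the final write-up should report metrics alongside the prevalence they were measured at, and why calibrated scores make threshold choices easier to reason about across time.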