Business Context
You’re on the Trust & Safety ML team at a Mercari-like recommerce marketplace operating in North America and Japan, with 18M monthly active users and ~6.5M new listings per day. Fraudsters create coordinated rings to post scam listings, launder payments, and evade bans. The business impact is material: fraud losses run $8–12M per quarter, plus chargeback fees and reputational risk. Your team is asked to design a modeling approach that combines multiple signals (listing text, images, and the user–device–payment graph) to produce a risk score per listing at creation time.
The interview prompt is: “What are the different types of neural network architectures and their applications?” In this interview, you must answer that question by applying it to a real production system: choose architectures that match each data modality, justify the trade-offs, and propose an end-to-end training and evaluation plan.
Dataset
You have 90 days of labeled data from investigations and chargebacks.
| Feature Group | Scale / Shape | Examples | Notes |
|---|---|---|---|
| Listing text | 1.2M listings, avg 55 tokens | title, description, category path | Multilingual (en/ja), heavy templating/spam |
| Listing images | 1.2M listings, 1–6 images/listing | JPEGs, 256–1024px | Some near-duplicates, adversarial crops |
| Behavioral aggregates | 1.2M rows, 40 numeric | account_age_days, prior_refunds, velocity features | Strong leakage risk if not time-split |
| Entity graph | ~28M nodes, ~110M edges | user↔device, user↔payment, user↔shipping | Dynamic; new nodes daily |
| Target | binary | fraud_within_14d | Label delay; some positives appear after 14 days |
- Class balance: ~0.7% confirmed fraud positives, ~99.3% negatives.
- Missingness: ~12% of listings have no images; ~8% have truncated text; graph edges are missing for brand-new users.
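The time-split caveat in the behavioral-aggregates row, combined with the 14-day label delay, suggests splitting by listing creation time and leaving a maturation gap so that only finalized labels reach the training set. A minimal sketch (pandas; the column names follow the dataset table, the cutoff date is illustrative):

```python
import pandas as pd

def time_split(df: pd.DataFrame, train_end: str, label_delay_days: int = 14):
    """Split listings by creation time, leaving a gap so every training
    label (fraud_within_14d) has had time to mature before train_end."""
    train_end = pd.Timestamp(train_end)
    # Training set: listings old enough that their 14-day label is final.
    train = df[df["created_at"] < train_end - pd.Timedelta(days=label_delay_days)]
    # Validation set: strictly later listings, so no temporal leakage.
    valid = df[df["created_at"] >= train_end]
    return train, valid

# Illustrative usage on a tiny frame; the middle listing falls in the
# maturation gap and is dropped from both splits.
df = pd.DataFrame({
    "created_at": pd.to_datetime(["2024-01-01", "2024-01-20", "2024-02-10"]),
    "fraud_within_14d": [0, 1, 0],
})
train, valid = time_split(df, train_end="2024-02-01")
```

Discarding the gap listings costs some data, but it is the simplest way to guarantee that no training label was finalized after the validation window opened.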
Success Criteria
- Operational goal: At listing creation, flag a review queue of at most 0.8% of daily listings (~52K/day).
- Quality goal: Achieve ≥ 65% recall on confirmed fraud while keeping precision ≥ 20% at the chosen threshold (review team capacity constraint).
- Stability goal: Performance degradation under distribution shift (new scam campaigns) should be limited; weekly monitoring must detect drift.
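Because the review queue is capacity-bound, the operating threshold can be derived directly from the score distribution: flag exactly the top 0.8% of listings by risk score, then read precision and recall off the labels at that cutoff. A sketch of that calculation (NumPy; the scores and labels here are synthetic):

```python
import numpy as np

def threshold_at_capacity(scores: np.ndarray, capacity_frac: float = 0.008) -> float:
    """Score cutoff that sends exactly `capacity_frac` of listings
    to the review queue (top-k by risk score)."""
    return float(np.quantile(scores, 1.0 - capacity_frac))

def precision_recall_at(scores, labels, thresh):
    flagged = scores >= thresh
    tp = int(np.sum(flagged & (labels == 1)))
    precision = tp / max(int(flagged.sum()), 1)
    recall = tp / max(int((labels == 1).sum()), 1)
    return precision, recall

# Synthetic example: 100k listings, ~1-2% positives correlated with score.
rng = np.random.default_rng(0)
scores = rng.random(100_000)
labels = (scores + 0.3 * rng.random(100_000) > 1.2).astype(int)
thresh = threshold_at_capacity(scores)
p, r = precision_recall_at(scores, labels, thresh)
```

In production the quantile would be estimated on a recent holdout window and re-fit weekly, since score distributions drift as new scam campaigns appear.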
Constraints
- Latency: p95 end-to-end scoring < 120 ms per listing (including feature fetch). Images may be processed asynchronously, but an initial score must be available.
- Compute: Training budget is 8×A100 for 6 hours daily; inference runs on CPU + limited GPU pool.
- Compliance: Must provide human-readable reasons for review decisions; avoid using protected attributes.
- Cold-start: Must handle brand-new accounts with minimal graph history.
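The cold-start and async-image constraints imply a scorer that degrades gracefully when a modality is absent. One hypothetical way to route around missing signals is a late-fusion score that renormalizes over whatever submodels could run (the submodel scores are stubbed and the weights are illustrative, not tuned):

```python
def fuse_score(text_s, image_s, graph_s, behav_s):
    """Late-fusion risk score that renormalizes over available modalities.
    `None` means the signal is missing (images still processing
    asynchronously, or a brand-new account with no graph history)."""
    weighted = [(0.35, text_s), (0.20, image_s), (0.25, graph_s), (0.20, behav_s)]
    avail = [(w, s) for w, s in weighted if s is not None]
    total_w = sum(w for w, _ in avail)
    return sum(w * s for w, s in avail) / total_w

# Brand-new account at listing creation: text and behavior only.
initial = fuse_score(text_s=0.9, image_s=None, graph_s=None, behav_s=0.7)
```

This also satisfies the latency constraint naturally: an initial score ships from the fast modalities, and the listing is rescored once image and graph features arrive.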
Deliverables
- Propose at least 4 neural architectures (e.g., CNN/ViT, RNN/Transformer, GNN, MLP) and map each to dataset modalities and tasks.
- Recommend a final system design (single model vs late-fusion ensemble vs two-stage pipeline) and justify it.
- Define a training and validation strategy that avoids leakage and handles label delay.
- Specify evaluation metrics and thresholding aligned to review capacity.
- Provide a minimal but production-oriented Python training skeleton (can be simplified) showing preprocessing, model(s), and evaluation.
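As a starting point for the training-skeleton deliverable, here is a deliberately minimal NumPy sketch: a weighted logistic head on the 40 behavioral aggregates, with class re-weighting for the ~0.7% positive rate. A real system would replace this with the multimodal towers the other deliverables call for; all data below is synthetic:

```python
import numpy as np

def train_logistic(X, y, pos_weight=50.0, lr=0.1, epochs=200):
    """Weighted logistic regression via batch gradient descent.
    pos_weight upweights the rare fraud class (~0.7% positives)."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    sample_w = np.where(y == 1, pos_weight, 1.0)
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid
        g = sample_w * (p - y)                  # per-sample weighted BCE gradient
        w -= lr * (X.T @ g) / n
        b -= lr * g.mean()
    return w, b

def predict(X, w, b):
    return 1.0 / (1.0 + np.exp(-(X @ w + b)))

# Synthetic imbalanced data: one informative feature out of 40,
# positives occur at roughly a 1% rate.
rng = np.random.default_rng(1)
X = rng.normal(size=(20_000, 40))
y = (X[:, 0] + 0.5 * rng.normal(size=20_000) > 2.6).astype(float)
w, b = train_logistic(X, y)
scores = predict(X, w, b)
```

The same skeleton generalizes: swap the linear map for an MLP, attach text/image/graph encoders as input towers, and evaluate with the capacity-constrained thresholding described under Success Criteria rather than accuracy.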