Business Context
You’re on the Trust & Safety ML team at a Mercari-like recommerce marketplace operating in North America and Japan, with 18M monthly active users and ~6.5M new listings per day. Fraudsters create coordinated rings to post scam listings, launder payments, and evade bans. The business impact is material: fraud losses run $8–12M per quarter, plus chargeback fees and reputational risk. Your team is asked to design a modeling approach that combines multiple signals (listing text, images, and the user–device–payment graph) to produce a risk score per listing at creation time.
The interview prompt is: “What are the different types of neural network architectures and their applications?” In this interview, you must answer that question by applying it to a real production system: choose architectures that match each data modality, justify the trade-offs, and propose an end-to-end training and evaluation plan.
Dataset
You have 90 days of labeled data from investigations and chargebacks.
| Feature Group | Scale / Shape | Examples | Notes |
|---|---|---|---|
| Listing text | 1.2M listings, avg 55 tokens | title, description, category path | Multilingual (en/ja), heavy templating/spam |
| Listing images | 1.2M listings, 1–6 images/listing | JPEGs, 256–1024px | Some near-duplicates, adversarial crops |
| Behavioral aggregates | 1.2M rows, 40 numeric | account_age_days, prior_refunds, velocity features | Strong leakage risk if not time-split |
| Entity graph | ~28M nodes, ~110M edges | user↔device, user↔payment, user↔shipping | Dynamic; new nodes daily |
| Target | binary | fraud_within_14d | Label delay; some positives appear after 14 days |
- Class balance: ~0.7% confirmed fraud positives, ~99.3% negatives.
- Missingness: ~12% of listings have no images; ~8% have truncated text; graph edges are missing for brand-new users.
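The time-split caveat in the behavioral-aggregates row, combined with the 14-day label delay, suggests splitting by listing creation time and leaving a maturation gap so that only finalized labels reach the training set. A minimal sketch (pandas; the column names follow the dataset table, the cutoff date is illustrative):

```python
import pandas as pd

def time_split(df: pd.DataFrame, train_end: str, label_delay_days: int = 14):
    """Split listings by creation time, leaving a gap so every training
    label (fraud_within_14d) has had time to mature before train_end."""
    train_end = pd.Timestamp(train_end)
    # Training set: listings old enough that their 14-day label is final.
    train = df[df["created_at"] < train_end - pd.Timedelta(days=label_delay_days)]
    # Validation set: strictly later listings, so no temporal leakage.
    valid = df[df["created_at"] >= train_end]
    return train, valid

# Illustrative usage on a tiny frame; the middle listing falls in the
# maturation gap and is dropped from both splits.
df = pd.DataFrame({
    "created_at": pd.to_datetime(["2024-01-01", "2024-01-20", "2024-02-10"]),
    "fraud_within_14d": [0, 1, 0],
})
train, valid = time_split(df, train_end="2024-02-01")
```

Discarding the gap listings costs some data, but it is the simplest way to guarantee that no training label was finalized after the validation window opened.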
Success Criteria
- Operational goal: At listing creation, flag a review queue of at most 0.8% of daily listings (~52K/day).
- Quality goal: Achieve ≥ 65% recall on confirmed fraud while keeping precision ≥ 20% at the chosen threshold (review team capacity constraint).
- Stability goal: Performance degradation under distribution shift (new scam campaigns) should be limited; weekly monitoring must detect drift.
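Because the review queue is capacity-bound, the operating threshold can be derived directly from the score distribution: flag exactly the top 0.8% of listings by risk score, then read precision and recall off the labels at that cutoff. A sketch of that calculation (NumPy; the scores and labels here are synthetic):

```python
import numpy as np

def threshold_at_capacity(scores: np.ndarray, capacity_frac: float = 0.008) -> float:
    """Score cutoff that sends exactly `capacity_frac` of listings
    to the review queue (top-k by risk score)."""
    return float(np.quantile(scores, 1.0 - capacity_frac))

def precision_recall_at(scores, labels, thresh):
    flagged = scores >= thresh
    tp = int(np.sum(flagged & (labels == 1)))
    precision = tp / max(int(flagged.sum()), 1)
    recall = tp / max(int((labels == 1).sum()), 1)
    return precision, recall

# Synthetic example: 100k listings, ~1-2% positives correlated with score.
rng = np.random.default_rng(0)
scores = rng.random(100_000)
labels = (scores + 0.3 * rng.random(100_000) > 1.2).astype(int)
thresh = threshold_at_capacity(scores)
p, r = precision_recall_at(scores, labels, thresh)
```

In production the quantile would be estimated on a recent holdout window and re-fit weekly, since score distributions drift as new scam campaigns appear.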
Constraints
- Latency: p95 end-to-end scoring < 120 ms per listing (including feature fetch). Images may be processed asynchronously, but an initial score must be available.
- Compute: Training budget is 8×A100 for 6 hours daily; inference runs on CPU + limited GPU pool.
- Compliance: Must provide human-readable reasons for review decisions; avoid using protected attributes.
- Cold-start: Must handle brand-new accounts with minimal graph history.
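The cold-start and async-image constraints imply a scorer that degrades gracefully when a modality is absent. One hypothetical way to route around missing signals is a late-fusion score that renormalizes over whatever submodels could run (the submodel scores are stubbed and the weights are illustrative, not tuned):

```python
def fuse_score(text_s, image_s, graph_s, behav_s):
    """Late-fusion risk score that renormalizes over available modalities.
    `None` means the signal is missing (images still processing
    asynchronously, or a brand-new account with no graph history)."""
    weighted = [(0.35, text_s), (0.20, image_s), (0.25, graph_s), (0.20, behav_s)]
    avail = [(w, s) for w, s in weighted if s is not None]
    total_w = sum(w for w, _ in avail)
    return sum(w * s for w, s in avail) / total_w

# Brand-new account at listing creation: text and behavior only.
initial = fuse_score(text_s=0.9, image_s=None, graph_s=None, behav_s=0.7)
```

This also satisfies the latency constraint naturally: an initial score ships from the fast modalities, and the listing is rescored once image and graph features arrive.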
Deliverables
- Propose at least 4 neural architectures (e.g., CNN/ViT, RNN/Transformer, GNN, MLP) and map each to dataset modalities and tasks.
- Recommend a final system design (single model vs late-fusion ensemble vs two-stage pipeline) and justify it.
- Define a training and validation strategy that avoids leakage and handles label delay.
- Specify evaluation metrics and thresholding aligned to review capacity.
- Provide a minimal but production-oriented Python training skeleton (can be simplified) showing preprocessing, model(s), and evaluation.
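As a starting point for the training-skeleton deliverable, here is a deliberately minimal NumPy sketch: a weighted logistic head on the 40 behavioral aggregates, with class re-weighting for the ~0.7% positive rate. A real system would replace this with the multimodal towers the other deliverables call for; all data below is synthetic:

```python
import numpy as np

def train_logistic(X, y, pos_weight=50.0, lr=0.1, epochs=200):
    """Weighted logistic regression via batch gradient descent.
    pos_weight upweights the rare fraud class (~0.7% positives)."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    sample_w = np.where(y == 1, pos_weight, 1.0)
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid
        g = sample_w * (p - y)                  # per-sample weighted BCE gradient
        w -= lr * (X.T @ g) / n
        b -= lr * g.mean()
    return w, b

def predict(X, w, b):
    return 1.0 / (1.0 + np.exp(-(X @ w + b)))

# Synthetic imbalanced data: one informative feature out of 40,
# positives occur at roughly a 1% rate.
rng = np.random.default_rng(1)
X = rng.normal(size=(20_000, 40))
y = (X[:, 0] + 0.5 * rng.normal(size=20_000) > 2.6).astype(float)
w, b = train_logistic(X, y)
scores = predict(X, w, b)
```

The same skeleton generalizes: swap the linear map for an MLP, attach text/image/graph encoders as input towers, and evaluate with the capacity-constrained thresholding described under Success Criteria rather than accuracy.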