Pretrained vs Scratch Product Classification

Business Context

You’re on the ML Platform team at MercuryMart, a global e-commerce marketplace with 35M monthly active users and ~120M active SKUs. A new regulatory policy in the EU requires the company to correctly classify products into a restricted set of 42 compliance categories (e.g., medical devices, children’s products, lithium batteries) to drive downstream workflows: listing eligibility, shipping restrictions, and mandatory disclosures.

Misclassification has real cost: false negatives can trigger regulatory fines and forced delistings, while false positives reduce seller revenue and harm trust. The business wants a model in production within 8 weeks, and the system must support near-real-time classification for newly created listings.

The key decision is whether to fine-tune a pre-trained model (transfer learning) or train a model from scratch. You must propose and justify an approach, including a plan to evaluate trade-offs and deploy safely.

Dataset

Each listing has a title, a description, and structured metadata. Labels come from a combination of seller-provided categories and human review; label noise is non-trivial.

Feature Group	Type	Approx. Columns	Examples	Notes
Text	unstructured	2	title, description	multilingual (EN/DE/FR/ES); titles average 35 tokens, descriptions average 180 tokens
Taxonomy metadata	categorical	6	seller_category_path, brand, country, seller_tier	brand is high-cardinality (~400K values)
Numeric	numerical	10	price, weight, dimensions, battery_capacity_mah	missingness varies by category
Images (optional)	unstructured	1-6 per item	main_image_url(s)	only 60% of listings have images at creation time

Size: ~8.5M labeled listings over 18 months
Target: 42-way multiclass compliance category
Class balance: long-tailed; the top 5 classes account for 62% of the data, and each of the bottom 10 classes accounts for <0.3%
Missing data: 20% of listings lack numeric shipping attributes; 40% lack images; 8% lack descriptions
Label noise: estimated 3–7% due to seller mislabeling and policy changes
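To make the scale of the long tail concrete, the figures above can be turned into rough example counts. This is only a back-of-the-envelope sketch: it treats 18 months as about 78 weeks and assumes listings arrive uniformly over time, neither of which is stated in the problem. All variable names are illustrative.

```python
# Rough example counts implied by the dataset description.
# Assumptions (not from the problem statement): 18 months ~= 78 weeks,
# and listings are spread uniformly across that period.
TOTAL_LISTINGS = 8_500_000
TOTAL_WEEKS = 78
HOLDOUT_WEEKS = 6

# Top 5 classes together hold 62% of the data.
top5_listings = TOTAL_LISTINGS * 62 // 100

# Each of the bottom 10 classes holds less than 0.3% of the data,
# so this is an upper bound per tail class.
tail_class_upper_bound = TOTAL_LISTINGS * 3 // 1000

# Approximate size of a "last 6 weeks" holdout under uniform arrival.
holdout_listings = TOTAL_LISTINGS * HOLDOUT_WEEKS // TOTAL_WEEKS

# Upper bound on holdout examples for one tail class (same assumption).
tail_class_holdout_upper_bound = tail_class_upper_bound * HOLDOUT_WEEKS // TOTAL_WEEKS

# Estimated number of noisy labels at the 3% and 7% ends of the range.
noisy_low = TOTAL_LISTINGS * 3 // 100
noisy_high = TOTAL_LISTINGS * 7 // 100

print(top5_listings, tail_class_upper_bound, holdout_listings,
      tail_class_holdout_upper_bound, noisy_low, noisy_high)
```

Under these assumptions a tail class has at most about 25,500 labeled examples overall and fewer than 2,000 in the holdout, so per-class metrics on the tail will be noticeably noisier than on the head classes.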

Success Criteria

Macro-F1 ≥ 0.72 on a time-based holdout (last 6 weeks)
Recall ≥ 0.90 for the 6 “high-risk” categories (regulated items), measured per-class
p95 online inference latency ≤ 120 ms per listing (text+metadata only; images are best-effort)
Must provide auditable explanations for compliance reviewers (at least at the feature/token level)
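The three quantitative criteria above can be checked mechanically once holdout predictions and latency samples exist. The following is a minimal sketch in plain Python, not a prescribed evaluation harness: the function names, the nearest-rank percentile definition, and the input shapes are all assumptions made for illustration.

```python
from collections import Counter

def per_class_prf(y_true, y_pred, labels):
    """Return {label: (precision, recall, f1)} for single-label predictions."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    scores = {}
    for c in labels:
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        scores[c] = (prec, rec, f1)
    return scores

def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1, so rare classes count as much as common ones."""
    scores = per_class_prf(y_true, y_pred, labels)
    return sum(f1 for _, _, f1 in scores.values()) / len(labels)

def p95(values):
    """Nearest-rank 95th percentile (one of several common definitions)."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[max(rank, 1) - 1]

def meets_criteria(y_true, y_pred, labels, high_risk, latencies_ms):
    """Check the stated thresholds: macro-F1, per-class high-risk recall, p95 latency."""
    scores = per_class_prf(y_true, y_pred, labels)
    return {
        "macro_f1": macro_f1(y_true, y_pred, labels) >= 0.72,
        "high_risk_recall": all(scores[c][1] >= 0.90 for c in high_risk),
        "p95_latency": p95(latencies_ms) <= 120,
    }
```

Because macro-F1 averages over classes without weighting by frequency, a model can score well on the head classes and still miss the 0.72 target if the tail classes perform poorly. The high-risk recall check is deliberately per-class, matching the criterion as written, rather than pooled across the six categories.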

Constraints

Training budget: up to 4× A100 GPUs for 48 hours (or equivalent)
Serving: CPU-only is preferred; GPU serving is allowed only if it materially improves high-risk recall
Must handle multilingual inputs and taxonomy drift (new brands, new seller behaviors)
Model refresh: at least weekly, with a rollback plan

Deliverables (what you must produce in the interview)

A recommendation: pre-trained vs from-scratch, or a hybrid, with a clear decision framework.
A proposed modeling approach (architecture + features) and how you’ll handle imbalance, missingness, and label noise.
An evaluation plan: splits, metrics, and how you’ll tune thresholds for high-risk classes.
A production plan: latency strategy, monitoring, retraining cadence, and safe rollout.
A brief risk assessment: failure modes and mitigations (policy changes, adversarial sellers, drift).
