You’re on the ML Platform team at MercuryMart, a global e-commerce marketplace with 35M monthly active users and ~120M active SKUs. A new regulatory policy in the EU requires the company to correctly classify products into a restricted set of 42 compliance categories (e.g., medical devices, children’s products, lithium batteries) to drive downstream workflows: listing eligibility, shipping restrictions, and mandatory disclosures.
Misclassification has real cost: false negatives can trigger regulatory fines and forced delistings, while false positives reduce seller revenue and harm trust. The business wants a model in production within 8 weeks, and the system must support near-real-time classification for newly created listings.
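
Because the two error types carry different costs, the decision threshold for each compliance category probably should not default to 0.5. The following is a minimal sketch, not a prescribed method, of choosing a threshold by total misclassification cost on a labeled validation set. The function names and the cost values `cost_fn` and `cost_fp` are hypothetical placeholders; real costs would have to come from legal and business stakeholders.

```python
from typing import Sequence


def total_cost(
    y_true: Sequence[int],
    scores: Sequence[float],
    threshold: float,
    cost_fn: float,
    cost_fp: float,
) -> float:
    """Total cost of thresholding `scores` for one binary (one-vs-rest) category."""
    # False negative: a restricted item (label 1) scored below the threshold.
    fn = sum(1 for y, s in zip(y_true, scores) if y == 1 and s < threshold)
    # False positive: an unrestricted item (label 0) scored at or above it.
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    return cost_fn * fn + cost_fp * fp


def pick_threshold(
    y_true: Sequence[int],
    scores: Sequence[float],
    candidates: Sequence[float],
    cost_fn: float,  # hypothetical cost of one missed restricted item
    cost_fp: float,  # hypothetical cost of one wrongly restricted item
) -> float:
    """Return the candidate threshold with the lowest total cost."""
    return min(
        candidates,
        key=lambda t: total_cost(y_true, scores, t, cost_fn, cost_fp),
    )
```

When false negatives are weighted more heavily than false positives, this procedure tends to select a lower threshold, trading seller friction for fewer missed restricted items.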
The key decision is whether to fine-tune a pre-trained model (transfer learning) or train a model from scratch. You must propose and justify an approach, including a plan to evaluate the trade-offs and deploy safely.
Each listing has a title, a description, and structured metadata. Labels come from a combination of the seller-provided category and human review, so label noise is non-trivial.
| Feature Group | Type | Approx. Columns | Examples | Notes |
|---|---|---|---|---|
| Text | unstructured | 2 | title, description | multilingual (EN/DE/FR/ES); titles average 35 tokens, descriptions average 180 tokens |
| Taxonomy metadata | categorical | 6 | seller_category_path, brand, country, seller_tier | brand is high-cardinality (~400K distinct values) |
| Numeric | numerical | 10 | price, weight, dimensions, battery_capacity_mah | missingness varies by category |
| Images (optional) | unstructured | 1-6 per item | main_image_url(s) | only 60% of listings have images at creation time |
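
One way to picture a single listing from the table above is as a typed record in which the numeric fields and images are optional. This is only an illustrative sketch: the field names follow the examples in the table, but the types, the defaults, the `Listing` class, and the `available_modalities` helper are assumptions rather than the actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Listing:
    # Text features (multilingual).
    title: str
    description: str
    # Taxonomy metadata (categorical); only the four examples from the table.
    seller_category_path: str
    brand: str
    country: str
    seller_tier: str
    # Numeric features; None models missingness, which varies by category.
    price: Optional[float] = None
    weight: Optional[float] = None
    battery_capacity_mah: Optional[float] = None
    # Images may be absent when the listing is created.
    image_urls: List[str] = field(default_factory=list)


def available_modalities(listing: Listing) -> List[str]:
    """List the input modalities a model can rely on for this listing."""
    modalities = ["text", "metadata"]
    if listing.image_urls:
        modalities.append("image")
    return modalities
```

A classifier serving new listings would need to handle the case where `available_modalities` omits `"image"`, since only 60% of listings have images at creation time.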