Business Context
Ancestry Marketing receives large volumes of inbound text from campaigns across email replies, paid social lead forms, and customer feedback channels. The team wants an NLP pipeline that automatically classifies each message so it can route genealogy interest, DNA kit support, subscription questions, and complaint-related content to the right workflow.
Data
- Volume: ~1.8M historical labeled messages, plus ~40K new messages per day
- Text length: 5-400 words, median 42 words
- Language: 94% English, 4% Spanish, 2% mixed/other
- Labels: 6 classes (DNA_Kit_Support, Subscription_Billing, Family_History_Interest, Promotion_Response, Complaint, Other)
- Distribution: Highly imbalanced; Promotion_Response and Other dominate, while Complaint is under 7% of messages
- Noise: HTML fragments, signatures, tracking text, emojis, misspellings, duplicate submissions
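The noise sources listed above suggest a cleaning and de-duplication step ahead of any model. The following is a minimal sketch using only the Python standard library. The function names, the signature markers ("--" and "Sent from my ..."), and the choice to keep emojis untouched are all illustrative assumptions, not part of the brief; a real marker list would come from inspecting the data.

```python
import hashlib
import html
import re

TAG_RE = re.compile(r"<[^>]+>")          # crude HTML tag matcher (assumption: no literal "<" in text)
URL_RE = re.compile(r"https?://\S+")     # tracking links and other URLs
WS_RE = re.compile(r"\s+")

# Hypothetical signature markers; replace with patterns observed in the data.
SIGNATURE_PREFIXES = ("sent from my",)


def clean_text(raw: str) -> str:
    """Decode HTML entities, drop tags and URLs, cut signatures, collapse whitespace."""
    text = html.unescape(raw)
    text = TAG_RE.sub(" ", text)
    text = URL_RE.sub(" ", text)
    kept = []
    for line in text.splitlines():
        stripped = line.strip()
        # Stop at the first line that looks like the start of a signature.
        if stripped == "--" or stripped.lower().startswith(SIGNATURE_PREFIXES):
            break
        kept.append(line)
    return WS_RE.sub(" ", " ".join(kept)).strip()


def dedup_key(cleaned: str) -> str:
    """Case-insensitive fingerprint used to detect duplicate submissions."""
    return hashlib.sha1(cleaned.lower().encode("utf-8")).hexdigest()


def drop_duplicates(raw_messages):
    """Return cleaned messages, keeping only the first occurrence of each."""
    seen = set()
    unique = []
    for raw in raw_messages:
        cleaned = clean_text(raw)
        key = dedup_key(cleaned)
        if cleaned and key not in seen:
            seen.add(key)
            unique.append(cleaned)
    return unique
```

Exact-match hashing only catches verbatim resubmissions after normalization; near-duplicates (for example, the same complaint with one word changed) would need a fuzzier technique, which this sketch does not attempt. De-duplicating before the train/validation/test split also helps avoid leaking the same message across splits.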
Success Criteria
A production-ready solution should achieve macro-F1 >= 0.84 and Complaint recall >= 0.90, and should support batch or near-real-time scoring for Ancestry Marketing routing with p95 inference latency under 120 ms per message.
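To make the three targets concrete, here is a small standard-library sketch of how they could be checked on a held-out set. The function names and the nearest-rank definition of p95 are assumptions made for illustration; the thresholds (0.84, 0.90, 120 ms) are taken from the criteria above.

```python
import math
from collections import Counter


def per_class_scores(y_true, y_pred, labels):
    """Precision, recall, and F1 for each label, from true/predicted label lists."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1
            fn[truth] += 1
    scores = {}
    for label in labels:
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        pr_sum = precision + recall
        f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
        scores[label] = {"precision": precision, "recall": recall, "f1": f1}
    return scores


def p95(latencies_ms):
    """Nearest-rank 95th percentile (one of several common definitions)."""
    ordered = sorted(latencies_ms)
    rank = math.ceil(95 * len(ordered) / 100)
    return ordered[rank - 1]


def check_success_criteria(y_true, y_pred, latencies_ms, labels):
    """Compare held-out results against the stated targets."""
    scores = per_class_scores(y_true, y_pred, labels)
    macro_f1 = sum(s["f1"] for s in scores.values()) / len(labels)
    complaint_recall = scores["Complaint"]["recall"]
    latency_p95 = p95(latencies_ms)
    return {
        "macro_f1": macro_f1,
        "complaint_recall": complaint_recall,
        "p95_latency_ms": latency_p95,
        "passed": macro_f1 >= 0.84 and complaint_recall >= 0.90 and latency_p95 < 120,
    }
```

Because macro-F1 averages the per-class F1 scores with equal weight, a weak minority class such as Complaint pulls the metric down as much as a weak majority class would, which is why it suits the imbalanced label distribution described in the Data section.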
Constraints
- Must run in Ancestry’s secure environment; customer text cannot be sent to external APIs
- Solution should be explainable enough for marketing operations review
- Daily retraining should be feasible on a single GPU, with a CPU-heavy fallback path
Requirements
- Design an end-to-end NLP pipeline for multi-class text classification.
- Define preprocessing for noisy marketing text, multilingual edge cases, and duplicate handling.
- Implement a strong baseline and a transformer-based production candidate in Python.
- Explain how you would address class imbalance, thresholding, and label drift.
- Describe your training, validation, and test strategy.
- Specify monitoring, error analysis, and deployment considerations for Ancestry Marketing.
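As one illustration of the thresholding requirement, the sketch below picks a decision threshold for the Complaint class from validation scores so that the recall target is met. It assumes a model that outputs a per-message Complaint probability; the function names and the rule "flag as Complaint when the score is at or above the threshold" are hypothetical choices for this example, not a prescribed design.

```python
import math


def pick_complaint_threshold(scores, is_complaint, target_recall=0.90):
    """Highest threshold whose recall on the validation set meets the target.

    scores: model probabilities for the Complaint class (assumed available).
    is_complaint: parallel booleans, True where the true label is Complaint.
    """
    positives = sorted((s for s, y in zip(scores, is_complaint) if y), reverse=True)
    if not positives:
        raise ValueError("validation set contains no Complaint examples")
    # Number of true complaints that must score at or above the threshold.
    needed = math.ceil(target_recall * len(positives))
    return positives[needed - 1]


def recall_and_precision(scores, is_complaint, threshold):
    """Recall and precision when flagging every score >= threshold as Complaint."""
    tp = sum(1 for s, y in zip(scores, is_complaint) if y and s >= threshold)
    fp = sum(1 for s, y in zip(scores, is_complaint) if not y and s >= threshold)
    fn = sum(1 for s, y in zip(scores, is_complaint) if y and s < threshold)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return recall, precision
```

Choosing the highest qualifying threshold keeps precision as high as the recall target allows, which limits how many non-complaints are routed to the complaint workflow. The threshold should be chosen on validation data and confirmed on the test set, and it would need to be revisited whenever the score distribution drifts.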