Classify Ancestry Marketing Messages

Business Context

Ancestry Marketing receives large volumes of inbound text from campaigns across email replies, paid social lead forms, and customer feedback channels. The team wants an NLP pipeline that automatically classifies each message so it can route genealogy interest, DNA kit support, subscription questions, and complaint-related content to the right workflow.

Data

Volume: ~1.8M historical labeled messages, plus ~40K new messages per day
Text length: 5-400 words, median 42 words
Language: 94% English, 4% Spanish, 2% mixed/other
Labels: 6 classes — DNA_Kit_Support, Subscription_Billing, Family_History_Interest, Promotion_Response, Complaint, Other
Distribution: Highly imbalanced; Promotion_Response and Other dominate, while Complaint is under 7%
Noise: HTML fragments, signatures, tracking text, emojis, misspellings, duplicate submissions
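
The noise sources listed above suggest a small cleaning and deduplication step before any modeling. The following is a minimal sketch using only the standard library. The function names (`clean_message`, `dedupe`), the `URL` placeholder token, and the signature patterns are illustrative assumptions, not part of the problem statement, and real signature or tracking-text rules would need to be derived from the actual data.

```python
import hashlib
import html
import re

TAG_RE = re.compile(r"<[^>]+>")          # leftover HTML tags
URL_RE = re.compile(r"https?://\S+")     # tracking links and other URLs
WS_RE = re.compile(r"\s+")               # any run of whitespace
# Assumed signature markers: a line containing only "--", or "Sent from my ..."
SIGNATURE_RE = re.compile(r"(?im)^\s*(--\s*$|sent from my .*$)")


def clean_message(text: str) -> str:
    """Strip HTML, URLs, and a trailing signature, then normalize whitespace."""
    text = html.unescape(text)            # e.g. "&nbsp;" becomes a real character
    text = TAG_RE.sub(" ", text)
    text = URL_RE.sub(" URL ", text)      # keep a placeholder so the model sees a link existed
    match = SIGNATURE_RE.search(text)
    if match:
        text = text[:match.start()]       # drop everything from the signature onward
    return WS_RE.sub(" ", text).strip()


def dedupe(messages):
    """Return cleaned messages, keeping the first of each case-insensitive duplicate."""
    seen = set()
    unique = []
    for raw in messages:
        cleaned = clean_message(raw)
        key = hashlib.sha256(cleaned.lower().encode("utf-8")).hexdigest()
        if cleaned and key not in seen:
            seen.add(key)
            unique.append(cleaned)
    return unique
```

Exact-match hashing only catches verbatim resubmissions after cleaning; near-duplicates (for example, the same complaint with one typo fixed) would need a fuzzier method, which this sketch does not attempt.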

Success Criteria

A production-ready solution should achieve macro-F1 >= 0.84, Complaint recall >= 0.90, and support batch or near-real-time scoring for Ancestry Marketing routing with p95 inference latency under 120 ms per message.
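
To make these thresholds concrete, the sketch below checks all three criteria on a held-out set using plain Python. The helper names and the nearest-rank definition of p95 are assumptions made for illustration; in practice a library such as scikit-learn would typically compute the classification metrics.

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 over the given labels."""
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def recall(y_true, y_pred, label):
    """Fraction of true `label` examples that were predicted as `label`."""
    support = sum(t == label for t in y_true)
    hits = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    return hits / support if support else 0.0


def p95(values):
    """Nearest-rank 95th percentile (one of several common definitions)."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100   # integer ceil(0.95 * n)
    return ordered[rank - 1]


def meets_success_criteria(y_true, y_pred, latencies_ms, labels):
    """True only if macro-F1, Complaint recall, and p95 latency all pass."""
    return (
        macro_f1(y_true, y_pred, labels) >= 0.84
        and recall(y_true, y_pred, "Complaint") >= 0.90
        and p95(latencies_ms) < 120
    )
```

Because macro-F1 weights every class equally, a weak minority class such as Complaint pulls the score down far more than it would under accuracy or micro-F1, which is why it suits the imbalanced distribution described in the Data section.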

Constraints

Must run in Ancestry’s secure environment; customer text cannot be sent to external APIs
Solution should be explainable enough for marketing operations review
Daily retraining should be feasible on a single GPU, with a CPU-heavy fallback path

Requirements

Design an end-to-end NLP pipeline for multi-class text classification.
Define preprocessing for noisy marketing text, multilingual edge cases, and duplicate handling.
Implement a strong baseline and a transformer-based production candidate in Python.
Explain how you would address class imbalance, thresholding, and label drift.
Describe your training, validation, and test strategy.
Specify monitoring, error analysis, and deployment considerations for Ancestry Marketing.
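
Two of the requirements above, thresholding and label drift, can be sketched without committing to a particular model. The first function picks the highest Complaint score threshold that still reaches a target recall on validation data; the second computes a population stability index (PSI) between a reference and a current label distribution. Function names, the `eps` smoothing value, and the use of PSI as the drift measure are assumptions for illustration, not something the problem statement prescribes.

```python
import math
from collections import Counter


def complaint_threshold(scores, is_complaint, target_recall=0.90):
    """Highest threshold t such that predicting Complaint when score >= t
    recovers at least `target_recall` of the true complaints."""
    positives = sorted(
        (s for s, y in zip(scores, is_complaint) if y), reverse=True
    )
    if not positives:
        raise ValueError("validation set contains no Complaint examples")
    # Number of true complaints that must score at or above the threshold.
    needed = max(1, math.ceil(target_recall * len(positives) - 1e-9))
    return positives[needed - 1]


def label_psi(reference_labels, current_labels, classes, eps=1e-6):
    """Population stability index between two label distributions.
    0.0 means identical proportions; larger values mean a bigger shift."""
    ref = Counter(reference_labels)
    cur = Counter(current_labels)
    psi = 0.0
    for c in classes:
        expected = max(ref[c] / len(reference_labels), eps)  # eps avoids log(0)
        actual = max(cur[c] / len(current_labels), eps)
        psi += (actual - expected) * math.log(actual / expected)
    return psi
```

Lowering the Complaint threshold this way trades precision for recall, so the chosen value should be checked against macro-F1 on the same validation split. For drift, a frequently cited rule of thumb treats PSI above roughly 0.2 as a notable shift, but that cutoff is a convention to validate against this data rather than a guarantee.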
