Business Context
Ramp ingests large volumes of raw merchant descriptions from card transactions, receipts, and memo fields. The marketing analytics team wants a tokenization strategy that turns this noisy text into reliable features for downstream NLP tasks such as merchant categorization, campaign analysis, and spend-intent classification.
Data
- Volume: ~2 million historical transaction text records, with ~80,000 new records per day
- Text sources: Merchant descriptors, receipt line items, user-entered memos, and category notes from Ramp transactions
- Text length: 3-120 tokens per record; median length is 14 tokens
- Language: Primarily English, with some multilingual merchant names and abbreviations
- Label distribution: for downstream classification, categories are imbalanced (e.g., travel, software, meals, office supplies, other)
- Noise: All caps, punctuation, IDs, timestamps, card suffixes, repeated whitespace, OCR artifacts, and inconsistent abbreviations
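Several of the noise types listed above (card suffixes, timestamps, opaque IDs, stray punctuation, repeated whitespace) are regular enough to strip with a normalization pass before tokenization. A minimal sketch, where the regex patterns and the `normalize` function are illustrative assumptions rather than a prescribed implementation:

```python
import re

# Illustrative patterns for the noise types listed above; real descriptors vary,
# and OCR artifacts or abbreviations would need additional handling.
CARD_SUFFIX = re.compile(r"[x*]{2,}\d{4}\b|\bending in \d{4}\b", re.I)
TIMESTAMP = re.compile(r"\b\d{1,2}[:/]\d{2}(?:[:/]\d{2,4})?\b")
OPAQUE_ID = re.compile(r"\b(?=\w*\d)[a-z0-9]{8,}\b", re.I)  # long alnum runs with a digit
NON_TOKEN = re.compile(r"[^a-z0-9&'\- ]")  # keep chars common in merchant names ("7-eleven")
MULTISPACE = re.compile(r"\s+")

def normalize(text: str) -> str:
    """Lowercase and strip transaction-specific noise, keeping merchant words."""
    text = text.lower()
    text = CARD_SUFFIX.sub(" ", text)
    text = TIMESTAMP.sub(" ", text)
    text = OPAQUE_ID.sub(" ", text)
    text = NON_TOKEN.sub(" ", text)
    return MULTISPACE.sub(" ", text).strip()

normalize("UBER *TRIP 7X2K9QW8AB 06/14 CARD ****1234")  # → "uber trip card"
```

Requiring a digit in `OPAQUE_ID` avoids deleting long but meaningful merchant words, which matters for the constraint against overly aggressive normalization.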
Success Criteria
A good solution should preserve business-relevant terms (for example, merchant names, product words, and spend signals), reduce noise, and improve downstream model quality versus naive whitespace splitting. The pipeline should be reproducible and easy to maintain.
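One way to make the "versus naive whitespace splitting" baseline concrete: store numbers and punctuation fragments inflate the vocabulary under naive splitting, fragmenting what should be a single merchant feature. A toy comparison, with an invented corpus and a deliberately simple cleanup rule:

```python
import re
from collections import Counter

# Invented descriptors for one merchant chain; only the store number varies.
corpus = [
    "SQ *BLUE BOTTLE #0412 OAKLAND",
    "SQ *BLUE BOTTLE #0977 SF",
    "SQ *BLUE BOTTLE #1203 NYC",
]

# Naive whitespace splitting: store numbers make many tokens unique.
naive = Counter(t for line in corpus for t in line.split())

# Stripping digits and punctuation first collapses the merchant tokens.
cleaned = Counter(
    t for line in corpus
    for t in re.sub(r"[#*\d]+", " ", line.lower()).split()
)

print(len(naive), len(cleaned))  # 9 distinct types vs 6
```

Tracking vocabulary size and token overlap across near-duplicate descriptors like this is a cheap intrinsic check before any downstream model is trained.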
Constraints
- Must support batch processing and low-latency inference for new Ramp transactions
- Should be interpretable enough for analytics stakeholders to inspect tokens
- Avoid overly aggressive normalization that removes merchant-specific meaning
Requirements
- Design a tokenization pipeline for noisy Ramp transaction text.
- Explain when you would use word, subword, or character-level tokenization.
- Implement preprocessing and tokenization in Python.
- Show how tokenized output would feed a simple downstream classifier.
- Describe how you would evaluate whether the tokenization strategy is effective.
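The requirements above could be prototyped end to end with scikit-learn. In this sketch the `tokenize` function, the toy training rows, and the category labels are illustrative assumptions, not Ramp data or a prescribed solution:

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def tokenize(text: str) -> list[str]:
    """Word-level tokenizer: lowercase, drop digits/punctuation, split on whitespace."""
    return re.sub(r"[^a-z&'\- ]", " ", text.lower()).split()

# Tiny invented training set; real labels would come from Ramp's categories.
texts = [
    "UNITED AIRLINES TICKET 0167", "DELTA AIR BAG FEE",
    "GITHUB.COM SUBSCRIPTION", "ATLASSIAN JIRA MONTHLY",
    "CHIPOTLE 1099 ONLINE", "SWEETGREEN NYC",
]
labels = ["travel", "travel", "software", "software", "meals", "meals"]

# Custom tokenizer feeds TF-IDF features into a simple linear classifier.
clf = make_pipeline(
    TfidfVectorizer(tokenizer=tokenize, token_pattern=None),
    LogisticRegression(max_iter=1000),
)
clf.fit(texts, labels)
print(clf.predict(["DELTA AIRLINES 4421"]))
```

A word-level tokenizer like this keeps tokens interpretable for analytics stakeholders, which the constraints ask for; a subword tokenizer (e.g., BPE) would be the fallback where unseen merchant strings and abbreviations make whole-word vocabularies too sparse.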