Business Context
BlackRock’s Aladdin platform ingests large volumes of internal research notes, issuer commentary, and market updates. You need to use word embeddings to build a language understanding system that automatically classifies each document into one of several investment themes so analysts can search and route content more efficiently.
Data
- Volume: 850,000 historical documents, with ~120,000 labeled for supervised training
- Text length: 30-900 words per document, median 180 words
- Language: Primarily English financial text with ticker symbols, issuer names, macro terms, and abbreviations
- Labels: 6 themes: credit-risk, equities, macro, esg, liquidity, operations
- Distribution: Imbalanced; macro and equities make up ~55% of labeled data, while liquidity is under 8%
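
Given this skew, it is worth verifying the distribution up front and holding out a stratified evaluation split so minority themes survive into evaluation. A minimal sketch, assuming the labeled subset is a pandas DataFrame with hypothetical columns `text` and `theme` and a hypothetical file path:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical path/schema for the ~120,000 labeled documents.
labeled = pd.read_parquet("labeled_docs.parquet")

# Confirm the imbalance described above (macro/equities ~55%, liquidity <8%).
print(labeled["theme"].value_counts(normalize=True))

train_df, eval_df = train_test_split(
    labeled,
    test_size=0.15,
    stratify=labeled["theme"],  # keep minority themes represented in eval
    random_state=42,
)
```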
Success Criteria
A good solution should achieve macro-F1 >= 0.82, recall >= 0.78 on minority classes, and support batch inference for 100,000 daily documents. The approach should show how embeddings improve semantic understanding beyond bag-of-words baselines.
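
These thresholds translate directly into an automated evaluation gate. A sketch using scikit-learn; treating esg and operations as minority classes alongside liquidity is an assumption to confirm against the actual label counts:

```python
from sklearn.metrics import classification_report, f1_score, recall_score

THEMES = ["credit-risk", "equities", "macro", "esg", "liquidity", "operations"]
# Assumption: minority set beyond liquidity; verify against label counts.
MINORITY = ["liquidity", "esg", "operations"]

def meets_success_criteria(y_true, y_pred) -> bool:
    macro_f1 = f1_score(y_true, y_pred, average="macro")
    per_class_recall = recall_score(
        y_true, y_pred, labels=THEMES, average=None, zero_division=0
    )
    recall_by_theme = dict(zip(THEMES, per_class_recall))
    print(classification_report(y_true, y_pred, zero_division=0))
    # Gate: macro-F1 >= 0.82 and recall >= 0.78 on every minority theme.
    return macro_f1 >= 0.82 and all(
        recall_by_theme[t] >= 0.78 for t in MINORITY
    )
```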
Constraints
- Must run in BlackRock’s controlled environment with no external API calls
- Inference should remain practical on a single GPU or CPU batch job
- The solution should be explainable enough for model risk review
Requirements
- Design an embedding-based NLP pipeline for multi-class document understanding.
- Explain how you would preprocess financial text, including tickers, numbers, and domain abbreviations (a preprocessing sketch follows this list).
- Compare at least one static embedding approach with one contextual embedding approach (see the comparison sketch below).
- Implement a modern Python solution for training and evaluation.
- Describe how you would handle class imbalance and out-of-vocabulary financial terms (a class-weighting sketch follows below).
- Propose an error analysis plan for confusing themes such as macro vs. credit-risk (see the confusion-matrix sketch below).
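
For the preprocessing requirement, a minimal sketch is below. The 2-to-5 uppercase-letter ticker heuristic and the tiny abbreviation map are assumptions; a production version would resolve tickers against an internal security master and expand abbreviations from a maintained glossary:

```python
import re

# Assumption: a small, analyst-maintained expansion map; real coverage
# would come from an internal glossary.
ABBREVIATIONS = {"fomc": "federal open market committee", "hy": "high yield"}

TICKER_RE = re.compile(r"\b[A-Z]{2,5}\b")        # crude heuristic; also hits GDP, CEO
NUMBER_RE = re.compile(r"\b\d+(?:[.,]\d+)?%?")   # figures and percentages

def preprocess(text: str) -> str:
    # Map tickers to a shared placeholder so the model generalizes
    # across issuers instead of memorizing symbols.
    text = TICKER_RE.sub("<ticker>", text)
    # Collapse raw numbers; exact magnitudes rarely drive theme assignment.
    text = NUMBER_RE.sub("<num>", text)
    # Lowercase, then expand known abbreviations token by token.
    tokens = [ABBREVIATIONS.get(tok, tok) for tok in text.lower().split()]
    return " ".join(tokens)
```

For the static-vs-contextual comparison, the sketch below contrasts an averaged word2vec document vector with mean-pooled transformer hidden states. It assumes gensim and Hugging Face transformers are installed and that `distilbert-base-uncased` is mirrored inside the controlled environment, so no external API calls are made at run time:

```python
import numpy as np
import torch
from gensim.models import Word2Vec
from transformers import AutoModel, AutoTokenizer

def static_doc_vector(tokens: list[str], w2v: Word2Vec) -> np.ndarray:
    # Average in-vocabulary word vectors; OOV terms are simply skipped,
    # which is one reason contextual subword models tend to win here.
    vecs = [w2v.wv[t] for t in tokens if t in w2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)

# Assumption: checkpoint is available from an internal model mirror.
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
encoder = AutoModel.from_pretrained("distilbert-base-uncased")

@torch.no_grad()
def contextual_doc_vector(text: str) -> np.ndarray:
    batch = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state      # (1, seq_len, 768)
    mask = batch["attention_mask"].unsqueeze(-1)     # (1, seq_len, 1)
    pooled = (hidden * mask).sum(1) / mask.sum(1)    # mean over real tokens only
    return pooled.squeeze(0).numpy()
```

For class imbalance, inverse-frequency class weighting on a linear classifier over the document vectors is a defensible, reviewable baseline; out-of-vocabulary financial terms are mitigated on the contextual path because subword tokenization decomposes unseen tokens rather than dropping them. A sketch, assuming `X_train` and `y_train` are the pooled document vectors and theme labels derived from the stratified split above:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# class_weight="balanced" reweights each theme by inverse frequency,
# so liquidity (<8% of labels) is not drowned out by macro/equities.
clf = LogisticRegression(max_iter=2000, class_weight="balanced")
clf.fit(X_train, y_train)  # X_train/y_train: vectors and labels from above

# Equivalent explicit weights, useful to document for model risk review.
weights = compute_class_weight(
    class_weight="balanced", classes=np.unique(y_train), y=y_train
)
print(dict(zip(np.unique(y_train), np.round(weights, 2))))
```

For the error analysis plan, the sketch below builds a labeled confusion matrix and pulls the macro/credit-risk confusions for manual review; the `theme` and `pred` columns on `eval_df` are assumptions carried over from the split sketch:

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

THEMES = ["credit-risk", "equities", "macro", "esg", "liquidity", "operations"]

cm = pd.DataFrame(
    confusion_matrix(eval_df["theme"], eval_df["pred"], labels=THEMES),
    index=THEMES,
    columns=THEMES,
)
print(cm)  # off-diagonal macro/credit-risk cells flag the confusable pair

# Pull the confused documents for manual review: duration and spread
# language often straddles both themes, so sample and read them.
confused = eval_df[
    (eval_df["theme"] == "macro") & (eval_df["pred"] == "credit-risk")
    | (eval_df["theme"] == "credit-risk") & (eval_df["pred"] == "macro")
]
print(confused["text"].sample(min(20, len(confused)), random_state=0))
```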