Business Context
Meta wants to improve lightweight text understanding for Facebook Feed ranking features derived from post text and comment context. You are asked to explain self-attention mathematically, implement it correctly, and reason about whether it can meet production latency constraints.
Dataset
You are given a supervised learning dataset built from public-post text embeddings and engagement labels used for an offline ranking proxy.
| Feature Group | Count | Examples |
|---|---|---|
| Token IDs | 1 sequence | WordPiece token IDs for post text, hashtags, and short comment context |
| Attention masks | 1 sequence | valid-token mask, padding mask |
| Dense metadata | 12 | language id, device family, author follower bucket, post age bucket |
| Target | 1 | binary high-engagement label |
- Size: 2.4M examples, max sequence length 256, vocabulary size 50K
- Target: Binary — high engagement in the next 24 hours (1) vs not high engagement (0)
- Class balance: 18% positive, 82% negative
- Missing data: ~6% missing in metadata features; text is always present, but sequence lengths vary widely
Success Criteria
A strong solution should:
- derive the self-attention equations clearly, including query, key, value projections and softmax normalization (a worked formulation follows this list)
- explain why scaled dot-product attention uses the factor $1/\sqrt{d_k}$
- quantify time and memory complexity with respect to sequence length and hidden size
- implement a correct attention module and train a small classifier baseline
- discuss when self-attention becomes impractical for long Facebook Feed sequences and what approximations are reasonable
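For orientation, one standard way to lay out the derivation, scaling argument, and complexity these criteria ask for, using the usual notation (sequence length $n$, model width $d_{\text{model}}$, per-head key/value widths $d_k$, $d_v$):

$$
Q = XW^Q,\quad K = XW^K,\quad V = XW^V,\qquad
X \in \mathbb{R}^{n \times d_{\text{model}}},\;
W^Q, W^K \in \mathbb{R}^{d_{\text{model}} \times d_k},\;
W^V \in \mathbb{R}^{d_{\text{model}} \times d_v}
$$

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V \;\in\; \mathbb{R}^{n \times d_v}
$$

with the softmax applied row-wise over the $n$ key positions. The $1/\sqrt{d_k}$ factor follows from a variance argument: if the components of a query $q$ and key $k$ are independent with zero mean and unit variance, then $q \cdot k = \sum_{i=1}^{d_k} q_i k_i$ has variance $d_k$, so scaling by $1/\sqrt{d_k}$ keeps the logits near unit variance and avoids pushing the softmax toward near-one-hot outputs with vanishing gradients. For complexity, the score matrix $QK^\top$ costs $O(n^2 d_k)$ time and $O(n^2)$ memory per head, while the projections cost $O(n\, d_{\text{model}}\, d_k)$; with $h$ heads of width $d_k = d_{\text{model}}/h$, the multi-head total is $O(n^2 d_{\text{model}} + n\, d_{\text{model}}^2)$, so the quadratic-in-$n$ term dominates once $n$ exceeds $d_{\text{model}}$.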
Constraints
- P95 online inference budget for the text encoder is under 20 ms per batch on a single production GPU
- The solution should support variable-length sequences with padding masks (a minimal masking sketch follows this list)
- Researchers care about correctness and scaling behavior more than leaderboard performance
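As a starting point for the masking constraint, here is a minimal PyTorch sketch of scaled dot-product attention with a padding mask. The tensor shapes and the convention that `pad_mask` is `True` at valid tokens are assumptions for illustration, not part of the problem statement.

```python
import math
import torch
import torch.nn.functional as F

def masked_attention(q, k, v, pad_mask):
    """Scaled dot-product attention with a key padding mask.

    q, k, v: (batch, seq_len, d_k) projected queries/keys/values.
    pad_mask: (batch, seq_len) bool, True where the token is valid.
    """
    d_k = q.size(-1)
    # (batch, seq_len, seq_len) attention logits, scaled by 1/sqrt(d_k)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    # Block attention *to* padded key positions before the softmax
    scores = scores.masked_fill(~pad_mask.unsqueeze(1), float("-inf"))
    weights = F.softmax(scores, dim=-1)
    return weights @ v
```

Rows corresponding to padded query positions still produce outputs; they should be dropped downstream (e.g. by masked pooling) rather than relied on.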
Deliverables
- Derive scaled dot-product self-attention mathematically, including tensor dimensions.
- Analyze computational and memory complexity for single-head and multi-head attention.
- Implement a PyTorch attention-based classifier with masking (an illustrative sketch follows this list).
- Evaluate against a mean-pooled embedding baseline.
- Recommend whether this architecture is suitable for Facebook Feed ranking under the stated latency constraints.
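A minimal sketch of what the classifier and the mean-pooled baseline comparison could look like; the module choices, hidden sizes, and pooling strategy below are illustrative assumptions, not a prescribed architecture.

```python
import torch
import torch.nn as nn

class AttentionClassifier(nn.Module):
    """Single self-attention layer + masked mean pooling + linear head (illustrative)."""
    def __init__(self, vocab_size=50_000, d_model=128, n_heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model, padding_idx=0)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.head = nn.Linear(d_model, 1)

    def forward(self, token_ids, pad_mask):
        # pad_mask: (batch, seq_len) bool, True where the token is valid
        x = self.embed(token_ids)
        # key_padding_mask expects True at positions to *ignore*, hence the negation
        x, _ = self.attn(x, x, x, key_padding_mask=~pad_mask)
        # Masked mean pooling over valid positions only
        denom = pad_mask.sum(dim=1, keepdim=True).clamp(min=1).float()
        pooled = (x * pad_mask.unsqueeze(-1).float()).sum(dim=1) / denom
        return self.head(pooled).squeeze(-1)  # engagement logit

class MeanPoolBaseline(nn.Module):
    """Masked mean of token embeddings, no attention (illustrative baseline)."""
    def __init__(self, vocab_size=50_000, d_model=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model, padding_idx=0)
        self.head = nn.Linear(d_model, 1)

    def forward(self, token_ids, pad_mask):
        x = self.embed(token_ids)
        denom = pad_mask.sum(dim=1, keepdim=True).clamp(min=1).float()
        pooled = (x * pad_mask.unsqueeze(-1).float()).sum(dim=1) / denom
        return self.head(pooled).squeeze(-1)
```

Both models emit a single logit, so they can be trained with `nn.BCEWithLogitsLoss` (optionally with a positive-class weight for the 18% / 82% imbalance) and compared on the same offline ranking proxy.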