Business Context
Meta wants to improve lightweight text understanding for Facebook Feed ranking features derived from post text and comment context. You are asked to explain self-attention mathematically, implement it correctly, and reason about whether it can meet production latency constraints.
Dataset
You are given a supervised learning dataset built from public-post text embeddings and engagement labels used for an offline ranking proxy.
| Feature Group | Count | Examples |
|---|---|---|
| Token IDs | 1 sequence | WordPiece token IDs for post text, hashtags, and short comment context |
| Attention masks | 1 sequence | valid-token mask, padding mask |
| Dense metadata | 12 | language id, device family, author follower bucket, post age bucket |
| Target | 1 | binary high-engagement label |
- Size: 2.4M examples, max sequence length 256, vocabulary size 50K
- Target: Binary — high engagement in the next 24 hours (1) vs not high engagement (0)
- Class balance: 18% positive, 82% negative
- Missing data: ~6% missing in metadata features; text is always present, but sequence lengths vary widely
Success Criteria
A strong solution should:
- derive the self-attention equations clearly, including query, key, value projections and softmax normalization (a worked formulation follows this list)
- explain why scaled dot-product attention uses the factor $1/\sqrt{d_k}$
- quantify time and memory complexity with respect to sequence length and hidden size
- implement a correct attention module and train a small classifier baseline
- discuss when self-attention becomes impractical for long Facebook Feed sequences and what approximations are reasonable
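For orientation, one standard way to lay out the derivation, scaling argument, and complexity these criteria ask for, using the usual notation (sequence length $n$, model width $d_{\text{model}}$, per-head key/value widths $d_k$, $d_v$):

$$
Q = XW^Q,\quad K = XW^K,\quad V = XW^V,\qquad
X \in \mathbb{R}^{n \times d_{\text{model}}},\;
W^Q, W^K \in \mathbb{R}^{d_{\text{model}} \times d_k},\;
W^V \in \mathbb{R}^{d_{\text{model}} \times d_v}
$$

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V \;\in\; \mathbb{R}^{n \times d_v}
$$

with the softmax applied row-wise over the $n$ key positions. The $1/\sqrt{d_k}$ factor follows from a variance argument: if the components of a query $q$ and key $k$ are independent with zero mean and unit variance, then $q \cdot k = \sum_{i=1}^{d_k} q_i k_i$ has variance $d_k$, so scaling by $1/\sqrt{d_k}$ keeps the logits near unit variance and avoids pushing the softmax toward near-one-hot outputs with vanishing gradients. For complexity, the score matrix $QK^\top$ costs $O(n^2 d_k)$ time and $O(n^2)$ memory per head, while the projections cost $O(n\, d_{\text{model}}\, d_k)$; with $h$ heads of width $d_k = d_{\text{model}}/h$, the multi-head total is $O(n^2 d_{\text{model}} + n\, d_{\text{model}}^2)$, so the quadratic-in-$n$ term dominates once $n$ exceeds $d_{\text{model}}$.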
Constraints
- P95 online inference budget for the text encoder is under 20 ms per batch on a single production GPU
- The solution should support variable-length sequences with padding masks (a minimal masking sketch follows this list)
- Researchers care about correctness and scaling behavior more than leaderboard performance
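As a starting point for the masking constraint, here is a minimal PyTorch sketch of scaled dot-product attention with a padding mask. The tensor shapes and the convention that `pad_mask` is `True` at valid tokens are assumptions for illustration, not part of the problem statement.

```python
import math
import torch
import torch.nn.functional as F

def masked_attention(q, k, v, pad_mask):
    """Scaled dot-product attention with a key padding mask.

    q, k, v: (batch, seq_len, d_k) projected queries/keys/values.
    pad_mask: (batch, seq_len) bool, True where the token is valid.
    """
    d_k = q.size(-1)
    # (batch, seq_len, seq_len) attention logits, scaled by 1/sqrt(d_k)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    # Block attention *to* padded key positions before the softmax
    scores = scores.masked_fill(~pad_mask.unsqueeze(1), float("-inf"))
    weights = F.softmax(scores, dim=-1)
    return weights @ v
```

Rows corresponding to padded query positions still produce outputs; they should be dropped downstream (e.g. by masked pooling) rather than relied on.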
Deliverables
- Derive scaled dot-product self-attention mathematically, including tensor dimensions.
- Analyze computational and memory complexity for single-head and multi-head attention.
- Implement a PyTorch attention-based classifier with masking (an illustrative sketch follows this list).
- Evaluate against a mean-pooled embedding baseline.
- Recommend whether this architecture is suitable for Facebook Feed ranking under the stated latency constraints.
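A minimal sketch of what the classifier and the mean-pooled baseline comparison could look like; the module choices, hidden sizes, and pooling strategy below are illustrative assumptions, not a prescribed architecture.

```python
import torch
import torch.nn as nn

class AttentionClassifier(nn.Module):
    """Single self-attention layer + masked mean pooling + linear head (illustrative)."""
    def __init__(self, vocab_size=50_000, d_model=128, n_heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model, padding_idx=0)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.head = nn.Linear(d_model, 1)

    def forward(self, token_ids, pad_mask):
        # pad_mask: (batch, seq_len) bool, True where the token is valid
        x = self.embed(token_ids)
        # key_padding_mask expects True at positions to *ignore*, hence the negation
        x, _ = self.attn(x, x, x, key_padding_mask=~pad_mask)
        # Masked mean pooling over valid positions only
        denom = pad_mask.sum(dim=1, keepdim=True).clamp(min=1).float()
        pooled = (x * pad_mask.unsqueeze(-1).float()).sum(dim=1) / denom
        return self.head(pooled).squeeze(-1)  # engagement logit

class MeanPoolBaseline(nn.Module):
    """Masked mean of token embeddings, no attention (illustrative baseline)."""
    def __init__(self, vocab_size=50_000, d_model=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model, padding_idx=0)
        self.head = nn.Linear(d_model, 1)

    def forward(self, token_ids, pad_mask):
        x = self.embed(token_ids)
        denom = pad_mask.sum(dim=1, keepdim=True).clamp(min=1).float()
        pooled = (x * pad_mask.unsqueeze(-1).float()).sum(dim=1) / denom
        return self.head(pooled).squeeze(-1)
```

Both models emit a single logit, so they can be trained with `nn.BCEWithLogitsLoss` (optionally with a positive-class weight for the 18% / 82% imbalance) and compared on the same offline ranking proxy.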