
You are working with a transformer model and need to explain how self-attention turns token embeddings into contextual representations. The discussion should stay at the level of the math and the tensor operations, not product or data-pipeline concerns.
How would you explain the self-attention mechanism mathematically?
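
As a reference point for the level of detail intended, here is a minimal sketch of single-head scaled dot-product self-attention, Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, where Q = X W_Q, K = X W_K, and V = X W_V are projections of the same embedding matrix X. It uses plain Python lists so the tensor operations stay visible. The helper names (`matmul`, `transpose`, `softmax`, `self_attention`), the toy embeddings, and the identity projection matrices are illustrative assumptions; masking, multiple heads, and the output projection are left out.

```python
import math

def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both nested lists."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def transpose(a):
    """Swap rows and columns of a nested-list matrix."""
    return [list(col) for col in zip(*a)]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention (illustrative sketch).

    x:   (n, d_model) token embeddings
    w_q: (d_model, d_k) query projection
    w_k: (d_model, d_k) key projection
    w_v: (d_model, d_v) value projection
    Returns (output, weights) with shapes (n, d_v) and (n, n).
    """
    q = matmul(x, w_q)                # queries: what each token looks for
    k = matmul(x, w_k)                # keys: what each token offers
    v = matmul(x, w_v)                # values: the content that gets mixed
    d_k = len(w_k[0])
    scores = matmul(q, transpose(k))  # (n, n) pairwise dot products
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]
    weights = [softmax(row) for row in scaled]  # each row sums to 1
    output = matmul(weights, v)       # each output row is a weighted average of value rows
    return output, weights

# Toy input: three tokens with 2-dimensional embeddings (assumed values).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]   # assumed projections, chosen for readability
out, attn = self_attention(x, identity, identity, identity)
```

Each row of `attn` is a probability distribution over the input tokens, so each row of `out` is a convex combination of the value vectors; this is the sense in which an output representation is "contextual".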