🤖 Machine Learning / AI
Advanced
What is the Transformer self-attention mechanism in detail?
Answer
Self-attention computes attention within a single sequence. Each token is projected into three vectors: Query (Q), Key (K), and Value (V). Attention scores are computed as softmax(QKᵀ / √d_k) × V, where d_k is the key dimension (the scaling prevents vanishing gradients from large dot products). Multi-head attention runs h parallel attention functions with different learned projections, then concatenates and projects the results, allowing the model to attend to information from different representation subspaces simultaneously. This is why Transformers outperform RNNs on long sequences.