What is positional encoding in Transformers?
Answer
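Unlike RNNs, which consume tokens one at a time, Transformers process all tokens in parallel, and self-attention by itself ignores token order, so the model has no inherent notion of position. Positional encoding injects that information, either by adding position vectors to the token embeddings or by modifying the attention computation. The original Transformer used fixed sinusoidal encodings at geometrically spaced frequencies; its authors hypothesized that these would let the model extrapolate to sequences longer than those seen in training. Learned absolute position embeddings (GPT-2, BERT) are trainable, but they cannot represent positions beyond the maximum length fixed at training time. Relative position encodings (T5, Transformer-XL) make attention depend on the distance between tokens rather than on their absolute positions. RoPE (Rotary Position Embedding) rotates query and key vectors by position-dependent angles, so that their dot product depends only on the relative offset between the two tokens. It is used in LLaMA, Mistral, and many other recent LLMs; extending it beyond the training context length usually relies on additional techniques such as position interpolation or frequency scaling.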
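To make the sinusoidal and rotary schemes concrete, here is a minimal sketch in plain Python. The function names (`sinusoidal_encoding`, `rope_rotate`, `dot`) and the default `base=10000.0` are illustrative assumptions, not any library's API. The rotary function pairs adjacent dimensions; real implementations may pair dimensions differently and operate on batched tensors.

```python
import math


def sinusoidal_encoding(num_positions, d_model, base=10000.0):
    """Return a num_positions x d_model table of fixed sinusoidal encodings.

    Even columns hold sin(pos / base^(2i/d_model)), odd columns the matching cos.
    """
    if d_model % 2 != 0:
        raise ValueError("d_model must be even")
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(d_model // 2):
            angle = pos / (base ** (2 * i / d_model))
            row.extend([math.sin(angle), math.cos(angle)])
        table.append(row)
    return table


def rope_rotate(vec, pos, base=10000.0):
    """Rotate each adjacent pair (vec[2i], vec[2i+1]) by the angle pos * base^(-2i/d).

    Assumption: adjacent-pair layout; some implementations pair index i with i + d/2.
    """
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("vector length must be even")
    out = []
    for i in range(d // 2):
        theta = pos * base ** (-2 * i / d)
        x, y = vec[2 * i], vec[2 * i + 1]
        c, s = math.cos(theta), math.sin(theta)
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    """Plain dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


# Sinusoidal table: one row per position, added to the token embeddings.
pe = sinusoidal_encoding(num_positions=4, d_model=8)

# RoPE: the attention score depends only on the offset between positions.
q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.25]
score_near = dot(rope_rotate(q, 5), rope_rotate(k, 3))    # offset 2
score_far = dot(rope_rotate(q, 12), rope_rotate(k, 10))   # offset 2 again
```

Here `score_near` and `score_far` agree up to floating-point error, because rotating the query by position m and the key by position n leaves a dot product that depends only on m - n. This is the sense in which RoPE encodes relative position while being applied per absolute position.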