What is the role of the KV cache in LLM inference?

Answer

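During autoregressive generation, each new token attends to all previous tokens. Without a cache, every decoding step would recompute the key (K) and value (V) projections for the entire prefix and rerun attention over it, which costs O(n²) attention work per step for a prefix of length n. The KV cache stores the K and V tensors of all previously processed tokens in every layer, so each step only computes the new token's Q, K, and V and attends over the cached entries. This reduces the per-step attention cost from O(n²) to O(n); summed over a full generation of n tokens, the total falls from O(n³) to O(n²).

KV cache memory grows linearly with sequence length and batch size (and also with the number of layers, KV heads, and head dimension), and it is a primary bottleneck for LLM serving. Multi-query attention (MQA) and grouped-query attention (GQA) shrink the cache by sharing K/V heads across query heads, while PagedAttention (vLLM) reduces the memory lost to fragmentation and over-allocation by storing the cache in fixed-size blocks.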
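The caching idea can be illustrated with a minimal single-head sketch in plain Python (no dependencies). All names here (`KVCache`, `step_with_cache`, `step_without_cache`, `outputs_match`, `kv_cache_bytes`) and the model configuration used in the size estimate are hypothetical, chosen for illustration and not taken from any particular library or model. It checks that attending over cached K/V gives the same output as recomputing K/V for the whole prefix, and estimates cache size under the assumption of one K and one V vector per token, per layer, per KV head.

```python
import math
import random


def matvec(W, x):
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def attend(q, keys, values):
    """Scaled dot-product attention of one query over the given keys/values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim)]


class KVCache:
    """Hypothetical per-layer, single-head cache: one K and one V per token."""

    def __init__(self):
        self.keys = []
        self.values = []

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)


def step_with_cache(x_new, Wq, Wk, Wv, cache):
    """Project only the new token, append its K/V, attend over the cache."""
    cache.append(matvec(Wk, x_new), matvec(Wv, x_new))
    return attend(matvec(Wq, x_new), cache.keys, cache.values)


def step_without_cache(xs, Wq, Wk, Wv):
    """Recompute K/V for every prefix token, then attend for the last one."""
    keys = [matvec(Wk, x) for x in xs]
    values = [matvec(Wv, x) for x in xs]
    return attend(matvec(Wq, xs[-1]), keys, values)


def outputs_match(n_tokens=5, dim=4, seed=0):
    """Return True if cached and uncached decoding agree at every step."""
    rng = random.Random(seed)

    def rand_matrix():
        return [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(dim)]

    Wq, Wk, Wv = rand_matrix(), rand_matrix(), rand_matrix()
    xs = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n_tokens)]
    cache = KVCache()
    for t in range(n_tokens):
        cached = step_with_cache(xs[t], Wq, Wk, Wv, cache)
        full = step_without_cache(xs[: t + 1], Wq, Wk, Wv)
        if not all(math.isclose(a, b, abs_tol=1e-12)
                   for a, b in zip(cached, full)):
            return False
    return len(cache.keys) == n_tokens


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch_size,
                   bytes_per_elem=2):
    """Estimated cache size; the leading 2 counts both K and V."""
    return (2 * n_layers * n_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


if __name__ == "__main__":
    print(outputs_match())
    # Assumed configuration: 32 layers, head_dim 128, 4096 tokens, 16-bit values.
    print(kv_cache_bytes(32, 32, 128, 4096, 1) / 2**30)  # 32 KV heads, in GiB
    print(kv_cache_bytes(32, 8, 128, 4096, 1) / 2**30)   # 8 KV heads (GQA-style)
```

Under these assumed settings the estimate is 2 GiB per sequence with 32 KV heads and 0.5 GiB with 8, which shows why sharing K/V heads, as MQA and GQA do, matters for serving. Real implementations typically preallocate tensors for the cache instead of appending to Python lists.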
