🤖 Machine Learning / AI
Advanced
What is the role of the KV cache in LLM inference?
Answer
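During autoregressive generation, each new token must attend to all previous tokens. Without a cache, every decoding step would re-run attention over the whole prefix, recomputing the key (K) and value (V) projections of all n past tokens and doing O(n²) attention work per step. The KV cache stores the K and V tensors of all previously processed tokens at every layer, so each step computes K/V only for the new token and attends its single query against the cached entries. This reduces per-step attention cost from O(n²) to O(n), and the total cost of generating n tokens from O(n³) to O(n²). KV cache memory grows linearly with sequence length and batch size (and with the number of layers and KV heads), and is a primary bottleneck for LLM serving. Multi-query attention (MQA) and grouped-query attention (GQA) shrink the cache by sharing K/V heads across query heads, while PagedAttention (vLLM) stores the cache in fixed-size blocks to reduce memory wasted by fragmentation and over-allocation.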
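To make the mechanism concrete, here is a minimal pure-Python sketch of single-head attention decoded with and without a cache. It is an illustration under simplifying assumptions (one layer, one head, random toy weights), not how any production engine is implemented. All names (KVCache, step_without_cache, demo, kv_cache_bytes) are hypothetical, and the model shape used in the usage note is an assumed example configuration.

```python
import math
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def attend(q, keys, values):
    """Scaled dot-product attention of one query over a list of keys/values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim)]

class KVCache:
    """Hypothetical single-head cache: one K and one V vector per past token."""
    def __init__(self):
        self.keys, self.values = [], []
        self.projections = 0  # number of tokens whose K/V were computed

    def step(self, x, Wq, Wk, Wv):
        # Only the new token is projected; older K/V are reused from the cache.
        self.keys.append(matvec(Wk, x))
        self.values.append(matvec(Wv, x))
        self.projections += 1
        return attend(matvec(Wq, x), self.keys, self.values)

def step_without_cache(prefix, Wq, Wk, Wv):
    """Recompute K and V for the whole prefix, then attend from the last token."""
    keys = [matvec(Wk, x) for x in prefix]
    values = [matvec(Wv, x) for x in prefix]
    return attend(matvec(Wq, prefix[-1]), keys, values), len(prefix)

def demo(n_tokens=6, d=4, seed=0):
    """Compare cached and uncached decoding on random toy inputs."""
    rng = random.Random(seed)
    def rand_matrix():
        return [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(d)]
    Wq, Wk, Wv = rand_matrix(), rand_matrix(), rand_matrix()
    xs = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(n_tokens)]
    cache, uncached_projections, max_diff = KVCache(), 0, 0.0
    for t in range(n_tokens):
        out_cached = cache.step(xs[t], Wq, Wk, Wv)
        out_full, n_proj = step_without_cache(xs[: t + 1], Wq, Wk, Wv)
        uncached_projections += n_proj
        max_diff = max(max_diff,
                       max(abs(a - b) for a, b in zip(out_cached, out_full)))
    return max_diff, cache.projections, uncached_projections

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch_size,
                   bytes_per_elem=2):
    """Cache size: 2 tensors (K and V) per layer, per KV head, per token."""
    return (2 * n_layers * n_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)
```

Calling demo() with the default 6 tokens performs 6 K/V projections with the cache versus 1 + 2 + ... + 6 = 21 without it, while both paths produce the same attention outputs. For the memory side, kv_cache_bytes(32, 32, 128, 4096, 1) gives 2 GiB for an assumed 32-layer model with 32 KV heads of dimension 128 in 16-bit precision at 4096 tokens; dropping to 8 KV heads, as GQA would, gives 512 MiB.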