What is the softmax function?

Answer

The softmax function converts a vector of raw scores (logits) into a probability distribution: every output lies between 0 and 1, and the outputs sum to 1. For an input vector z, softmax(z)ᵢ = exp(zᵢ) / Σⱼ exp(zⱼ). Because of the exponential, differences between logits are amplified: the largest logit always receives the highest probability, and a gap of d between two logits becomes a probability ratio of exp(d). Softmax is typically used in the output layer of multi-class classifiers, paired with categorical cross-entropy loss. A temperature parameter τ > 0, applied as softmax(z/τ), controls how sharp the distribution is: low τ → more peaked, high τ → more uniform (used in knowledge distillation and language model sampling).
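
The definition and the temperature effect can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function name `softmax`, the `temperature` argument, and the sample logits are all chosen here for the example. Subtracting the maximum before exponentiating is an added assumption, a common trick to avoid overflow that leaves the result unchanged because softmax is invariant to adding a constant to every logit.

```python
import math


def softmax(logits, temperature=1.0):
    """Return softmax(z / temperature) as a list of probabilities."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Scale the logits by the temperature tau.
    scaled = [z / temperature for z in logits]
    # Shift by the maximum so the largest exponent is exp(0) = 1.
    # This does not change the output; it only prevents overflow.
    shift = max(scaled)
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]                  # illustrative values only
probs = softmax(logits)                   # tau = 1
sharp = softmax(logits, temperature=0.5)  # low tau: more peaked
flat = softmax(logits, temperature=5.0)   # high tau: closer to uniform

print(probs, sharp, flat)
```

In each case the outputs sum to 1 and keep the ordering of the logits; only the amount of probability mass concentrated on the top entry changes with τ.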