What is knowledge distillation?

Answer

Knowledge distillation trains a small student model to mimic a larger, pre-trained teacher model. Instead of training only on one-hot hard labels, the student also learns from the teacher's soft labels (probability distributions over all classes), which carry richer information: the teacher's uncertainty and the similarities it sees between classes. The student loss is L = α·L_CE(hard labels) + (1-α)·L_KL(soft labels), where α in [0, 1] weights the two terms. A temperature parameter T > 1, applied to the logits of both teacher and student before the softmax, softens the distributions so that small probabilities carry more weight in the loss. Applications: model compression (BERT → DistilBERT), on-device inference, and multi-teacher distillation for NLP.
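
To make the loss concrete, here is a minimal sketch for a single example in plain Python. The function names (log_softmax, distillation_loss), the argument names, and the defaults alpha=0.5 and temperature=2.0 are illustrative assumptions, not part of any particular library. It takes L_KL to be KL(teacher || student) computed on temperature-softened distributions, which is the usual convention but is not spelled out above.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Return log-probabilities of softmax(logits / temperature), computed stably."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    log_norm = peak + math.log(sum(math.exp(s - peak) for s in scaled))
    return [s - log_norm for s in scaled]


def distillation_loss(student_logits, teacher_logits, label,
                      alpha=0.5, temperature=2.0):
    """alpha * CE(hard label) + (1 - alpha) * KL(teacher || student) at temperature T."""
    # Hard-label term: ordinary cross-entropy on the student's T = 1 distribution.
    ce = -log_softmax(student_logits)[label]

    # Soft-label term: both distributions are softened with the same temperature.
    log_teacher = log_softmax(teacher_logits, temperature)
    log_student = log_softmax(student_logits, temperature)
    kl = sum(math.exp(lt) * (lt - ls)
             for lt, ls in zip(log_teacher, log_student))

    return alpha * ce + (1 - alpha) * kl


# Hypothetical logits for a 3-class problem; the true class is index 0.
teacher = [5.0, 2.0, 0.5]
student = [2.0, 1.5, 0.2]
print(distillation_loss(student, teacher, label=0))
```

Many implementations also multiply the soft-label term by T², as suggested by Hinton et al., to keep its gradient magnitude comparable to the hard-label term when T changes. The sketch omits that factor to match the formula as written above.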