What is knowledge distillation?

Answer

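Knowledge distillation trains a small student model to mimic a large, pre-trained teacher model. Instead of training only on hard (one-hot) labels, the student also learns from the teacher's soft labels: probability distributions over all classes. These carry richer information than a single correct class, namely the teacher's uncertainty and the similarities it has learned between classes. The student loss is a weighted sum, L = α·L_CE(hard labels) + (1-α)·L_KL(soft labels), where L_CE is the cross-entropy against the ground truth, L_KL is the KL divergence between the teacher's and the student's softened distributions, and α in [0, 1] balances the two terms. A temperature T > 1 divides the logits of both teacher and student before the softmax, which flattens the distributions so that the small probabilities of non-target classes contribute more to the loss. The soft-label term is commonly multiplied by T² so that its gradients stay on a scale comparable to the hard-label term. Applications: model compression (BERT → DistilBERT), on-device inference, and multi-teacher distillation, where one student learns from several teachers, for example in NLP.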
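The loss above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function names `softmax` and `distillation_loss`, the default values of `alpha` and `temperature`, and the example logits are all assumptions made for this sketch, and the T² factor follows the common convention rather than anything stated in the formula above.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Row-wise softmax of logits / temperature, computed stably."""
    z = np.asarray(logits, dtype=np.float64) / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract row max to avoid overflow
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels,
                      alpha=0.5, temperature=2.0):
    """alpha * CE(hard labels) + (1 - alpha) * T^2 * KL(teacher || student).

    alpha and temperature defaults are illustrative assumptions.
    """
    eps = 1e-12  # guards against log(0)
    labels = np.asarray(labels)
    n = labels.shape[0]

    # Hard-label term: cross-entropy of the student at T = 1.
    p_student = softmax(student_logits)
    ce = -np.mean(np.log(p_student[np.arange(n), labels] + eps))

    # Soft-label term: KL divergence between temperature-softened distributions.
    q_teacher = softmax(teacher_logits, temperature)
    q_student = softmax(student_logits, temperature)
    kl = np.mean(np.sum(
        q_teacher * (np.log(q_teacher + eps) - np.log(q_student + eps)),
        axis=-1,
    ))

    # T^2 keeps the soft-term gradients comparable in scale to the hard term.
    return alpha * ce + (1.0 - alpha) * temperature ** 2 * kl

# Made-up logits for a batch of two examples and three classes.
teacher = np.array([[5.0, 2.0, 0.1], [0.5, 4.0, 1.0]])
student = np.array([[2.0, 1.0, 0.5], [1.0, 2.0, 0.3]])
labels = np.array([0, 1])

print(softmax(teacher, temperature=1.0))  # sharply peaked rows
print(softmax(teacher, temperature=4.0))  # flatter rows, small classes amplified
print(distillation_loss(student, teacher, labels))
```

In an actual training loop the teacher's logits would be computed without gradient tracking and only the student's parameters updated; a framework with automatic differentiation would replace the NumPy calls.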

