What is quantization in deep learning?

Answer

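Quantization reduces the numerical precision of a model's weights and/or activations, typically from 32-bit floating point (FP32) to a lower-precision format such as FP16, INT8, or INT4. Storage shrinks in proportion to the bit width (about 2× for FP16, 4× for INT8, and 8× for INT4 relative to FP32, ignoring the small overhead of stored scales and zero points), and inference can run faster on hardware with native low-precision arithmetic, partly because less data has to move through memory. Post-Training Quantization (PTQ) quantizes an already trained model, using at most a small calibration dataset to estimate value ranges. Quantization-Aware Training (QAT) simulates quantization during training, so the model learns weights that are more robust to the precision reduction; it usually preserves accuracy better than PTQ at the cost of additional training. GPTQ (a post-training method) and bitsandbytes (a library) enable 4-bit quantization of LLMs for deployment on consumer GPUs.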
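
To make the idea concrete, here is a minimal sketch of per-tensor affine (asymmetric) INT8 quantization in plain Python. It maps each float x to an integer q = round(x / scale) + zero_point and recovers an approximation with x ≈ scale * (q - zero_point). The function names quantize_int8 and dequantize are hypothetical, chosen for illustration; this is a generic textbook scheme, not what GPTQ or bitsandbytes actually implement.

```python
def quantize_int8(values):
    """Affine (asymmetric) quantization of a list of floats to INT8.

    Illustrative sketch: one scale and zero point for the whole list.
    """
    qmin, qmax = -128, 127
    # Extend the range to include 0.0 so that zero is exactly representable.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin)
    if scale == 0.0:
        scale = 1.0  # all inputs are zero; any positive scale works
    # The zero point is the integer that the real value 0.0 maps to.
    zero_point = round(qmin - lo / scale)
    zero_point = max(qmin, min(qmax, zero_point))
    # Round to the nearest integer level and clamp to the INT8 range.
    q = [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point


def dequantize(q, scale, zero_point):
    """Map INT8 codes back to approximate float values."""
    return [scale * (x - zero_point) for x in q]


weights = [-1.0, -0.5, 0.0, 0.25, 1.0]
codes, scale, zero_point = quantize_int8(weights)
restored = dequantize(codes, scale, zero_point)
# Largest difference between an original value and its round-trip value.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(codes, scale, zero_point, max_error)
```

Each code fits in one byte instead of four, which is where the 4× size reduction comes from. The round-trip error is bounded by roughly half a quantization step (scale / 2), so tensors with a wide value range or outliers lose more precision; that is the problem calibration in PTQ and simulated quantization in QAT are meant to manage.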