🤖 Machine Learning / AI
Advanced
What is the difference between model parallelism and data parallelism?
Answer
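Data parallelism splits the training data across multiple GPUs: each GPU holds a full copy of the model and processes its own shard of every mini-batch. After the backward pass, gradients are averaged across GPUs with an all-reduce so that every replica applies the same update. It works well when the model, together with its gradients and optimizer state, fits in a single GPU's memory. Model parallelism splits the model itself across multiple GPUs, so different layers or parts of the model run on different devices. It is needed when the model is too large for a single GPU, and it comes in two main forms. Tensor parallelism splits individual weight matrices across GPUs (as in Megatron-LM), so each GPU computes part of a layer's output. Pipeline parallelism assigns consecutive groups of layers (stages) to different GPUs and passes activations from one stage to the next. Large-scale LLM training often combines all three, an arrangement known as 3D parallelism: data + tensor + pipeline.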
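To make the three splits concrete, here is a minimal sketch in plain Python. It is a toy simulation, not a multi-GPU implementation: the "devices" are just loop iterations, the model is a single linear layer, and every function name (`grad_mse`, `data_parallel_grad`, `tensor_parallel_matvec`, `pipeline_forward`) is hypothetical and chosen for illustration. It assumes the batch size and the number of matrix rows divide evenly by the device count.

```python
# Toy simulation on Python lists. No real GPUs or communication libraries
# are involved; all names below are illustrative assumptions.

def matvec(W, x):
    """Compute y = W @ x for a row-major matrix W and a vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def grad_mse(w, batch):
    """Gradient of mean squared error for the linear model y_hat = w . x."""
    n = len(batch)
    g = [0.0] * len(w)
    for x, y in batch:
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        for j, xj in enumerate(x):
            g[j] += 2.0 * err * xj / n
    return g

def data_parallel_grad(w, batch, n_devices):
    """Data parallelism: every device has the full w and one equal shard
    of the batch; averaging the local gradients plays the role of all-reduce."""
    shard = len(batch) // n_devices  # assumes an even split
    local = [grad_mse(w, batch[d * shard:(d + 1) * shard])
             for d in range(n_devices)]
    return [sum(g[j] for g in local) / n_devices for j in range(len(w))]

def tensor_parallel_matvec(W, x, n_devices):
    """Tensor parallelism: each device owns a slice of W's rows (output
    features) and computes part of y; concatenation stands in for all-gather."""
    rows = len(W) // n_devices  # assumes an even split
    out = []
    for d in range(n_devices):
        out.extend(matvec(W[d * rows:(d + 1) * rows], x))
    return out

def pipeline_forward(stages, x):
    """Pipeline parallelism: each stage (one weight matrix here) would live
    on its own device; activations are handed from stage to stage."""
    for W in stages:
        x = matvec(W, x)
    return x
```

With equal-sized shards, `data_parallel_grad` returns the same gradient as `grad_mse` on the full batch (up to floating-point rounding), and `tensor_parallel_matvec` returns the same vector as `matvec`. The point of the sketch is that both schemes change where the work happens, not the result. A real pipeline would additionally split each mini-batch into micro-batches so that stages do not sit idle, which this sketch leaves out.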