What is the difference between model parallelism and data parallelism?

Question

Accepted Answer

Data parallelism splits the training data across multiple GPUs: each GPU holds a full copy of the model and processes its own shard of each mini-batch. Gradients are then aggregated across GPUs (typically with an all-reduce) so that every replica applies the same update. It works well when the model fits on a single GPU.

Model parallelism splits the model itself across multiple GPUs, so different layers or parts of the model run on different devices. It is needed when the model is too large for a single GPU. Tensor parallelism splits individual matrices across GPUs (used in Megatron-LM). Pip
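To make the contrast concrete, here is a minimal sketch in plain Python that simulates both ideas on lists, with no real GPUs or distributed library involved. Everything in it is an assumption made for illustration: the helper names `matvec` and `grad_mse`, the toy linear model, and the two simulated "devices". The first half averages per-shard gradients, which is what an all-reduce would do in data parallelism. The second half splits a weight matrix by columns, which is one simple way tensor parallelism can be arranged.

```python
# Toy simulation: "devices" are just list entries, no real GPUs are used.

def matvec(x, W):
    """Multiply row vector x (length n) by matrix W (n x m)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def grad_mse(w, batch):
    """Gradient of the mean squared error of y_hat = w . x over (x, y) pairs."""
    g = [0.0] * len(w)
    for x, y in batch:
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        for k, xk in enumerate(x):
            g[k] += 2.0 * err * xk / len(batch)
    return g

# --- Data parallelism: same weights everywhere, different data per device ---
w = [0.5, -1.0]
batch = [([1.0, 2.0], 1.0), ([3.0, 1.0], 0.0),
         ([0.0, 1.0], 2.0), ([2.0, 2.0], 1.0)]
shards = [batch[:2], batch[2:]]                       # one equal shard per device
local_grads = [grad_mse(w, shard) for shard in shards]
# Simulated all-reduce: average the per-device gradients.
avg_grad = [sum(g[k] for g in local_grads) / len(local_grads)
            for k in range(len(w))]
full_grad = grad_mse(w, batch)                        # single-device reference

# --- Tensor parallelism: one matrix split by columns across devices ---
W = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]
x = [1.0, -1.0]
W_parts = [[row[:2] for row in W], [row[2:] for row in W]]  # column halves
partial_outputs = [matvec(x, part) for part in W_parts]     # one per device
tp_output = partial_outputs[0] + partial_outputs[1]         # concatenate
full_output = matvec(x, W)                                  # reference

print(avg_grad, full_grad)
print(tp_output, full_output)
```

With equal-sized shards, the averaged gradient matches the full-batch gradient up to floating-point rounding, and the concatenated partial outputs match the unsplit matrix product. In a real setup the averaging and concatenation steps would be collective communication calls between devices rather than list operations.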
