What is the mixture of experts (MoE) architecture?

Answer

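A Mixture of Experts (MoE) model consists of multiple "expert" sub-networks and a router (gating network) that selects which experts process each input token. In Transformer-based MoE models, the experts typically replace the feed-forward block of a layer. Only a small subset of experts (e.g., 2 out of 64) is activated per token; this is known as conditional computation. Because per-token compute depends on the number of active experts rather than the total number of experts, MoE allows the parameter count to scale dramatically without a proportional increase in computation. Switch Transformer (Google) demonstrated that sparse MoE Transformers can train more efficiently than dense models at a comparable compute budget. Mixtral 8x7B uses MoE, and GPT-4 is rumored to as well. Challenges include load balancing (ensuring that all experts are used rather than a few receiving most of the tokens), training instability, and communication overhead when experts are spread across devices in distributed settings.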
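The routing step described above can be illustrated with a minimal pure-Python sketch. It is not taken from any particular model: the function names (`route_top_k`, `moe_layer`, `make_expert`), the linear router, and the toy experts that simply scale their input are all assumptions made for illustration. Real implementations operate on batched tensors and usually add a load-balancing loss.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k):
    """Pick the k highest-scoring experts and renormalize their weights.

    Returns a list of (expert_index, weight) pairs whose weights sum to 1.
    """
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:k]
    weights = softmax([router_logits[i] for i in chosen])
    return list(zip(chosen, weights))


def moe_layer(token, router_weights, experts, k=2):
    """Apply a sparse MoE layer to one token vector (hypothetical helper).

    router_weights holds one row per expert; the router logit for an expert
    is the dot product of the token with that expert's row.
    """
    logits = [sum(t * w for t, w in zip(token, row)) for row in router_weights]
    output = [0.0] * len(token)
    # Only the selected experts are evaluated: this is the conditional computation.
    for index, weight in route_top_k(logits, k):
        expert_out = experts[index](token)
        output = [o + weight * e for o, e in zip(output, expert_out)]
    return output


def make_expert(scale):
    """Toy stand-in for an expert network: multiplies the token by a constant."""
    return lambda token: [scale * t for t in token]


if __name__ == "__main__":
    experts = [make_expert(s) for s in (1.0, 2.0, 3.0)]
    router_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
    print(moe_layer([1.0, 2.0], router_weights, experts, k=2))
```

In this sketch the cost of a forward pass grows with `k`, not with `len(experts)`, which is why adding experts increases capacity without a matching increase in per-token compute.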

