What is the mixture of experts (MoE) architecture?

Answer

A Mixture of Experts (MoE) model consists of multiple "expert" sub-networks and a router (gating network) that selects which experts process each input token. Only a small subset of experts (e.g., 2 out of 64) is activated per token; this is called conditional computation. Because per-token compute depends on the number of active experts rather than the total number of experts, MoE allows dramatic scaling of model capacity (parameter count) without a proportional increase in computation. Switch Transformer (Google) demonstrated that sparse MoE Transformers can be more compute-efficient to train than dense models. Mixtral 8x7B uses MoE, and GPT-4 is rumored to as well. Challenges include load balancing (ensuring all experts are used), training instability, and communication overhead in distributed settings.
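
The routing step described above can be illustrated with a minimal sketch in plain Python. Everything here is an assumption made for illustration, not the implementation of any particular model: the class name `ToyMoE`, the helper `expert_load`, the use of a single linear map per expert, random untrained weights, and the choice to apply softmax over only the selected top-k router logits. It shows a router scoring all experts, only `top_k` of them running per token, and a simple count of how unevenly tokens are spread across experts (the load-balancing concern).

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


class ToyMoE:
    """Hypothetical toy MoE layer: each expert is one linear map (an assumption)."""

    def __init__(self, dim, num_experts, top_k, seed=0):
        rng = random.Random(seed)

        def rand_matrix(rows, cols):
            return [[rng.gauss(0, 0.1) for _ in range(cols)] for _ in range(rows)]

        self.top_k = top_k
        self.router = rand_matrix(num_experts, dim)  # one score row per expert
        self.experts = [rand_matrix(dim, dim) for _ in range(num_experts)]

    def route(self, x):
        """Return the indices of the top-k experts and their mixing weights."""
        logits = matvec(self.router, x)
        top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
        top = top[: self.top_k]
        # Assumption: normalize over the selected experts only.
        weights = softmax([logits[i] for i in top])
        return top, weights

    def forward(self, x):
        """Run only the selected experts and combine their outputs."""
        top, weights = self.route(x)
        out = [0.0] * len(x)
        for idx, w in zip(top, weights):
            y = matvec(self.experts[idx], x)  # the other experts are never evaluated
            out = [o + w * yi for o, yi in zip(out, y)]
        return out, top


def expert_load(moe, tokens):
    """Count how many tokens are routed to each expert."""
    counts = [0] * len(moe.experts)
    for x in tokens:
        top, _ = moe.route(x)
        for idx in top:
            counts[idx] += 1
    return counts


rng = random.Random(1)
tokens = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(100)]
moe = ToyMoE(dim=8, num_experts=64, top_k=2)

out, chosen = moe.forward(tokens[0])
load = expert_load(moe, tokens)
print("experts used for first token:", chosen)
print("busiest expert handled", max(load), "of", sum(load), "assignments")
```

In this sketch each token touches 2 of the 64 expert matrices, so the per-token expert computation stays the same as more experts are added, while the parameter count grows with the number of experts. Real systems typically add an auxiliary loss or capacity limit to keep the `load` counts roughly even; that part is not shown here.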