MoE (Mixture of Experts)
Also known as: MoE / Mixture of Experts / 混合エキスパート (Japanese)
A model architecture with multiple specialized sub-networks (experts) where only a sparse subset is activated per token, allowing parameter counts to scale without proportionally increasing compute.
Overview
MoE places multiple specialized feed-forward sub-networks (experts) inside the Transformer block, with a router selecting only a few experts for each token. Mixtral and Rakuten AI 2.0 use MoE, and GPT-4 is widely reported to. The total parameter count is large, but the number of parameters activated per token is small, keeping per-token compute lower than that of a comparably capable dense model.
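As a rough illustration of the routing mechanism, here is a minimal NumPy sketch of a sparsely activated MoE layer. The class name, dimensions, and top-2 routing are illustrative assumptions, not any specific model's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class SparseMoELayer:
    """Illustrative top-k MoE feed-forward layer (not any specific model)."""
    def __init__(self, d_model=16, d_hidden=32, n_experts=8, top_k=2, seed=0):
        rng = np.random.default_rng(seed)
        self.top_k = top_k
        # Router: one linear map producing a score per expert.
        self.w_router = rng.standard_normal((d_model, n_experts)) * 0.02
        # Each expert is an independent two-layer feed-forward network.
        self.experts = [
            (rng.standard_normal((d_model, d_hidden)) * 0.02,
             rng.standard_normal((d_hidden, d_model)) * 0.02)
            for _ in range(n_experts)
        ]

    def __call__(self, x):
        # x: (n_tokens, d_model). Each token is routed to its top_k experts.
        scores = softmax(x @ self.w_router)                 # (n_tokens, n_experts)
        top_experts = np.argsort(-scores, axis=-1)[:, :self.top_k]
        out = np.zeros_like(x)
        for t in range(x.shape[0]):
            # Only top_k of n_experts run for this token: sparse activation.
            gate = scores[t, top_experts[t]]
            gate = gate / gate.sum()                        # renormalize kept weights
            for g, e in zip(gate, top_experts[t]):
                w1, w2 = self.experts[e]
                out[t] += g * (np.maximum(x[t] @ w1, 0.0) @ w2)
        return out

tokens = np.random.default_rng(1).standard_normal((4, 16))
print(SparseMoELayer()(tokens).shape)  # (4, 16): only 2 of 8 experts ran per token
```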
MoE vs Dense Model
MoE achieves better FLOPs efficiency than a dense model of comparable capability, but all experts must reside in memory simultaneously, which raises VRAM requirements. Mixtral 8x7B illustrates this tradeoff: roughly 47B total parameters must be loaded, yet only about 13B are active for any given token.
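A back-of-the-envelope sketch of that tradeoff follows. The per-expert and shared parameter counts are rough assumptions chosen to land near Mixtral 8x7B's published totals (about 46.7B total, about 12.9B active), not exact figures.

```python
# Total vs. active parameters for a Mixtral-8x7B-style config:
# 8 experts, top-2 routing. Parameter splits below are approximate.
n_experts, top_k = 8, 2
expert_params = 5.6e9   # approx. feed-forward params per expert (assumption)
shared_params = 1.6e9   # approx. params shared by every token (assumption)

total_params = shared_params + n_experts * expert_params  # must all sit in memory
active_params = shared_params + top_k * expert_params     # used per token (compute)

print(f"total:  {total_params / 1e9:.1f}B parameters (VRAM footprint)")
print(f"active: {active_params / 1e9:.1f}B parameters (per-token compute)")
```

The gap between the two numbers is the whole point of MoE: compute scales with the active parameters, while memory scales with the total.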