株式会社オブライト
AI | 2026-05-17

MoE (Mixture of Experts)

Also known as: MoE / Mixture of Experts / 混合エキスパート

A model architecture with multiple specialized sub-networks (experts) where only a sparse subset is activated per token, allowing parameter counts to scale without proportionally increasing compute.


Overview

MoE replaces the Transformer's feed-forward block with multiple specialized sub-networks (experts), and a lightweight router selects only a few experts for each token. Mixtral and Rakuten AI 3.0 use MoE, and GPT-4 is widely reported to as well. The total parameter count is large, but only a small fraction of parameters is activated per token, so per-token compute stays lower than that of a comparably capable dense model.
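The routing step above can be sketched as follows. This is a minimal, illustrative top-k gating layer (the function name, linear experts, and dimensions are all hypothetical, not any production implementation):

```python
import numpy as np

def moe_layer(x, expert_weights, router_weights, k=2):
    """Sparse MoE forward pass for one token vector x (illustrative sketch).

    expert_weights: list of (d, d) matrices, one per expert
                    (simplified linear experts instead of full FFNs).
    router_weights: (d, n_experts) matrix producing one routing logit per expert.
    """
    logits = x @ router_weights            # router score for each expert
    top_k = np.argsort(logits)[-k:]        # indices of the k highest-scoring experts
    gate = np.exp(logits[top_k])
    gate /= gate.sum()                     # softmax over the selected experts only
    # Only the k selected experts are evaluated; the rest cost no compute.
    return sum(g * (x @ expert_weights[i]) for g, i in zip(gate, top_k))

rng = np.random.default_rng(0)
d, n_experts = 8, 4
x = rng.standard_normal(d)
experts = [rng.standard_normal((d, d)) for _ in range(n_experts)]
router = rng.standard_normal((d, n_experts))
y = moe_layer(x, experts, router, k=2)
print(y.shape)  # the layer preserves the model dimension: (8,)
```

With k=2 of 4 experts selected, each token pays for only half the expert compute even though all four experts' weights exist in the model.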

MoE vs Dense Model

At equivalent capability, MoE achieves better FLOPs efficiency than a dense model, but all experts must reside in memory simultaneously, so VRAM requirements grow with the total parameter count rather than the activated count. Mistral Small 4 (a 119B-parameter MoE) illustrates this tradeoff: strong capability per unit of compute, but a large memory footprint.
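The compute-versus-memory tradeoff comes down to simple arithmetic. The numbers below are hypothetical, chosen only to illustrate the gap between total parameters (memory) and active parameters (compute), not the specs of any named model:

```python
# Illustrative MoE sizing arithmetic (all figures are assumptions).
n_experts, top_k = 8, 2       # 8 experts per layer, 2 routed per token
params_per_expert = 7e9       # assume 7B parameters per expert
shared_params = 2e9           # attention, embeddings, router, etc.

# Memory: every expert must be loaded, whether routed to or not.
total = shared_params + n_experts * params_per_expert
# Compute: only the routed experts run for a given token.
active = shared_params + top_k * params_per_expert

print(f"total params (VRAM):    {total / 1e9:.0f}B")   # 58B
print(f"active params (FLOPs):  {active / 1e9:.0f}B")  # 16B
```

Under these assumptions the model occupies memory like a 58B dense model but spends per-token compute like a 16B one, which is exactly the asymmetry the section describes.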
