Architecture

Mixture of Experts (MoE)

A model architecture that uses multiple specialized sub-networks ("experts") and a routing mechanism that activates only a subset of them for each input, enabling larger total model capacity without proportional increases in computation.

Mixture of experts (MoE) is an architecture where a model contains multiple parallel "expert" sub-networks, and a learned routing mechanism selects which experts to activate for each input token. Instead of every token passing through all the model's parameters (as in dense transformers), MoE models might route each token to only 2 of 8 experts. This means the total parameter count can be very large while the active parameters per forward pass remain manageable.
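The routing step can be sketched in a few lines. This is a minimal illustration of top-k gating, not the implementation used by any particular model; all function names and shapes here are assumptions chosen for clarity.

```python
import numpy as np

def top_k_route(token_embedding, router_weights, k=2):
    """Score all experts for one token and keep only the top-k.

    token_embedding: (d_model,) vector for a single token.
    router_weights:  (d_model, num_experts) learned routing matrix.
    Returns the chosen expert indices and their normalized gate weights.
    """
    logits = token_embedding @ router_weights          # one score per expert
    chosen = np.argsort(logits)[-k:][::-1]             # indices of the k best experts
    gates = np.exp(logits[chosen] - logits[chosen].max())
    gates = gates / gates.sum()                        # softmax over the selected experts
    return chosen, gates

def moe_forward(token_embedding, router_weights, experts, k=2):
    """Run only the selected experts and mix their outputs by gate weight."""
    chosen, gates = top_k_route(token_embedding, router_weights, k)
    return sum(g * experts[i](token_embedding) for i, g in zip(chosen, gates))
```

With 8 experts and k=2, only a quarter of the expert parameters are touched per token, which is where the compute savings come from.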

Mixtral 8x7B, from Mistral AI, is one of the most well-known MoE models. It has roughly 47 billion total parameters but only activates about 13 billion for each token, achieving performance comparable to much larger dense models while being significantly faster for inference. GPT-4 is widely reported to use an MoE architecture as well, with estimates of 8 experts and over a trillion total parameters. Google's Switch Transformer research explored MoE at massive scales with thousands of experts.

The routing mechanism is the critical component of MoE architectures. A small neural network (the "router" or "gate") examines each token and assigns it to the most relevant experts. Good routing ensures that different experts specialize in different types of content — one might handle mathematical reasoning while another focuses on creative writing. If routing is poorly balanced, some experts become overloaded while others go unused, reducing efficiency. Auxiliary load-balancing losses added during training help prevent this.
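To make the load-balancing idea concrete, here is a sketch in the style of the Switch Transformer's auxiliary loss: it multiplies the fraction of tokens actually sent to each expert by the router's mean probability for that expert, so the loss is smallest when routing is uniform. This is an illustrative reconstruction, not verbatim from any codebase.

```python
import numpy as np

def load_balancing_loss(router_logits, expert_assignments, num_experts):
    """Auxiliary loss that penalizes uneven routing across experts.

    router_logits:      (num_tokens, num_experts) raw router scores.
    expert_assignments: (num_tokens,) index of the expert each token was sent to.
    Minimized (value 1.0) when tokens and probabilities are spread uniformly.
    """
    # Softmax per token to get router probabilities.
    probs = np.exp(router_logits - router_logits.max(axis=1, keepdims=True))
    probs = probs / probs.sum(axis=1, keepdims=True)
    # f_i: fraction of tokens dispatched to expert i.
    f = np.bincount(expert_assignments, minlength=num_experts) / len(expert_assignments)
    # P_i: mean router probability assigned to expert i.
    P = probs.mean(axis=0)
    return num_experts * np.sum(f * P)
```

If the router collapses and sends every token to one expert, the loss approaches the number of experts instead of 1, pushing gradients back toward a balanced assignment.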

MoE models offer a compelling cost-performance tradeoff. They can match or exceed the quality of dense models with the same active parameter count while having much higher total capacity (which translates to more stored knowledge). The main challenges are higher memory requirements (all expert parameters must be loaded, even if only some are active per token), more complex deployment, and potential inconsistency if different experts give conflicting signals. Despite these challenges, MoE is increasingly the architecture of choice for frontier models where both quality and efficiency matter.
