The main differences between the expert routing mechanisms in DeepSeek-V2 and DeepSeek-V3 can be summarized as follows:
DeepSeek-V2 Expert Routing
- Device-Limited Routing Mechanism: DeepSeek-V2 distributes its routed experts across multiple devices and constrains each token's routing so that its target experts sit on at most a small number of devices. The router first picks the devices holding the highest-affinity experts for that token and then performs the usual top-K selection only among the experts on those devices (see the routing sketch after this list). Bounding the number of devices per token caps the cross-device communication cost of expert parallelism[1][5].
- Auxiliary Losses for Load Balance: DeepSeek-V2 introduces three auxiliary losses (expert-level, device-level, and communication-level) to maintain load balance during training. They penalize the router when a few experts or devices receive a disproportionate share of tokens, preventing routing collapse, in which a handful of experts are heavily used while the rest are rarely updated (see the balance-loss sketch after this list)[1][6].
- Number of Experts and Activation: Each MoE layer in DeepSeek-V2 has 160 routed experts plus 2 shared experts, and only 6 routed experts (together with the shared experts) are activated per token. This sparse activation keeps the number of active parameters, and hence the per-token compute, far below the total parameter count[5].
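To make the device-limited routing concrete, here is a minimal PyTorch sketch of the idea: score each device by the best expert it holds for a token, keep only the top-M devices, then run an ordinary top-K over the surviving experts. The function name, the contiguous experts-per-device layout, and the toy sizes are illustrative assumptions, not DeepSeek's implementation.

```python
import torch

def device_limited_topk(affinity, experts_per_device, m_devices=3, top_k=6):
    """Device-limited top-K routing in the spirit of DeepSeek-V2.

    affinity:           [num_tokens, num_experts] softmax affinity scores
    experts_per_device: experts are assumed laid out contiguously, i.e.
                        device d owns experts [d*epd, (d+1)*epd).
    Returns (gate_weights, expert_indices), both [num_tokens, top_k].
    """
    num_tokens, num_experts = affinity.shape
    num_devices = num_experts // experts_per_device

    # 1) Score each device by the best expert affinity it holds for the token.
    per_device = affinity.view(num_tokens, num_devices, experts_per_device)
    device_scores = per_device.max(dim=-1).values                   # [T, D]

    # 2) Keep only the M highest-scoring devices per token.
    top_devices = device_scores.topk(m_devices, dim=-1).indices     # [T, M]
    device_mask = torch.zeros(num_tokens, num_devices, dtype=torch.bool)
    device_mask[torch.arange(num_tokens).unsqueeze(1), top_devices] = True
    expert_mask = device_mask.repeat_interleave(experts_per_device, dim=1)

    # 3) Ordinary top-K, restricted to experts on the selected devices.
    masked = affinity.masked_fill(~expert_mask, float("-inf"))
    gates, idx = masked.topk(top_k, dim=-1)
    return gates, idx

# Toy usage: 160 routed experts spread over 8 devices, top-6 routing.
scores = torch.softmax(torch.randn(4, 160), dim=-1)
gates, idx = device_limited_topk(scores, experts_per_device=20)
```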
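Similarly, a hedged sketch of the expert-level balance loss: for each expert it multiplies the fraction of tokens routed to that expert by the mean affinity the router assigns to it, so the loss is smallest when load is uniform. The device-level and communication-level variants follow the same pattern with the sums taken over device groups; the coefficient value here is illustrative.

```python
import torch

def expert_balance_loss(affinity, expert_indices, alpha=0.01):
    """Expert-level balance loss of the form  alpha * sum_i f_i * P_i,
    with f_i the (N/K-scaled) fraction of tokens routed to expert i and
    P_i the mean affinity the router assigns to expert i.

    affinity:       [T, N] softmax affinity scores over routed experts
    expert_indices: [T, K] experts actually selected for each token
    alpha:          balance coefficient (value here is illustrative)
    """
    num_tokens, num_experts = affinity.shape
    top_k = expert_indices.shape[1]

    selected = torch.zeros(num_tokens, num_experts)
    selected.scatter_(1, expert_indices, 1.0)          # 1 if expert chosen
    f = selected.mean(dim=0) * num_experts / top_k     # routed-token fraction
    p = affinity.mean(dim=0)                           # mean affinity mass
    return alpha * torch.sum(f * p)

# Toy usage with random routing decisions.
aff = torch.softmax(torch.randn(32, 160), dim=-1)
loss = expert_balance_loss(aff, aff.topk(6, dim=-1).indices)
```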
DeepSeek-V3 Expert Routing
- Increased Expert Specialization: DeepSeek-V3 scales the same DeepSeekMoE design up, raising the number of routed experts per layer from 160 to 256 (a 60% increase) while still activating only a small top-K subset per token. The larger expert pool increases the model's capacity to store knowledge without a proportional increase in per-token compute[2].
- Shared Experts: DeepSeek-V3 retains shared experts that are always activated: each MoE layer includes one shared expert alongside the routed ones, capturing common knowledge needed across contexts. In addition, the first three Transformer layers keep dense FFNs, with every parameter active, rather than MoE layers[2][4].
- Token-to-Expert Affinity: Routing is still driven by token-to-expert affinity scores computed in the embedding space (DeepSeek-V3 uses a sigmoid rather than a softmax over experts). Learned routers of this kind risk routing collapse, where tokens keep landing on the same few experts and the rest are under-trained; DeepSeek-V3 counters this with an auxiliary-loss-free strategy that adds a per-expert bias to the affinity scores for expert selection only, nudging load toward underused experts (see the layer sketch after this list)[2].
- Aggressive MoE Strategy: DeepSeek-V3 pairs the larger MoE with FP8 mixed-precision training, using fine-grained per-tile scaling so that most matrix multiplications run in 8-bit while staying numerically stable (a small scaling sketch follows this list). Cheaper training arithmetic plus sparse activation lets the model scale its total parameter count aggressively while keeping both training cost and per-token inference cost manageable[2][4].
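The shared-expert and affinity/bias points above can be pictured with a small PyTorch sketch of a DeepSeek-V3-style MoE layer: one always-active shared expert, sigmoid token-to-expert affinities, and a per-expert bias that influences which experts are selected but not the gating weights. Module names, dimensions, and the naive per-expert dispatch loop are illustrative assumptions, not the released architecture.

```python
import torch
import torch.nn as nn

class MoELayerSketch(nn.Module):
    """DeepSeek-V3-style MoE layer sketch: one always-active shared expert plus
    top-k routed experts picked by sigmoid token-to-expert affinity, with a
    per-expert bias that affects selection only (not the gating weights)."""

    def __init__(self, d_model=64, d_ff=128, num_routed=16, top_k=4):
        super().__init__()
        ffn = lambda: nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                                    nn.Linear(d_ff, d_model))
        self.shared = ffn()
        self.experts = nn.ModuleList([ffn() for _ in range(num_routed)])
        self.centroids = nn.Parameter(torch.randn(num_routed, d_model))
        # Selection-only bias: nudged between steps toward underused experts,
        # not trained by gradients (auxiliary-loss-free balancing).
        self.register_buffer("select_bias", torch.zeros(num_routed))
        self.top_k = top_k

    def forward(self, x):                                  # x: [T, d_model]
        affinity = torch.sigmoid(x @ self.centroids.t())   # [T, N]
        _, idx = (affinity + self.select_bias).topk(self.top_k, dim=-1)
        gates = torch.gather(affinity, 1, idx)             # bias-free weights
        gates = gates / gates.sum(dim=-1, keepdim=True)

        routed = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):          # naive dispatch
            rows, slots = (idx == e).nonzero(as_tuple=True)
            if rows.numel() == 0:
                continue
            routed = routed.index_add(
                0, rows, gates[rows, slots, None] * expert(x[rows]))
        return x + self.shared(x) + routed                 # residual + shared

# Toy usage
y = MoELayerSketch()(torch.randn(8, 64))
```

In the auxiliary-loss-free scheme, `select_bias` would be decreased for overloaded experts and increased for underloaded ones between training steps rather than learned, so load stays balanced without adding a loss term to the training objective.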
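The FP8 point concerns numerics rather than routing, but the core idea, fine-grained per-tile scaling so a single outlier does not wash out an entire tensor, can be sketched in a few lines. This assumes a recent PyTorch that provides the `torch.float8_e4m3fn` dtype; the 128-wide tiles mirror the activation tile size described for DeepSeek-V3, and the helper names are hypothetical.

```python
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max   # 448 for E4M3

def quantize_tiles(x, tile=128):
    """Per-tile FP8 quantization sketch: each 1 x `tile` slice of the last
    dimension gets its own scale, so one outlier only affects its own tile.
    Returns the FP8 payload plus per-tile scales for dequantization."""
    t, d = x.shape
    tiles = x.view(t, d // tile, tile)
    scale = tiles.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12) / FP8_MAX
    q = (tiles / scale).to(torch.float8_e4m3fn)
    return q, scale

def dequantize_tiles(q, scale):
    return (q.to(torch.float32) * scale).flatten(1)

x = torch.randn(4, 256)
q, s = quantize_tiles(x)
x_hat = dequantize_tiles(q, s)
print((x - x_hat).abs().max())   # small reconstruction error
```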
In summary, while both models use MoE architectures for efficient routing and sparse activation, DeepSeek-V3 enhances this approach with increased expert specialization, more aggressive MoE strategies, and adjustments to shared expert configurations. DeepSeek-V2 focuses on economical training and efficient inference through device-limited routing and load-balancing auxiliary losses.
Citations:
[1] https://arxiv.org/pdf/2405.04434.pdf
[2] https://fireworks.ai/blog/deepseek-model-architecture
[3] https://www.chipstrat.com/p/deepseek-moe-and-v2
[4] https://dirox.com/post/deepseek-v3-the-open-source-ai-revolution
[5] https://thesalt.substack.com/p/deepseek-v2-a-huge-llm-with-efficient
[6] https://stratechery.com/2025/deepseek-faq/
[7] https://arxiv.org/html/2405.04434v3
[8] https://www.youtube.com/watch?v=4ucnsFBQmDA