In DeepSeek-V3, the affinity score plays a crucial role in the expert selection process within its Mixture-of-Experts (MoE) architecture. This architecture is designed to efficiently handle large-scale language modeling tasks by dynamically activating a subset of experts based on the input tokens.
Affinity Score Calculation
The affinity score is computed from the dot product of the input token's hidden state and a specific expert's centroid, passed through a sigmoid. The centroid is a learned vector that serves as a representative direction for each routed expert, so the dot product measures how closely the token aligns with that expert's specialty.
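In the notation of the DeepSeek-V3 technical report[8], the affinity of token $$t$$ for routed expert $$i$$ is

$$s_{i,t} = \mathrm{Sigmoid}\left(u_t^{\top} e_i\right)$$

where $$u_t$$ is the token's hidden state and $$e_i$$ is the learned centroid vector of expert $$i$$.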
Expert Selection Process
1. Top-K Routing: For each input token, DeepSeek-V3 selects the 8 routed experts with the highest affinity scores, out of 256 routed experts per MoE layer (in addition to one shared expert that every token passes through). This is known as top-K routing, with K fixed at 8[1][7].
2. Bias Adjustment: To prevent routing collapse, where too many tokens are routed to the same few experts, DeepSeek-V3 adds a dynamic bias term $$b_i$$ to each expert's affinity score during routing. If an expert is overloaded, its bias is decreased; if it is underutilized, its bias is increased. Importantly, the bias influences only which experts are selected: the gating weights applied to expert outputs are still computed from the original, unbiased affinity scores. This keeps the workload balanced across experts without any explicit auxiliary loss[1][3].
3. Gating Mechanism: The affinity scores of the selected experts are normalized into gating weights that scale each expert's output before the results are summed. This ensures the model allocates computational resources efficiently, running only a small set of experts for each token[3].
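To make the three steps concrete, here is a minimal PyTorch sketch of this routing flow. It is an illustration of the mechanism, not DeepSeek's actual implementation: the function names, tensor shapes, and the value of the bias update speed `gamma` are assumptions, while the sign-based bias update follows the auxiliary-loss-free balancing strategy described in the technical report[8].

```python
import torch

def route_tokens(hidden: torch.Tensor,
                 centroids: torch.Tensor,
                 bias: torch.Tensor,
                 k: int = 8):
    """Sketch of DeepSeek-V3-style top-K routing.

    hidden:    (num_tokens, d_model)  token hidden states u_t
    centroids: (num_experts, d_model) learned expert centroids e_i
    bias:      (num_experts,)         load-balancing bias b_i
    """
    # Affinity: sigmoid of the token/centroid dot product.
    scores = torch.sigmoid(hidden @ centroids.T)        # (tokens, experts)

    # The bias is added only when choosing the top-K experts...
    _, topk_idx = torch.topk(scores + bias, k, dim=-1)  # (tokens, k)

    # ...while the gating weights come from the original, bias-free
    # affinities, renormalized over the selected experts.
    topk_scores = torch.gather(scores, -1, topk_idx)
    gates = topk_scores / topk_scores.sum(dim=-1, keepdim=True)
    return topk_idx, gates

def update_bias(bias: torch.Tensor, load: torch.Tensor,
                gamma: float = 1e-3) -> torch.Tensor:
    """Auxiliary-loss-free balancing: nudge an expert's bias down if it
    received more than its fair share of tokens, up otherwise.
    `gamma` (the bias update speed) is a hyperparameter; its value here
    is an assumption."""
    load = load.float()
    return bias - gamma * torch.sign(load - load.mean())
```

In training, `load` would be the per-expert token counts accumulated over a batch, and `update_bias` would be applied once per step. Because the gating weights ignore the bias, the adjustment rebalances routing without distorting how expert outputs are mixed.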
Benefits of the Affinity Score
- Efficiency: By selecting experts based on affinity scores, DeepSeek-V3 reduces computational costs by activating only a fraction of the model's total parameters for each token. This results in more efficient inference and training processes[4][8].
- Specialization: The affinity score allows for better specialization among experts. Each expert can focus on specific patterns or tasks, enhancing the model's overall representational power and ability to handle diverse inputs[1][2].
- Stability: The dynamic bias adjustment keeps any single expert from being persistently overloaded, maintaining stability during both training and inference. As a result, DeepSeek-V3 avoids dropping tokens when experts exceed capacity, a problem in earlier MoE designs[2][3].
In summary, the affinity score in DeepSeek-V3 is crucial for dynamically selecting the most relevant experts for each input token, ensuring efficient and specialized processing while maintaining model stability.
Citations:
[1] https://gonzoml.substack.com/p/deepseek-v3-technical-details
[2] https://martinfowler.com/articles/deepseek-papers.html
[3] https://www.youtube.com/watch?v=Bv7cT-_SpQY
[4] https://www.linkedin.com/pulse/what-main-benefit-mixture-experts-moe-models-qi-he-nkgbe
[5] https://www.linkedin.com/pulse/unpacking-deepseek-v3-technical-innovations-question-cost-statton-juplc
[6] https://latenode.com/blog/deepseek-v3-and-deepseek-r1-integrations-are-now-on-latenode
[7] https://www.linkedin.com/pulse/dynamically-selecting-number-expert-moe-models-like-deepseek-rosi%C4%87-ckytf
[8] https://arxiv.org/html/2412.19437v1
[9] https://www.bentoml.com/blog/the-complete-guide-to-deepseek-models-from-v3-to-r1-and-beyond