

How does the sigmoid function impact the affinity score calculation in DeepSeek-V3?


In DeepSeek-V3, the sigmoid function plays a crucial role in calculating the affinity scores used for expert routing in the Mixture of Experts (MoE) architecture. Unlike traditional MoE models, which typically normalize affinity scores with the softmax function, DeepSeek-V3 employs the sigmoid function. This choice impacts the model in several ways:
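To make the contrast concrete, here is a minimal numeric sketch (PyTorch is an assumption here; none of the cited sources show DeepSeek-V3's actual code). Softmax makes all experts compete for a fixed probability mass, while sigmoid scores each expert independently:

```python
import torch

# Hypothetical token-to-expert logits for four experts
logits = torch.tensor([2.0, 1.0, 0.5, -1.0])

softmax_scores = torch.softmax(logits, dim=-1)  # sum to 1; experts compete
sigmoid_scores = torch.sigmoid(logits)          # each in (0, 1), scored independently

print(softmax_scores)  # ~[0.610, 0.224, 0.136, 0.030]
print(sigmoid_scores)  # ~[0.881, 0.731, 0.622, 0.269]
```

With sigmoid, raising one expert's score does not automatically lower the others', which matters for the balancing mechanism described in the points below.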

1. Normalization and Routing: The sigmoid function computes each token-to-expert affinity score, and the scores of the selected experts are then normalized among themselves to produce the gating values. This gives a more flexible and nuanced routing mechanism than softmax, which can sometimes lead to routing collapse, where certain experts are overly favored[4][7].

2. Avoiding Routing Collapse: Routing collapse occurs when most tokens are routed to a small subset of experts, leading to inefficient use of computational resources. DeepSeek-V3 mitigates this by using sigmoid gating and introducing bias terms that dynamically adjust during training. These bias terms help balance the load across experts without relying on auxiliary losses that can negatively impact model performance[4][9].

3. Bias Terms and Dynamic Adjustment: The model maintains a bias term for each expert, which is added to that expert's affinity score before the top-K experts are selected. These bias terms are adjusted dynamically based on each expert's load: if an expert is overloaded, its bias decreases; if it is underloaded, its bias increases. Notably, the bias influences only which experts are selected; the gating values themselves are still derived from the original affinity scores. This keeps tokens distributed evenly across experts without the need for additional losses (see the routing sketch after this list)[4][8].

4. Complementary Sequence-Wise Auxiliary Loss: Although DeepSeek-V3 primarily avoids auxiliary losses, it includes a small sequence-wise balance loss to prevent extreme cases where a single sequence heavily favors a small subset of experts. This loss acts as a safeguard without significantly affecting overall training dynamics (its form is sketched after this list)[4].

5. Node-Limited Routing: To control communication costs, DeepSeek-V3 employs node-limited routing: each token is sent to at most M nodes, chosen according to the sum of the highest affinity scores of the experts hosted on each node. This constraint enables near-full computation-communication overlap during training, improving efficiency (a sketch follows)[4].
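Pulling points 1-3 together, the routing step can be sketched as follows. This is a minimal illustration, not DeepSeek-V3's released implementation: the function names, tensor shapes, and the update speed `gamma` are assumptions, but the structure follows the paper's description, in which the bias steers only the top-K selection while the gating values come from the raw sigmoid affinities:

```python
import torch

def route_tokens(hidden, centroids, bias, k):
    """Sigmoid routing with bias-adjusted top-K selection (illustrative).

    hidden:    (tokens, dim)   token representations
    centroids: (experts, dim)  per-expert centroid vectors
    bias:      (experts,)      load-balancing bias, used for selection only
    """
    affinity = torch.sigmoid(hidden @ centroids.T)         # (tokens, experts)
    _, top_idx = torch.topk(affinity + bias, k, dim=-1)    # bias steers selection
    selected = affinity.gather(-1, top_idx)                # gating uses raw affinity
    gates = selected / selected.sum(dim=-1, keepdim=True)  # normalize over the top-K
    return top_idx, gates

def update_bias(bias, expert_load, gamma=0.001):
    """Auxiliary-loss-free balancing: after each step, nudge the bias down
    for overloaded experts and up for underloaded ones (gamma is the bias
    update speed; the value here is a placeholder)."""
    return bias - gamma * torch.sign(expert_load - expert_load.mean())

# Hypothetical sizes: 4 tokens, 8 experts, hidden dim 16, top-2 routing
hidden, centroids, bias = torch.randn(4, 16), torch.randn(8, 16), torch.zeros(8)
top_idx, gates = route_tokens(hidden, centroids, bias, k=2)
load = torch.bincount(top_idx.reshape(-1), minlength=8).float()
bias = update_bias(bias, load)
```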
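For point 4, a sketch of what a sequence-wise balance loss of this kind can look like: each expert's routed-token fraction within one sequence is multiplied by its mean affinity, so the loss grows when a sequence concentrates on few experts (the coefficient `alpha` and all shapes are placeholder assumptions):

```python
import torch

def sequence_balance_loss(affinity, top_idx, alpha=1e-4):
    """Sequence-wise balance loss sketch for a single sequence.

    affinity: (T, E) normalized affinity scores (T tokens, E experts)
    top_idx:  (T, K) indices of the experts selected for each token
    """
    T, E = affinity.shape
    K = top_idx.shape[-1]
    # f_i: fraction of the sequence's tokens routed to expert i,
    # scaled so that a perfectly balanced sequence gives f_i = 1
    counts = torch.bincount(top_idx.reshape(-1), minlength=E).float()
    f = counts * E / (K * T)
    # P_i: mean affinity the sequence assigns to expert i
    P = affinity.mean(dim=0)
    return alpha * (f * P).sum()
```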
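And for point 5, node-limited routing can be approximated as a two-stage top-K: first choose at most M nodes per token by an aggregate per-node score, then pick the top-K experts only from those nodes. The expert layout (contiguous by node) and the helper below are illustrative assumptions, not the released implementation:

```python
import torch

def node_limited_topk(affinity, experts_per_node, m, k):
    """Restrict each token's top-K experts to at most m nodes (illustrative).

    affinity: (tokens, experts), experts laid out contiguously by node
    """
    T, E = affinity.shape
    n_nodes = E // experts_per_node
    per_node = affinity.view(T, n_nodes, experts_per_node)
    # Score each node by the sum of its top-(k // m) expert affinities
    node_scores = per_node.topk(k // m, dim=-1).values.sum(dim=-1)
    top_nodes = node_scores.topk(m, dim=-1).indices                    # (T, m)
    keep = torch.zeros(T, n_nodes, dtype=torch.bool).scatter_(1, top_nodes, True)
    keep = keep.repeat_interleave(experts_per_node, dim=1)             # (T, E)
    masked = affinity.masked_fill(~keep, float("-inf"))
    return masked.topk(k, dim=-1)  # top-K confined to the m chosen nodes

# Hypothetical setup: 4 tokens, 16 experts over 4 nodes, M=2 nodes, top-4 experts
scores = torch.sigmoid(torch.randn(4, 16))
vals, idx = node_limited_topk(scores, experts_per_node=4, m=2, k=4)
```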

Overall, the use of the sigmoid function in DeepSeek-V3 allows for a more flexible and efficient routing mechanism, contributing to the model's ability to balance expert utilization without sacrificing performance.

Citations:
[1] https://www.linkedin.com/posts/srijanie-dey_aibyhand-deeplearning-neuralnetworks-activity-7291477904792657920-ryE_
[2] https://community.aws/content/2rJj1WkztSfYwVfsIibhWxeqMf1/four-unique-takeaways-from-deepseek-v3?lang=en
[3] https://docs.openvino.ai/2025/notebooks/yolov11-keypoint-detection-with-output.html
[4] https://machinelearningatscale.substack.com/p/deepseek-v3-model
[5] https://ai.gopubby.com/deepseek-v3-explained-3-auxiliary-loss-free-load-balancing-4beeb734ab1f
[6] https://neurips.cc/virtual/2024/poster/96407
[7] https://www.gdsprs.com/bbs/board.php?bo_table=free&wr_id=2559&sst=wr_hit&sod=desc&sop=and&page=147&device=pc
[8] https://www.mlsys.ai/papers/deepseek_v3.html
[9] https://gonzoml.substack.com/p/deepseek-v3-technical-details