How does sigmoid gating help prevent routing collapse in DeepSeek-V3?

Sigmoid gating in DeepSeek-V3 plays a crucial role in preventing routing collapse, a failure mode in Mixture-of-Experts (MoE) models where the router consistently favors a few experts over the rest, leaving most experts undertrained and the model's capacity underused. Here's how sigmoid gating helps:

Traditional Softmax Gating vs. Sigmoid Gating

Traditional MoE models often use softmax gating, which can create a "winner-takes-all" dynamic. Because softmax normalizes the scores to sum to 1, experts compete directly for probability mass: an expert whose initial weights give it a slight head start receives more tokens, trains faster on them, and scores even higher on the next step. This positive feedback loop can leave the other experts underutilized and undertrained, culminating in routing collapse.
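A small illustration of the contrast, using made-up router logits: as training sharpens the logits, softmax funnels nearly all probability mass to the slightly-ahead expert, while independent sigmoid scores do not.

```python
import torch

# Made-up router logits: expert 0 is only slightly ahead of the rest.
logits = torch.tensor([2.4, 2.0, 1.9, 1.8])

# As training sharpens the logits (larger magnitudes), softmax funnels
# almost all probability mass to the leading expert.
for scale in (1, 4, 16):
    print(torch.softmax(scale * logits, dim=0))
# scale=16 -> roughly [0.998, 0.002, 0.000, 0.000]

# Sigmoid scores each expert on its own 0..1 scale, so the others
# keep high scores no matter how far ahead expert 0 gets.
print(torch.sigmoid(4 * logits))  # all four scores stay near 1.0
```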

In contrast, sigmoid gating assigns each expert a score between 0 and 1 independently, with no normalization across experts, so multiple experts can hold high scores simultaneously. DeepSeek-V3 then selects the top-scoring experts for each token and normalizes only within that selected subset. Because no expert's score comes at another's expense, the strict competition that drives routing collapse is removed, and every expert retains a realistic chance to receive tokens and contribute[1][4][6].
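A minimal sketch of this routing step, assuming per-expert centroid vectors; the function and variable names here are illustrative, not DeepSeek's:

```python
import torch

def sigmoid_gate(hidden, centroids, top_k=8):
    """Score every expert independently, then keep the top-k per token.

    hidden:    (tokens, dim)   token representations
    centroids: (experts, dim)  one learnable centroid per expert
    """
    # Each expert gets its own 0..1 affinity score; there is no
    # cross-expert normalization, so many experts can score highly at once.
    scores = torch.sigmoid(hidden @ centroids.T)             # (tokens, experts)
    top_scores, top_idx = scores.topk(top_k, dim=-1)
    # Normalize only among the selected experts to form gating weights.
    gates = top_scores / top_scores.sum(dim=-1, keepdim=True)
    return gates, top_idx
```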

Dynamic Bias Adjustment

DeepSeek-V3 further enhances sigmoid gating by introducing dynamic bias terms for each expert. These biases are adjusted during training based on the load of each expert. If an expert is overloaded, its bias is decreased to discourage further routing to it, while underloaded experts have their biases increased to attract more tokens. This dynamic adjustment helps maintain a balanced load across all experts, preventing any single expert from dominating the routing decisions and thus preventing routing collapse[2][4][6].
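In DeepSeek-V3's description, the bias influences which experts are selected but not the gate values themselves. The sketch below follows that, with a hypothetical step size `gamma` and per-batch load counting as simplifications:

```python
import torch

def route_with_bias(scores, bias, top_k=8, gamma=1e-3):
    """One biased routing step plus the load-based bias update.

    scores: (tokens, experts) sigmoid affinities
    bias:   (experts,)        per-expert balancing terms
    """
    # The bias shifts the ranking used for expert selection...
    _, top_idx = (scores + bias).topk(top_k, dim=-1)
    # ...but gate values come from the raw scores of the chosen experts.
    gates = torch.gather(scores, -1, top_idx)
    gates = gates / gates.sum(dim=-1, keepdim=True)

    # Count how many tokens each expert received in this batch.
    load = torch.zeros_like(bias).scatter_add_(
        0, top_idx.flatten(), torch.ones(top_idx.numel()))
    # Overloaded experts get nudged down, underloaded experts up.
    bias = bias - gamma * torch.sign(load - load.mean())
    return gates, top_idx, bias
```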

Hierarchical Gating

DeepSeek-V3 also employs hierarchical gating, which applies sparsity constraints at multiple levels. Initially, a coarse selection of experts is made, followed by finer filtering within selected groups. This hierarchical approach ensures that a diverse set of experts is activated for each token, further reducing the risk of routing collapse by preventing over-specialization and encouraging generalization across different domains[1][6].
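A sketch of this coarse-to-fine selection, assuming experts are partitioned into equal-size groups (the group sizes and names are illustrative):

```python
import torch

def grouped_topk(scores, n_groups=8, top_groups=4, top_k=8):
    """Coarse pass over groups, then a fine pass over experts inside them.

    scores: (tokens, experts); experts split into n_groups equal groups.
    """
    tokens, experts = scores.shape
    grouped = scores.view(tokens, n_groups, experts // n_groups)
    # Coarse selection: rank groups by their strongest expert score.
    group_score = grouped.max(dim=-1).values                 # (tokens, n_groups)
    _, keep = group_score.topk(top_groups, dim=-1)
    # Mask every expert outside the chosen groups.
    mask = torch.zeros(tokens, n_groups, dtype=torch.bool)
    mask.scatter_(1, keep, True)
    grouped = grouped.masked_fill(~mask.unsqueeze(-1), float("-inf"))
    # Fine selection: ordinary top-k over the surviving experts.
    return grouped.reshape(tokens, experts).topk(top_k, dim=-1)
```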

Node-Limited Routing

Additionally, DeepSeek-V3 uses node-limited routing, which caps the number of nodes whose experts a given token may be routed to. This bounds cross-node communication overhead, keeping training and inference efficient while preserving balanced expert utilization[6].
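Node-limited routing can be viewed as the same grouped selection with one group per node: reusing the `grouped_topk` sketch above, capping `top_groups` bounds how many nodes each token touches (the sizes below are made up):

```python
import torch

# Hypothetical layout: 64 routed experts spread evenly across 8 nodes.
scores = torch.sigmoid(torch.randn(16, 64))   # (tokens, experts) affinities

# Treating each node as a group and keeping at most 4 groups means each
# token's experts span at most 4 nodes, bounding cross-node traffic.
top_scores, top_idx = grouped_topk(scores, n_groups=8, top_groups=4, top_k=8)
```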

In summary, sigmoid gating in DeepSeek-V3 helps prevent routing collapse by allowing multiple experts to be activated simultaneously without forcing a strict competition among them. The dynamic bias adjustment and hierarchical gating further ensure that each expert is utilized effectively, maintaining a balanced load and preventing any expert from dominating the routing decisions.

Citations:
[1] https://www.linkedin.com/posts/sathiyakeerthi_how-deepseek-v3-picks-perfect-experts-activity-7287631625310412800-NCyV
[2] https://martinfowler.com/articles/deepseek-papers.html
[3] https://epochai.substack.com/p/how-has-deepseek-improved-the-transformer
[4] https://machinelearningatscale.substack.com/p/deepseek-v3-model
[5] https://fireworks.ai/blog/deepseek-model-architecture
[6] https://aman.ai/primers/ai/deepseek-R1/
[7] https://gonzoml.substack.com/p/deepseek-v3-technical-details
[8] https://www.kisekilabs.com/blog-posts/why-deepseek-v3-matters-in-the-world-of-llms