In DeepSeek-V3, the sigmoid activation function plays a crucial role in expert routing by changing how tokens are assigned to experts. Unlike earlier versions, which used a softmax function, DeepSeek-V3 employs a sigmoid to compute the affinity scores between tokens and experts. This change helps prevent extreme expert selection probabilities, which can lead to routing collapse: a situation where the model favors a few experts over the others, diminishing the benefits of specialization and efficiency.
Sigmoid Activation Function in Expert Routing
The sigmoid function, denoted as $$\sigma(\cdot)$$, is used to calculate the affinity score between a token and an expert. Specifically, the score $$s_{i,t}$$ for token $$t$$ and expert $$i$$ is computed as:
$$s_{i,t} = \sigma(u_t^T e_i)$$
where $$u_t$$ is the hidden representation of token $$t$$ (the input to the FFN layer) and $$e_i$$ is the learned centroid vector of expert $$i$$. The score reflects how well the token aligns with the expert's specialty, and because each score passes through its own sigmoid, it is computed independently of every other expert's score.
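As a minimal sketch of this computation (illustrative only; the dimensions and tensors below are placeholders, not values from the paper):

```python
import torch

d_model, n_experts = 16, 8           # toy dimensions (placeholders)
u_t = torch.randn(d_model)           # hidden state u_t of one token
E = torch.randn(n_experts, d_model)  # expert centroid vectors e_i, stacked row-wise

# s_{i,t} = sigmoid(u_t^T e_i): one independent affinity score per expert
s = torch.sigmoid(E @ u_t)           # shape: (n_experts,)
```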
Normalization and Selection
After computing these scores, DeepSeek-V3 selects the top-$$K_r$$ experts for each token and then normalizes the selected scores so they sum to one, yielding the gating weights[7]. This process ensures that each token is routed to the small subset of experts most relevant to it, promoting efficient and specialized processing.
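Continuing the sketch above, a plausible reading of this step (assuming, per the technical report, that normalization runs over the selected experts' raw scores):

```python
import torch

K_r = 2                            # routed experts per token (placeholder)
s = torch.sigmoid(torch.randn(8))  # affinity scores from the previous sketch

# Pick the K_r highest-scoring experts, then normalize only their scores
topk_scores, topk_idx = torch.topk(s, K_r)
gates = topk_scores / topk_scores.sum()  # gating weights summing to 1
```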
Bias Terms for Load Balancing
To prevent routing collapse and keep the load distribution among experts balanced, DeepSeek-V3 introduces dynamically adjustable bias terms. These bias terms are added to the affinity scores before the top experts are selected, but they do not affect the gating weights themselves. If an expert is overloaded, its bias term is decreased; if it is underloaded, the bias term is increased. This mechanism keeps the load balanced without relying on auxiliary loss functions, which can degrade model performance[1][3].
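A hedged sketch of this adjustment, assuming a simple fixed update step (the paper treats the bias update speed as a hyperparameter; the value below is hypothetical):

```python
import torch

n_experts = 8
bias = torch.zeros(n_experts)  # b_i: one bias per expert, used only for selection
gamma = 1e-3                   # bias update speed (hypothetical value)

def update_bias(bias, expert_load, target_load, gamma):
    """Nudge biases toward balance: overloaded experts down, underloaded up."""
    overloaded = expert_load > target_load
    return torch.where(overloaded, bias - gamma, bias + gamma)

# Selection then uses the biased scores, while gating still uses the raw scores s:
#   topk_idx = torch.topk(s + bias, K_r).indices
```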
Benefits of Sigmoid Over Softmax
Using a sigmoid instead of softmax decouples the selection scores of different experts. With softmax, the probabilities are normalized to sum to one, so a single strongly favored expert suppresses the scores of all the others. Sigmoid, by contrast, scores each expert independently, which reduces the likelihood of routing collapse and promotes more balanced expert utilization[4].
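A toy comparison makes the decoupling concrete: under softmax, one large logit drains probability mass from every other expert, while sigmoid leaves the other experts' scores intact:

```python
import torch

logits = torch.tensor([4.0, 1.0, 0.5, 0.2])

softmax_p = torch.softmax(logits, dim=0)  # coupled: scores compete for mass
sigmoid_s = torch.sigmoid(logits)         # independent: each score stands alone

print(softmax_p)  # ~[0.91, 0.05, 0.03, 0.02]: one expert dominates
print(sigmoid_s)  # ~[0.98, 0.73, 0.62, 0.55]: others keep substantial scores
```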
Overall, the sigmoid activation function in DeepSeek-V3 enhances the model's ability to efficiently route tokens to relevant experts while maintaining a balanced workload, which is crucial for achieving high performance and computational efficiency in large-scale Mixture-of-Experts (MoE) architectures.
Citations:
[1] https://gonzoml.substack.com/p/deepseek-v3-technical-details
[2] https://aman.ai/primers/ai/deepseek-R1/
[3] https://machinelearningatscale.substack.com/p/deepseek-v3-model
[4] https://mlfrontiers.substack.com/p/understanding-deepseek-v3
[5] https://docs.nvidia.com/nemo-framework/user-guide/latest/llms/deepseek_v3.html
[6] https://planetbanatt.net/articles/deepseek.html
[7] https://arxiv.org/pdf/2412.19437.pdf
[8] https://builtin.com/machine-learning/sigmoid-activation-function