DeepSeek's Mixture-of-Experts (MoE) models use several strategies to limit token dropping during training, and the latest version avoids dropping entirely. The key measures are:
1. Token-Dropping Strategy: DeepSeek-V2 drops tokens based on their routing probabilities, as conventional MoE models do. When the number of routed tokens exceeds a given capacity, the tokens with the lowest scores are dropped first. The capacity is computed differently depending on whether dropping is applied along the expert dimension or the device dimension[1].
2. Device-Level Token Dropping: Dropping is applied mainly at the device level. The routing scores of all tokens assigned to the experts on a device are sorted, and the tokens with the lowest probabilities are dropped until the device is within its capacity. This bounds the compute load on each device[1].
3. Guaranteed Token Retention: DeepSeek-V2 also ensures that the tokens belonging to approximately 10% of the training sequences are never dropped. Because the model is trained on some sequences with no dropping at all, it stays consistent between training and inference whether or not tokens are dropped at inference time[2][6].
4. Load Balancing: The latest version, DeepSeek-V3, does not drop any tokens during training. Its load-balancing strategy dynamically adjusts a per-expert bias used in routing, which keeps tokens evenly distributed across experts and removes the need for dropping. DeepSeek-V3 also trains with a multi-token prediction objective, but that is a training objective and is separate from load balancing[2].
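The device-level dropping described in items 1 to 3 can be sketched roughly as follows. This is a minimal illustration, not DeepSeek's implementation: the function name `device_level_drop`, the tuple layout, the `capacity_factor` default, and the `protected` set (standing in for tokens from sequences that are exempt from dropping) are all assumptions made for the example.

```python
import math

def device_level_drop(assignments, num_devices, capacity_factor=1.0,
                      protected=frozenset()):
    """Drop the lowest-scoring routed tokens on each device.

    assignments: list of (token_id, device_id, score) tuples, one per
                 routed token. Layout is hypothetical.
    protected:   token ids that must never be dropped (assumption that
                 models the "never dropped" sequences).
    Returns (kept, dropped) as lists of the same tuples.
    """
    # Assumed capacity rule: each device gets an equal share of the
    # routed tokens, scaled by capacity_factor.
    capacity = math.ceil(capacity_factor * len(assignments) / num_devices)

    kept, dropped = [], []
    for device in range(num_devices):
        on_device = [a for a in assignments if a[1] == device]
        # Protected tokens sort first, then the rest by descending score.
        on_device.sort(key=lambda a: (a[0] not in protected, -a[2]))
        n_protected = sum(1 for a in on_device if a[0] in protected)
        # Never cut into the protected tokens, even if over capacity.
        limit = max(capacity, n_protected)
        kept.extend(on_device[:limit])
        dropped.extend(on_device[limit:])
    return kept, dropped


# Usage: three tokens land on device 0, one on device 1, capacity is 2.
routed = [(0, 0, 0.9), (1, 0, 0.1), (2, 0, 0.5), (3, 1, 0.7)]
kept, dropped = device_level_drop(routed, num_devices=2)
# Token 1 has the lowest score on the overloaded device, so it is dropped.
```

Passing `protected={1}` in the same example keeps token 1 and drops token 2 instead, which is how the sketch models sequences that are exempt from dropping.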
Taken together, these measures reduce the amount of training data lost to dropping while keeping the computational load balanced across experts and devices.
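The dynamic load adjustment in item 4 can be illustrated with a small bias-based routing sketch. It assumes that a per-expert bias is added to the routing scores only when selecting experts, and that the bias is nudged down for experts above the mean load and up for experts below it. The names `route_top_k`, `update_bias`, and `gamma` are hypothetical, and the "above or below the mean" rule is an assumption about how overload is measured.

```python
def route_top_k(scores, bias, k):
    """Pick the k experts with the highest biased score for one token.

    scores: per-expert routing scores for the token.
    bias:   per-expert bias used for selection only (assumption).
    """
    ranked = sorted(range(len(scores)),
                    key=lambda e: scores[e] + bias[e],
                    reverse=True)
    return ranked[:k]


def update_bias(bias, load, gamma):
    """Nudge each expert's bias toward a balanced load.

    load:  number of tokens each expert received in the last step.
    gamma: update step size (hypothetical hyperparameter name).
    """
    mean_load = sum(load) / len(load)
    new_bias = []
    for b, n in zip(bias, load):
        if n > mean_load:
            new_bias.append(b - gamma)  # overloaded: make it less attractive
        elif n < mean_load:
            new_bias.append(b + gamma)  # underloaded: make it more attractive
        else:
            new_bias.append(b)
    return new_bias


# Usage: expert 0 is overloaded, so its bias drops and a borderline
# token is routed to expert 1 on the next step.
bias = update_bias([0.0, 0.0], load=[3, 1], gamma=0.1)
chosen = route_top_k([0.55, 0.5], bias, k=1)
```

Because the bias only shifts which experts are selected, load can be evened out step by step without discarding any token, which is the property item 4 relies on.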
Citations:
[1] https://github.com/deepseek-ai/DeepSeek-V2/issues/5
[2] https://arxiv.org/html/2412.19437v1
[3] https://www.tradingview.com/news/financemagnates:c7148cfce094b:0-why-is-crypto-going-down-fake-deepseek-tokens-affect-thousands-of-traders/
[4] https://planetbanatt.net/articles/deepseek.html
[5] https://www.financemagnates.com/cryptocurrency/why-is-crypto-going-down-fake-deepseek-tokens-affect-thousands-of-traders/
[6] https://arxiv.org/html/2405.04434v5
[7] https://www.binance.com/square/post/19557042115937
[8] https://community.openai.com/t/will-deepseek-r1-force-openai-to-reduce-token-prices/1097716