How does DeepSeek-R1's performance on the MATH-500 benchmark compare to its performance on the AIME 2024 benchmark?


DeepSeek-R1 demonstrates strong performance across various mathematical benchmarks, particularly on the MATH-500 and AIME 2024 tests. Here's a detailed comparison of its performance on these two benchmarks:

MATH-500 Benchmark

- Performance: DeepSeek-R1 achieves an impressive Pass@1 score of 97.3% on the MATH-500 benchmark, indicating that the model is highly effective at solving diverse high-school-level mathematical problems that require detailed reasoning (see the sketch after this list for how Pass@1 is typically computed)[1][4].
- Comparison to OpenAI o1-1217: DeepSeek-R1 slightly surpasses OpenAI o1-1217, which scores 96.4% on the same benchmark. This suggests that DeepSeek-R1 has a slight edge in handling the types of mathematical problems presented in MATH-500[4][6].
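
For context, Pass@1 measures how often a single sampled answer is correct, averaged over all problems in the benchmark; evaluations often grade several samples per problem and average the per-problem correctness to reduce variance. A minimal sketch of the computation, on hypothetical toy data rather than the actual evaluation harness:

```python
from typing import List

def pass_at_1(results: List[List[bool]]) -> float:
    # For each problem, take the fraction of sampled answers that are
    # correct, then average across problems. With one sample per problem
    # this reduces to plain accuracy.
    per_problem = [sum(samples) / len(samples) for samples in results]
    return sum(per_problem) / len(per_problem)

# Hypothetical toy data: 4 problems, 2 graded samples each.
score = pass_at_1([[True, True], [True, False], [True, True], [False, False]])
print(f"Pass@1 = {score:.1%}")  # Pass@1 = 62.5%
```

A 97.3% Pass@1 on MATH-500 therefore means the model's answer is correct on about 487 of the benchmark's 500 problems on average.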

AIME 2024 Benchmark

- Performance: On the AIME 2024 benchmark, DeepSeek-R1 achieves a Pass@1 score of 79.8%. This benchmark evaluates advanced multi-step mathematical reasoning, and DeepSeek-R1's performance indicates it can handle complex competition-level problems[1][4].
- Comparison to OpenAI o1-1217: DeepSeek-R1 also slightly outperforms OpenAI o1-1217 on AIME 2024, which scores 79.2%. This marginal difference suggests the two models are essentially neck and neck in advanced mathematical reasoning; the rough estimate after this list shows how small the gap is relative to the benchmark's size[4][6].
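
Because AIME-style benchmarks contain only a few dozen problems, small score gaps sit well inside sampling noise. A rough back-of-the-envelope check treats each reported score as a binomial proportion; note this is a simplification (Pass@1 is typically averaged over multiple samples), and the 30-problem count for AIME 2024 is an assumption not stated in the sources above:

```python
import math

def proportion_std_error(p: float, n: int) -> float:
    # Standard error of an observed success proportion p over n trials.
    return math.sqrt(p * (1 - p) / n)

# Assumption: AIME 2024 is scored over the full 30-problem set (AIME I + II).
N_PROBLEMS = 30
for model, score in [("DeepSeek-R1", 0.798), ("OpenAI o1-1217", 0.792)]:
    se = proportion_std_error(score, N_PROBLEMS)
    print(f"{model}: {score:.1%} ± {se:.1%} (1 standard error)")
```

With roughly ±7 percentage points of noise at this benchmark size, a 0.6-point gap is far too small on its own to call one model better on AIME 2024.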

Key Differences Between Benchmarks

- Problem Complexity: AIME 2024 focuses on harder, competition-level problems, whereas MATH-500 covers a broader range of high-school-level problems.
- Model Performance: DeepSeek-R1's higher score on MATH-500 (97.3% vs. 79.8%) largely reflects this difficulty gap: the model is near saturation on the broader high-school problems, while the harder AIME problems still leave substantial headroom.

Overall, DeepSeek-R1 demonstrates strong mathematical reasoning capabilities, with near-ceiling accuracy on the broad problem mix of MATH-500 and competitive performance on the more advanced reasoning tasks evaluated by AIME 2024.

Citations:
[1] https://huggingface.co/deepseek-ai/DeepSeek-R1
[2] https://artificialanalysis.ai/models/deepseek-r1
[3] https://blog.promptlayer.com/openai-o3-vs-deepseek-r1-an-analysis-of-reasoning-models/
[4] https://www.datacamp.com/blog/deepseek-r1
[5] https://arcprize.org/blog/r1-zero-r1-results-analysis
[6] https://www.inferless.com/learn/the-ultimate-guide-to-deepseek-models
[7] https://techcrunch.com/2025/01/27/deepseek-claims-its-reasoning-model-beats-openais-o1-on-certain-benchmarks/
[8] https://www.geekwire.com/2025/deepseeks-new-model-shows-that-ai-expertise-might-matter-more-than-compute-in-2025/