How does GPT-4.5's performance on SWE Manager tasks compare to IC SWE tasks?


GPT-4.5, OpenAI's latest large language model, performs noticeably differently across the two task categories of the SWE-Lancer benchmark. SWE-Lancer evaluates AI models on real-world freelance software engineering jobs, divided into Individual Contributor (IC) SWE tasks and SWE Manager tasks.

**IC SWE tasks** involve direct coding, debugging, and implementation: the model must modify the codebase and submit a solution, which is then graded with end-to-end tests. GPT-4.5 turned in a modest result here, solving 20% of these tasks. This indicates that while GPT-4.5 can assist with coding work, it still falls short of fully automating complex coding jobs; other leading models have shown around 26% accuracy on direct coding tasks[1][2].
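
To make the all-or-nothing end-to-end grading concrete, here is a minimal sketch of how an IC SWE task could be scored: apply the model's patch to the project checkout, then run the task's test suite; the task counts as solved only if every test passes. The task structure and helper names below are hypothetical illustrations, not the actual SWE-Lancer harness.

```python
import subprocess
from dataclasses import dataclass

@dataclass
class ICSWETask:
    repo_dir: str      # checkout of the freelance project
    model_patch: str   # unified diff produced by the model
    test_command: str  # end-to-end test suite for this task

def grade_ic_swe_task(task: ICSWETask) -> bool:
    """Return True only if the patch applies and all end-to-end tests pass."""
    # Apply the model-generated diff to the project checkout.
    apply = subprocess.run(
        ["git", "apply", "-"],
        input=task.model_patch,
        cwd=task.repo_dir,
        text=True,
        capture_output=True,
    )
    if apply.returncode != 0:
        return False  # the patch does not even apply cleanly

    # Run the end-to-end tests; the exit code is the only pass/fail signal.
    tests = subprocess.run(
        task.test_command,
        shell=True,
        cwd=task.repo_dir,
        capture_output=True,
    )
    return tests.returncode == 0
```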

**SWE Manager tasks**, by contrast, cast the model as a technical lead that must review competing implementation proposals and select the best one. GPT-4.5 performed markedly better on these tasks, achieving a 44% success rate. This suggests GPT-4.5 is more adept at managerial work, such as evaluating code quality and making strategic decisions, in line with the broader trend of frontier models reaching around 45% accuracy on management tasks[1][2].
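
In contrast to the IC grading above, a SWE Manager task reduces to a selection problem: the model's pick is compared against the proposal the original hiring manager actually chose, as described in the SWE-Lancer write-ups[2][7]. Again, the field names and helpers below are hypothetical, shown only to illustrate the shape of the evaluation.

```python
from dataclasses import dataclass

@dataclass
class ManagerTask:
    task_id: str
    proposals: list[str]  # competing implementation proposals for the job
    chosen_index: int     # proposal the original hiring manager selected

def grade_manager_task(task: ManagerTask, model_choice: int) -> bool:
    """The task is solved iff the model picks the manager's chosen proposal."""
    return model_choice == task.chosen_index

def success_rate(results: list[bool]) -> float:
    """Fraction of tasks solved; GPT-4.5's reported rate here is 0.44."""
    return sum(results) / len(results) if results else 0.0
```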

Overall, GPT-4.5 shows a clear disparity between the two task types, 20% on IC SWE tasks versus 44% on SWE Manager tasks, highlighting strengths in managerial decision-making and limitations in direct coding. This distinction underscores the potential of AI to support software engineering in evaluative and strategic roles, while complex implementation work still requires human oversight.

Citations:
[1] https://www.neowin.net/news/openai-announces-gpt-45-its-largest-and-most-knowledgeable-model-yet/
[2] https://adasci.org/benchmarking-ai-on-software-tasks-with-openai-swe-lancer/
[3] https://openai.com/index/introducing-gpt-4-5/
[4] https://www.techtarget.com/searchenterpriseai/tip/GPT-35-vs-GPT-4-Biggest-differences-to-consider
[5] https://topmostads.com/gpt-4-5-vs-gpt-5-release/
[6] https://community.openai.com/t/openai-releases-new-coding-benchmark-swe-lancer-showing-3-5-sonnet-beating-o1/1123976
[7] https://openai.com/index/swe-lancer/
[8] https://techcrunch.com/2025/02/27/openai-unveils-gpt-4-5-orion-its-largest-ai-model-yet/