GPT-4.5 Enhancements in Handling Conflicting Instructions and Safety Improvements

GPT-4.5 demonstrates improved performance in handling conflicting instructions compared to earlier versions, particularly through its enhanced adherence to an Instruction Hierarchy. This hierarchy allows the model to prioritize system messages over user inputs, mitigating risks from conflicting prompts. In evaluations, GPT-4.5 generally outperforms GPT-4o in scenarios where system and user messages conflict, indicating better ability to follow safety instructions and avoid being tricked by adversarial prompts[1][5].

Key Improvements in Handling Conflicting Instructions

1. Instruction Hierarchy Evaluation: GPT-4.5 shows improved accuracy in following system instructions over user messages. For instance, in a scenario where the model is instructed not to give away the answer to a math question, GPT-4.5 performs better than GPT-4o, though not as well as GPT-4o1[1].

2. Mitigation of Prompt Injections: By prioritizing system messages, GPT-4.5 reduces the risk of prompt injections and other attacks that could override its safety instructions. This is crucial for maintaining the model's integrity and preventing misuse[1][5].

3. Realistic Scenarios: In more realistic scenarios, such as when acting as a math tutor, GPT-4.5 is better at resisting attempts to trick it into providing unauthorized information. However, its performance is not perfect and can vary depending on the specific context and instructions provided[1].

4. Safety Evaluations: GPT-4.5 undergoes rigorous safety evaluations to ensure it does not comply with requests for harmful content. While it performs well in refusing unsafe content, it may overrefuse more than earlier models, indicating a cautious approach to handling ambiguous or potentially risky prompts[1].

Overall, GPT-4.5 offers significant improvements in handling conflicting instructions by adhering more closely to system guidelines and reducing the impact of adversarial user inputs. However, like all AI models, it is not immune to all forms of manipulation and continues to evolve with ongoing safety evaluations and updates[1][5].

Citations:
[1] https://cdn.openai.com/gpt-4-5-system-card.pdf
[2] https://gettalkative.com/info/gpt-models-compared
[3] https://www.vellum.ai/blog/gpt-4-5-is-here-heres-how-good-this-model-is
[4] https://www.techtarget.com/searchenterpriseai/tip/GPT-35-vs-GPT-4-Biggest-differences-to-consider
[5] https://www.reddit.com/r/singularity/comments/1izn175/openai_gpt45_system_card/
[6] https://www.reddit.com/r/OpenAI/comments/18monbs/gpt_4_has_been_toned_down_significantly_and/
[7] https://www.theverge.com/news/620021/openai-gpt-4-5-orion-ai-model-release
[8] https://www.axios.com/2025/02/27/chatgpt-45-model-openai-reasoning

How does GPT-4.5's performance compare to earlier versions in handling conflicting instructions

Key Improvements in Handling Conflicting Instructions