Home Arrow Icon Knowledge base Arrow Icon Global Arrow Icon How does GPT-4.5 perform in scenarios where user messages attempt to trick the model


How does GPT-4.5 perform in scenarios where user messages attempt to trick the model


GPT-4.5's performance in scenarios where user messages attempt to trick the model is evaluated through several assessments, focusing on its ability to resist adversarial prompts and maintain safety standards.

Safety Evaluations and Jailbreaks

1. Jailbreak Evaluations: These tests measure how well GPT-4.5 resists attempts to circumvent its safety mechanisms. The model is evaluated against human-sourced jailbreaks and the StrongReject benchmark, which assesses resistance to common adversarial attacks. GPT-4.5 performs well in human-sourced jailbreaks, achieving a high accuracy of 0.99. However, in the StrongReject benchmark, it scores 0.34, which is slightly lower than GPT-4o1's score of 0.87[1].

2. Instruction Hierarchy Evaluations: To mitigate prompt injection attacks, GPT-4.5 is trained to follow a hierarchy of instructions, prioritizing system messages over user messages. In evaluations where system and user messages conflict, GPT-4.5 generally performs well, with an accuracy of 0.76. This is an improvement over GPT-4o but slightly below GPT-4o1's performance[1].

3. Tutor Jailbreaks: In scenarios where the model is instructed not to reveal answers to math questions, GPT-4.5 shows moderate success, with an accuracy of 0.77. This is a significant improvement over GPT-4o but not as high as GPT-4o1's performance[1].

4. Phrase and Password Protection: GPT-4.5 is also evaluated on its ability to protect specific phrases or passwords from being revealed through user messages. It performs well in these tests, with accuracies of 0.86 for phrase protection and 0.92 for password protection[1].

Red Teaming Evaluations

GPT-4.5 undergoes red teaming evaluations designed to test its robustness against adversarial prompts. These evaluations cover scenarios such as illicit advice, extremism, hate crimes, political persuasion, and self-harm. GPT-4.5 produces safe outputs in about 51% of the first red teaming evaluation set, slightly higher than GPT-4o but lower than GPT-4o1. In a second evaluation focused on risky advice, GPT-4.5 performs better than GPT-4o but not as well as GPT-4o1 or deep research models[1].

Overall Performance

While GPT-4.5 demonstrates improvements in handling sensitive and adversarial prompts compared to earlier models, it still faces challenges in highly adversarial scenarios. Its performance is solid for general-purpose tasks but may not be optimal for advanced problem-solving or deep coding tasks compared to specialized models like o3-mini[3][5].

In summary, GPT-4.5 shows resilience against attempts to trick it, particularly in scenarios where it must prioritize system instructions over user inputs. However, it still has limitations in highly adversarial contexts, reflecting ongoing challenges in balancing safety and functionality in AI models.

Citations:
[1] https://cdn.openai.com/gpt-4-5-system-card.pdf
[2] https://platform.openai.com/docs/guides/prompt-engineering
[3] https://www.vellum.ai/blog/gpt-4-5-is-here-heres-how-good-this-model-is
[4] https://www.reddit.com/r/OpenAI/comments/18monbs/gpt_4_has_been_toned_down_significantly_and/
[5] https://www.reddit.com/r/singularity/comments/1izn175/openai_gpt45_system_card/
[6] https://mashable.com/article/openai-gpt-4-5-release-how-to-try
[7] https://www.reddit.com/r/OpenAI/comments/1iznny5/openai_gpt45_system_card/
[8] https://www.youtube.com/watch?v=0kBAS3s5ryw