GPT-4.5 is designed to handle conflicting instructions between system and user messages through an Instruction Hierarchy. This hierarchy helps the model prioritize system messages over user messages to mitigate risks associated with conflicting prompts, such as prompt injections or other attacks that might override the model's safety instructions.
Key Features of the Instruction Hierarchy
1. Message Classification: GPT-4.5 distinguishes between two types of messages: system messages and user messages. System messages are considered higher authority and are used to set safety guidelines or specific instructions for the model.
2. Conflict Resolution: When system and user messages conflict, GPT-4.5 is trained to follow the instructions in the system message. This ensures that the model adheres to safety guidelines and does not engage in harmful or disallowed activities.
3. Evaluation and Training: The model is evaluated on its ability to handle conflicts through various scenarios. For example, in a scenario where the model is instructed not to give away the answer to a math problem (system message), and the user tries to trick it into doing so (user message), GPT-4.5 must resist the user's attempt and follow the system instruction.
4. Performance: GPT-4.5 generally performs well in these evaluations, showing an improvement over previous models like GPT-4o in handling system-user message conflicts. However, there are specific scenarios where it may not perform as well as GPT-4o or GPT-4o1, such as in certain jailbreak tests where it is tricked into revealing information it should not[1][7].
Safety Evaluations
GPT-4.5 undergoes rigorous safety evaluations to ensure it does not generate disallowed content, such as hateful or illicit advice. These evaluations also assess the model's tendency to overrefuse benign prompts related to safety topics. The model's performance in these areas is critical to maintaining its safety and reliability when handling conflicting instructions[1].
Model Spec and Authority Levels
The Model Spec outlines the authority levels for different types of instructions, with platform-level instructions having the highest authority, followed by developer and user instructions. This hierarchy ensures that GPT-4.5 prioritizes safety and adheres to guidelines while still allowing customization by users and developers within set boundaries[2][5].
Overall, GPT-4.5's ability to handle conflicting instructions is a key aspect of its design, ensuring that it maintains safety and integrity while interacting with users.
Citations:
[1] https://cdn.openai.com/gpt-4-5-system-card.pdf
[2] https://model-spec.openai.com
[3] https://community.openai.com/t/how-to-deal-with-lazy-gpt-4/689286
[4] https://arxiv.org/html/2408.10943v1
[5] https://bgr.com/tech/chatgpt-ai-will-finally-start-covering-controversial-topics/
[6] https://fortune.com/2025/02/14/sam-altman-openai-plans-gpt-5-release-timelines/
[7] https://www.reddit.com/r/singularity/comments/1izn175/openai_gpt45_system_card/
[8] https://www.reddit.com/r/OpenAI/comments/197exmm/gpt4_has_only_been_getting_worse/