GPT-4.5 is designed to handle conflicting instructions through an Instruction Hierarchy, which prioritizes system messages over user messages to mitigate risks like prompt injections and other attacks overriding safety instructions[1]. However, whether GPT-4.5 can adapt to new types of conflicting instructions it hasn't been trained on is a complex question.
Training and Evaluation
GPT-4.5 has been trained using new supervision techniques combined with traditional methods like supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF)[1]. These methods aim to improve the model's alignment with user intent and its ability to follow instructions more accurately. The model has been evaluated on various scenarios where system and user messages conflict, showing improvements over previous models like GPT-4o[1].
Adaptability to New Conflicting Instructions
While GPT-4.5 demonstrates better performance in handling known types of conflicting instructions, its ability to adapt to entirely new types of conflicts depends on several factors:
1. Generalization Capabilities: GPT-4.5's training includes scaling unsupervised learning, which enhances its ability to generalize and understand broader contexts[1]. This could potentially help it recognize patterns in new conflicting instructions.
2. Instruction Hierarchy: The model's instruction hierarchy is designed to prioritize system messages, which helps in maintaining safety and adherence to predefined rules. However, if new conflicting instructions fall outside the scope of this hierarchy, the model might struggle to adapt without additional training or fine-tuning.
3. Prompt Engineering and Scaffolding: Users can employ advanced prompting techniques or scaffolding to guide the model towards understanding and following new instructions. This approach can help elicit behaviors beyond what the model was explicitly trained for[1].
4. Limitations and Safety Evaluations: Despite improvements, GPT-4.5 still faces challenges in handling complex or novel scenarios. Safety evaluations have shown that while the model performs well on known tasks, there is always a risk of it failing to follow instructions in unforeseen contexts[1].
Conclusion
While GPT-4.5 is more capable than its predecessors in handling conflicting instructions, its adaptability to entirely new types of conflicts is limited by its training data and design. It may require additional fine-tuning or creative prompting strategies to effectively handle novel scenarios. The ongoing research and user interactions with GPT-4.5 will provide more insights into its capabilities and limitations in this area.
Citations:
[1] https://cdn.openai.com/gpt-4-5-system-card.pdf
[2] https://assets.anthropic.com/m/983c85a201a962f/original/Alignment-Faking-in-Large-Language-Models-full-paper.pdf
[3] https://latenode.com/blog/chatgpt-4-5-review
[4] https://arxiv.org/pdf/2203.02155.pdf
[5] https://www.theverge.com/news/620067/openai-gpt-4-5-nearly-here
[6] https://community.openai.com/t/gpt-4o-not-following-simple-and-clear-instructions/768674
[7] https://www.reddit.com/r/singularity/comments/1bgpmnv/lets_say_gpt45_releases_next_week_what_are_your/
[8] https://www.reddit.com/r/OpenAI/comments/18monbs/gpt_4_has_been_toned_down_significantly_and/