GPT-4.5's Instruction Hierarchy is designed to mitigate the risk of prompt injections by establishing a clear priority order for different types of instructions. This hierarchy ensures that system messages, which are set by developers, take precedence over user messages and other inputs. Here's how it works and how it helps prevent prompt injection attacks:
Understanding Prompt Injections
Prompt injection attacks occur when malicious users manipulate AI models by providing inputs that override the original system instructions. This can lead to unintended behavior, such as revealing sensitive information or performing unauthorized actions[2][3].The Instruction Hierarchy
The instruction hierarchy in GPT-4.5 prioritizes instructions based on their source and importance. It categorizes inputs into several types, typically including:- System Messages: These are the highest-priority instructions set by developers. They define the primary tasks and constraints that the model should follow.
- User Messages: These are inputs provided by users and are considered lower in priority than system messages.
- Conversation History and Tool Outputs: These may also influence the model but are generally lower in priority than user messages[1][3].
Mitigating Prompt Injections
To mitigate prompt injection attacks, the instruction hierarchy ensures that GPT-4.5 adheres to the following principles:- Prioritization: The model prioritizes system messages over user inputs. If a user attempts to inject a prompt that contradicts the system instructions, the model will default to the original guidance[3][5].
- Detection of Misaligned Prompts: GPT-4.5 is trained to identify and ignore prompts that conflict with the system instructions. For example, if a user inputs "forget all previous instructions," the model will recognize this as a misaligned prompt and respond accordingly[3].
- Selective Ignoring of Lower-Priority Instructions: The model is designed to selectively ignore lower-priority instructions when they conflict with higher-priority ones. This ensures that the model maintains its intended behavior even when faced with malicious inputs[7].
Training and Evaluation
GPT-4.5 is trained using techniques that emphasize the instruction hierarchy, such as synthetic data generation and context distillation. These methods help the model learn to prioritize instructions effectively and resist malicious inputs[6][7]. The model's performance is evaluated in scenarios where system messages conflict with user inputs, ensuring that it adheres to the instruction hierarchy and maintains its security features[5].Overall, the instruction hierarchy in GPT-4.5 enhances the model's security by ensuring that it follows the intended instructions set by developers, even in the face of conflicting or malicious user inputs. This approach is crucial for building trustworthy AI systems that can operate safely in real-world applications.
Citations:
[1] https://arxiv.org/html/2502.08745v1
[2] https://www.aporia.com/learn/prompt-injection-types-prevention-examples/
[3] https://www.amitysolutions.com/blog/gpt4o-mini-instruction-hierarchy
[4] https://dev.to/jasny/protecting-against-prompt-injection-in-gpt-1gf8
[5] https://cdn.openai.com/gpt-4-5-system-card.pdf
[6] https://community.openai.com/t/protecting-llms-from-prompt-injections-and-jailbreaks-new-openai-paper/727636
[7] https://openai.com/index/the-instruction-hierarchy/
[8] https://arxiv.org/html/2311.11538v2