What methodologies were used to evaluate GPT-4.5's multilingual capabilities


To evaluate the multilingual capabilities of GPT-4.5, OpenAI employed several methodologies:

1. Translation of Test Sets: OpenAI translated the test set of the MMLU (Massive Multitask Language Understanding) benchmark into 14 languages using professional human translators. Together with the original English set, this gives coverage of 15 languages: Arabic, Bengali, Chinese (Simplified), English, French, German, Hindi, Indonesian, Italian, Japanese, Korean, Portuguese (Brazil), Spanish, Swahili, and Yoruba, allowing a broad assessment of GPT-4.5's performance across languages[1].

2. Zero-Shot Evaluation: The model was tested in a zero-shot setting, meaning it received no in-context examples and no language-specific fine-tuning for the task. This measures the model's ability to generalize across languages without explicit adaptation[1] (a minimal evaluation sketch follows this list).

3. Comparison with Previous Models: GPT-4.5's performance was compared to its predecessors, such as GPT-4o and o1, to assess improvements in multilingual capabilities. This comparison helps identify areas where GPT-4.5 has advanced and where it may still require improvement[1].

4. Safety Evaluations: While not exclusively focused on multilingual capabilities, safety evaluations also consider how well the model handles diverse linguistic inputs, ensuring it does not generate harmful or inappropriate content across languages[1].

5. External Evaluations: External frameworks such as the C-LARA platform have been used to evaluate GPT-4's multilingual processing[4], and similar tools could in principle be applied to GPT-4.5. However, the available information does not confirm whether OpenAI used such platforms for GPT-4.5.
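To make methodologies 1–3 concrete, the sketch below scores a translated multiple-choice test set language by language in a zero-shot setting and reports the per-language gap between GPT-4.5 and a predecessor. This is a minimal illustration, not OpenAI's actual harness: the file layout, field names, and model identifiers are assumptions made for the example.

```python
# Hypothetical sketch of a zero-shot multilingual MMLU-style evaluation.
# Assumptions (not from the system card): translated test sets live in local
# JSONL files ("mmlu_<lang>.jsonl") with "question", "choices" (four strings),
# and "answer" ("A"-"D") fields, and the models are reachable through the
# OpenAI chat completions API under the identifiers listed below.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

LANGUAGES = ["ar", "bn", "zh", "en", "fr", "de", "hi", "id",
             "it", "ja", "ko", "pt", "es", "sw", "yo"]
MODELS = ["gpt-4.5-preview", "gpt-4o"]  # assumed identifiers for the comparison

def ask(model: str, item: dict) -> str:
    """Zero-shot multiple-choice prompt: no examples, no language-specific tuning."""
    letters = ["A", "B", "C", "D"]
    options = "\n".join(f"{l}. {c}" for l, c in zip(letters, item["choices"]))
    prompt = (f"{item['question']}\n{options}\n"
              "Answer with a single letter (A, B, C, or D).")
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=1,
        temperature=0,
    )
    return resp.choices[0].message.content.strip().upper()[:1]

def accuracy(model: str, lang: str) -> float:
    """Fraction of items answered correctly for one language's test file."""
    with open(f"mmlu_{lang}.jsonl", encoding="utf-8") as f:
        items = [json.loads(line) for line in f]
    correct = sum(ask(model, item) == item["answer"] for item in items)
    return correct / len(items)

# Per-language accuracy for each model, plus the delta that makes the
# cross-model comparison (methodology 3) explicit.
for lang in LANGUAGES:
    scores = {m: accuracy(m, lang) for m in MODELS}
    delta = scores[MODELS[0]] - scores[MODELS[1]]
    print(f"{lang}: " + "  ".join(f"{m}={s:.3f}" for m, s in scores.items())
          + f"  delta={delta:+.3f}")
```

Reporting a per-language delta rather than a single aggregate score is what lets this kind of evaluation show where a newer model has advanced and where lower-resource languages such as Swahili or Yoruba may still lag.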

These methodologies collectively provide a robust assessment of GPT-4.5's multilingual performance, highlighting its strengths and areas for future improvement.

Citations:
[1] https://cdn.openai.com/gpt-4-5-system-card.pdf
[2] https://pmc.ncbi.nlm.nih.gov/articles/PMC11348013/
[3] https://openai.com/index/gpt-4-5-system-card/
[4] https://www.researchgate.net/publication/375999167_Using_C-LARA_to_evaluate_GPT-4's_multilingual_processing
[5] https://www.theverge.com/news/620021/openai-gpt-4-5-orion-ai-model-release
[6] https://www.mdpi.com/2227-7102/14/2/148
[7] https://venturebeat.com/ai/openai-releases-gpt-4-5/
[8] https://techcrunch.com/2025/02/27/openai-unveils-gpt-4-5-orion-its-largest-ai-model-yet/