Think You Have Solved Commonsense Reasoning? Try HellaSwagUltra

ICLR 2026 Conference Submission 16714 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Multilingual Large Language Models
Abstract: As large language models (LLMs) have evolved, widely used commonsense reasoning and natural language understanding benchmarks have become saturated. At the same time, the number of languages supported by LLMs has grown rapidly, yet existing benchmarks cover only a limited set of languages, leaving many unevaluated. Moreover, some multilingual benchmarks are built by translating English benchmarks, which introduces evaluation bias. To address these issues, we propose HellaSwagUltra, a commonsense reasoning and natural language understanding benchmark covering more than 60 languages, with a substantial amount of culture-specific local knowledge for each language. We design an automated data construction pipeline that makes the benchmark easy to expand continuously. Unlike existing work that tests reasoning skills explicitly, HellaSwagUltra embeds two commonsense or local-knowledge facts implicitly in the context of each question; each answer choice carries subtle cues indicating whether that knowledge is violated, so models must detect fine-grained differences among the options to select the most plausible continuation. In addition, we recruited experts for each language to fully review and correct all test items, and we continue to update them. Experiments show that even the strongest proprietary models (e.g., Gemini-2.5-Pro) achieve only 62.5% accuracy, while GPT-4o and leading open-source models remain near 40–50%. These results highlight that multilingual commonsense reasoning remains a major open challenge, and we release both the dataset and the pipeline to support future research.
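To make the described item structure concrete, below is a minimal sketch of what one such multiple-choice item and a generic accuracy evaluation could look like. The submission does not specify the released data format, so the schema and field names here (Item, context, choices, label, embedded_facts, pick_choice) are illustrative assumptions, not the authors' actual release.

```python
# Hypothetical sketch: one HellaSwagUltra-style item (two facts embedded
# implicitly in the context, choices that subtly honor or violate them)
# plus a generic accuracy loop for multiple-choice continuation benchmarks.
# All field names below are assumptions for illustration.

from dataclasses import dataclass

@dataclass
class Item:
    language: str               # e.g. "ja" for Japanese
    context: str                # scenario implicitly encoding two facts
    choices: list[str]          # continuations; some subtly violate a fact
    label: int                  # index of the most plausible continuation
    embedded_facts: list[str]   # the two underlying facts, kept for auditing

def accuracy(items: list[Item], pick_choice) -> float:
    """pick_choice(context, choices) -> chosen index; any model wrapper works."""
    correct = sum(1 for it in items if pick_choice(it.context, it.choices) == it.label)
    return correct / len(items)

# Toy usage with a trivial baseline that always picks the first option:
demo = [Item(
    "en",
    "At a dinner in Tokyo, Mia finished her rice and set her chopsticks down.",
    ["She rested them on the chopstick rest beside her bowl.",
     "She stuck them upright in the rice bowl."],
    0,
    ["Chopsticks placed upright in rice evoke funeral rites in Japan.",
     "A chopstick rest is the polite place to set chopsticks between bites."],
)]
print(accuracy(demo, lambda ctx, ch: 0))  # prints 1.0 for this toy item
```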
Primary Area: datasets and benchmarks
Submission Number: 16714