OlymBio-Bench: A Multimodal Challenge Towards Expert-Level Biological Reasoning

ICLR 2026 Conference Submission 16136 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Biology, Reasoning Model, Multimodal Benchmark
Abstract: Evaluating large language and multimodal models requires benchmarks that go beyond simple knowledge retrieval to assess complex reasoning, especially in scientific domains such as biology. Existing biology benchmarks fall short of this goal: they are text-only, target low-level recall, or lack the integrative reasoning demanded by graduate-level problems. To address this gap, we introduce OlymBio-Bench, a graduate-level multimodal benchmark for biology. Sourced from over 220 frontier research papers in the life sciences and curated by a dedicated team of over 60 authors and reviewers, the benchmark is uniquely challenging, requiring sophisticated inference and multi-step reasoning within realistic research contexts. A key feature is its inherent multimodality: the large majority of questions incorporate essential images, diagrams, and data plots that demand integrated visual and domain-specific understanding. We evaluate a range of state-of-the-art models on OlymBio-Bench, and our results show that even the most powerful models fail to achieve a passing score, revealing critical deficiencies in their ability to perform complex, multimodal scientific reasoning. We further observe a strong correlation between question complexity and model failure rates, with multimodal questions posing a greater challenge than text-only ones. These findings establish OlymBio-Bench as a formidable, unsolved challenge and a crucial resource for catalyzing the development of next-generation AI models capable of more advanced scientific reasoning.
Primary Area: datasets and benchmarks
Submission Number: 16136