OlymBio-Bench: A Multimodal Challenge Towards Expert-Level Biological Reasoning

ICLR 2026 Conference Submission 16136 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Biology, Reasoning Model, Multimodal Benchmark
Abstract: Evaluating large language and multimodal models requires benchmarks that go beyond simple knowledge retrieval to assess complex reasoning, especially in scientific domains such as biology. Existing biology benchmarks fall short of this goal: they are text-only, target low-level recall, or lack the integrative reasoning demanded by graduate-level problems. To address this gap, we introduce OlymBio-Bench, a graduate-level multimodal benchmark for biology. Sourced from over 220 frontier research papers in the life sciences and curated by a dedicated team of over 60 authors and reviewers, the benchmark is uniquely challenging, requiring sophisticated inference and multi-step reasoning within realistic research contexts. A key feature is its inherent multimodality: the large majority of questions incorporate essential images, diagrams, and data plots that demand integrated visual and domain-specific understanding. We evaluate a range of state-of-the-art models on OlymBio-Bench, and our results show that even the most powerful models fail to achieve a passing score, revealing critical deficiencies in their ability to perform complex, multimodal scientific reasoning. We further observe a strong correlation between question complexity and model failure rates, with multimodal questions posing a greater challenge than text-only ones. These findings establish OlymBio-Bench as a formidable, unsolved challenge and a crucial resource for catalyzing the development of next-generation AI models capable of more advanced scientific reasoning.
Primary Area: datasets and benchmarks
Submission Number: 16136