BMMR: A Large-Scale Bilingual Multimodal Multi-Discipline Reasoning Dataset

Zhiheng Xi; Guanyu Li; YuTao Fan; Honglin Guo; Yufang Liu; Xiaoran Fan; Jiaqi Liu; dingjinchao; Wangmeng Zuo; Zhenfei Yin; LEI BAI; Tao Ji; Tao Gui; Qi Zhang; Xuanjing Huang

Published: 18 Sept 2025, Last Modified: 18 Jan 2026

Venue: NeurIPS 2025 Datasets and Benchmarks Track poster

License: CC BY 4.0

Keywords: Large multimodal models, reasoning

Abstract: In this paper, we introduce BMMR, a large-scale bilingual, multimodal, multi-disciplinary reasoning dataset for the community to develop and evaluate large multimodal models (LMMs). BMMR comprises 100k university-level questions drawn from 300 UNESCO-defined subjects, spanning diverse formats (multiple-choice, fill-in-the-blank, and open-ended QA) and sourced from both print and digital media such as books, exams, and quizzes. All data are curated and filtered via a human-in-the-loop, automated, and scalable framework, and each instance is paired with a high-quality reasoning path. The dataset is organized into two parts: BMMR-Eval, which comprises 20k high-quality instances to comprehensively assess LMMs’ knowledge and reasoning across multiple disciplines in both Chinese and English; and BMMR-Train, which contains 80k instances to support further research and development, extending the current focus on mathematical reasoning to diverse disciplines and domains. In addition, we propose the process-based, multi-discipline BMMR-Verifier for accurate and fine-grained evaluation of LMMs’ reasoning. Extensive experiments reveal that (i) even SOTA models leave substantial headroom on BMMR-Eval; (ii) reasoning models exhibit discipline bias and outperform LMMs only on specific subjects; (iii) open-source models still trail their proprietary counterparts; and (iv) fine-tuning on BMMR-Train narrows this gap. Additionally, we conduct reasoning-chain analyses using BMMR-Verifier and other in-depth studies, uncovering the challenges LMMs currently face in multidisciplinary reasoning. We will release the data and models, and we believe our work offers valuable insights and contributions to the community.

Croissant File: json

Dataset URL: https://kaggle.com/datasets/4fd00a65e7829a1db8d79cf040652652f20e37c0cee6c878d3104a615b67d239

Code URL: https://anonymous.4open.science/r/BMMR-code-for-NIPS2025-48C3/

Primary Area: Datasets & Benchmarks for applications in language modeling and vision language modeling

Submission Number: 2221
