SciDA: Scientific Dynamic Assessor of LLMs

Junting Zhou; Tingjia Miao; Yiyan Liao; Qichao Wang; Zhoufutu Wen; Yuansong Zeng; Yanqin Wang; Yunjie Huang; Leqi Wang; Ge Yan; Yucheng Xia; Hongwan Gao; Qiguang Chen; Chen Dun; Renjie Zheng; Yitao Liang; Libo Qin; Tong Yang; Wenhao Huang; Ge Zhang; Wanxiang Che

16 Sept 2025 (modified: 11 Feb 2026). Submitted to ICLR 2026. License: CC BY 4.0

Keywords: Large Language Models, Multidisciplinary Benchmark, Data Contamination, Randomized Initialization, Dynamic Assessor

Abstract: Advances in the reasoning capabilities of Large Language Models (LLMs) enable them to solve scientific problems more effectively. A high-quality benchmark for comprehensive and appropriate assessment is therefore important, yet existing benchmarks either risk data contamination or cover too few disciplines. Specifically, because the data sources used for LLM training overlap with those of static benchmarks, the answer keys or numerical patterns of answers may be inadvertently memorized (i.e., data contamination), leading to systematic overestimation of reasoning capabilities, especially numerical reasoning. We propose **SciDA**, a multidisciplinary benchmark that consists exclusively of over 1k Olympic-level numerical computation problems and allows randomized numerical initialization in each inference round to avoid reliance on fixed numerical patterns. We conduct a series of experiments with top-performing closed-source and open-source LLMs and observe that their performance drops significantly under random numerical initialization. SciDA thus provides truthful and unbiased assessments of the numerical reasoning capabilities of LLMs. The evaluation framework has been anonymized and is publicly available at **https://anonymous.4open.science/r/SciDA-0184**
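
As an illustration of the idea of randomized numerical initialization described in the abstract, the following is a minimal Python sketch, not the SciDA implementation. The `ProblemTemplate` class, the `is_correct` helper, the free-fall example problem, the sampling ranges, and the 1% relative tolerance are all hypothetical and chosen only for illustration. The sketch assumes each problem can be stored as a text template with named numeric slots plus a reference solver, so that every evaluation round draws fresh values and recomputes the reference answer.

```python
import math
import random
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class ProblemTemplate:
    """Hypothetical parameterized problem (not SciDA's actual schema)."""
    text: str                               # problem statement with {name} slots
    ranges: Dict[str, Tuple[float, float]]  # assumed sampling range per variable
    solve: Callable[..., float]             # reference solution as a function of the variables

    def instantiate(self, rng: random.Random) -> Tuple[str, float]:
        # Draw fresh values, then derive both the question text and its answer from them.
        values = {
            name: round(rng.uniform(lo, hi), 2)
            for name, (lo, hi) in self.ranges.items()
        }
        return self.text.format(**values), self.solve(**values)


def is_correct(predicted: float, reference: float, rel_tol: float = 1e-2) -> bool:
    # Assumed scoring rule: accept answers within a relative tolerance of the reference.
    return math.isclose(predicted, reference, rel_tol=rel_tol)


# Illustrative problem only; it is not taken from the benchmark.
free_fall = ProblemTemplate(
    text=(
        "A ball is dropped from rest at a height of {h} m. "
        "Taking g = {g} m/s^2, how long does it take to reach the ground (in s)?"
    ),
    ranges={"h": (5.0, 50.0), "g": (9.7, 9.9)},
    solve=lambda h, g: math.sqrt(2 * h / g),
)

rng = random.Random(0)  # seeded so that a given round can be reproduced
for _ in range(3):
    question, answer = free_fall.instantiate(rng)
    print(question, "->", round(answer, 3))
```

Because the reference answer is recomputed from the sampled values, a model that has memorized the answer to one fixed instance gains nothing on a freshly drawn one, which is the effect the abstract attributes to random numerical initialization.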

Supplementary Material: zip

Primary Area: datasets and benchmarks

Submission Number: 8052
