AAAR-1.0: Assessing AI’s Potential to Assist Research

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
Abstract: Numerous studies have assessed the proficiency of AI systems, particularly large language models (LLMs), in facilitating everyday tasks such as email writing, question answering, and creative content generation. However, researchers face unique challenges and opportunities in leveraging LLMs for their own work, such as brainstorming research ideas, designing experiments, and writing or reviewing papers. In this study, we introduce AAAR-1.0, a benchmark dataset designed to evaluate LLM performance on three fundamental, expertise-intensive research tasks: (i) EquationInference, assessing the correctness of equations based on the contextual information in paper submissions; (ii) ExperimentDesign, designing experiments to validate research ideas and solutions; and (iii) PaperWeakness, identifying weaknesses in paper submissions. AAAR-1.0 differs from prior benchmarks in two key ways: first, it is explicitly research-oriented, with tasks requiring deep domain expertise; second, it is researcher-oriented, mirroring the primary activities that researchers engage in on a daily basis. An evaluation of both open-source and proprietary LLMs reveals both their potential and their limitations in conducting sophisticated research tasks. We will release AAAR-1.0 and continue iterating on it in future versions.
Lay Summary: Artificial intelligence has shown great promise in helping with routine tasks like writing emails or answering questions, but its usefulness in supporting the work of scientific researchers remains unclear. For example, researchers often need to check the accuracy of equations, design effective experiments, and spot weaknesses in scientific papers—tasks that require in-depth expertise and careful reasoning. Our work introduces AAAR-1.0, a new benchmark designed to see how well modern AI language models, such as ChatGPT and similar systems, can handle these demanding research tasks. AAAR-1.0 focuses on three key activities that scientists regularly perform: evaluating equations in papers, devising plans for experiments, and giving useful feedback on scientific drafts. We tested various AI models and found that, while they can sometimes offer helpful or creative ideas, they still struggle with the level of accuracy and insight needed for advanced research. Our benchmark aims to guide improvements in AI tools and help researchers use them thoughtfully, treating them as supportive assistants rather than replacements.
Link To Code: https://github.com/RenzeLou/AAAR-1.0
Primary Area: General Machine Learning->Evaluation
Keywords: LLMs, Benchmark, AI4Research
Submission Number: 10684