Keywords: Real-world Data-centric automatic R&D Benchmark, data-centric automatic R&D, trustworthy models
Abstract: The progress of humanity is driven by successful discoveries accompanied by countless failed experiments. Researchers often seek potential solutions described in related literature (raw information) and verify them through experiments. With the explosive growth of deep learning literature and methods, this process places an ever-greater burden on researchers and leaves successful discoveries veiled. Automating such a research and development (R&D) process is therefore an urgent need. In this paper, we make the first effort to formalize this goal by proposing a **R**eal-world **D**ata-centric automatic **R**&**D** **Bench**mark, namely RD2Bench. RD2Bench benchmarks the whole data-centric automatic R&D (D-CARD) process, including extracting methods (formulas and models) from raw information (reports and papers) and implementing them in code. Specifically, to investigate the capability boundaries of state-of-the-art (SOTA) large language models (LLMs) in the unexplored D-CARD setting, we conduct exhaustive and expensive human annotations and experiments. We evaluate the performance of SOTA LLMs on our identified 27 formulas and 6 models, spanning various difficulty levels, from financial reports and ML papers. We find that although RD2Bench is very challenging, SOTA LLMs possess promising potential to drive further development of D-CARD. We appeal to research teams with diverse domain expertise to construct domain-specific D-CARD benchmarks, contributing to both a cross-domain D-CARD platform and a potentially revolutionary upgrade to human productivity.
Primary Area: datasets and benchmarks
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 4338