CURIE: Evaluating LLMs on Multitask Scientific Long-Context Understanding and Reasoning

Hao Cui; Zahra Shamsi; Gowoon Cheon; Xuejian Ma; Shutong Li; Maria Tikhanovskaya; Peter Christian Norgaard; Nayantara Mudur; Martyna Beata Plomecka; Paul Raccuglia; Yasaman Bahri; Victor V. Albert; Pranesh Srinivasan; Haining Pan; Philippe Faist; Brian A Rohr; Michael J. Statt; Dan Morris; Drew Purves; Elise Kleeman; Ruth Alcantara; Matthew Abraham; Muqthar Mohammad; Ean Phing VanLee; Chenfei Jiang; Elizabeth Dorfman; Eun-Ah Kim; Michael Brenner; Sameera S Ponda; Subhashini Venugopalan

Published: 22 Jan 2025, Last Modified: 02 Mar 2025. Venue: ICLR 2025 Poster. License: CC BY 4.0

Keywords: science, LLMs, evaluation, benchmark, long-context

TL;DR: CURIE is a long-context comprehension benchmark that assesses the ability of LLMs to assist in realistic scientific workflows requiring deep expertise.

Abstract: Scientific problem-solving involves synthesizing information while applying expert knowledge. We introduce CURIE, a scientific long-Context Understanding, Reasoning, and Information Extraction benchmark to measure the potential of Large Language Models (LLMs) in scientific problem-solving and in assisting scientists in realistic workflows. This benchmark introduces ten challenging tasks with a total of 580 problem and solution pairs curated by experts in six disciplines (materials science, condensed matter physics, quantum computing, geo-spatial analysis, biodiversity, and proteins), covering both experimental and theoretical workflows in science. We evaluate a range of closed and open LLMs on tasks in CURIE, which require domain expertise, comprehension of long in-context information, and multi-step reasoning. While Gemini Flash 2.0 and Claude-3 show consistently high comprehension across domains, the popular GPT-4o and command-R+ fail dramatically on protein sequencing tasks. With the best performance at 32%, there is much room for improvement for all models. We hope that insights gained from CURIE can guide the future development of LLMs in the sciences. Links to the data and evaluation code are available at https://github.com/google/curie

Primary Area: foundation or frontier models, including LLMs

Submission Number: 3300
