ClimateViz: A Dataset for Evaluating the Fact-checking and Reasoning Abilities of LLMs using Facts Extracted from Scientific Graphs
Abstract: This paper introduces ClimateViz, the largest dataset to date for evaluating the fact-checking and reasoning capabilities of large language models (LLMs) in the climate science domain. ClimateViz comprises claims extracted by humans from high-quality scientific graphics and checked for accuracy and domain relevance. To advance the state of the art in NLP for fact-checking, we develop a robust pipeline that systematically generates highly similar but false claims. Additionally, we introduce ReasonClim, a complementary benchmark built using graph-based methods to evaluate spatial, temporal, and spatio-temporal reasoning tasks. To assess LLM performance on these tasks, we conduct a comprehensive evaluation of state-of-the-art models. Our findings show that LLMs struggle to detect certain types of false claims, especially those generated through exaggeration. The results also highlight significant challenges in fact verification and reasoning over climate data, particularly in temporal reasoning tasks.
By providing a benchmark for evaluating LLMs on real-world climate data, ClimateViz and ReasonClim support the development of more reliable AI systems for climate science applications.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: benchmarking, NLP datasets, evaluation, metrics, statistical testing for evaluation
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data resources, Data analysis
Languages Studied: English
Submission Number: 1408