ClimateViz: A Dataset for Evaluating the Fact-checking and Reasoning Abilities of LLMs using Facts Extracted from Scientific Graphs
Abstract: This paper introduces ClimateViz, the largest dataset to date for evaluating the fact-checking and reasoning capabilities of large language models (LLMs) in the climate science domain. ClimateViz comprises claims extracted by humans from high-quality scientific graphics and checked for accuracy and domain relevance. To advance the state of the art in NLP for fact-checking, we develop a robust pipeline that systematically generates highly similar but false claims. Additionally, we introduce ReasonClim, a complementary benchmark built using graph-based methods to evaluate spatial, temporal, and spatio-temporal reasoning tasks. To assess LLM performance on these tasks, we conduct a comprehensive evaluation of state-of-the-art models. Our findings show that LLMs struggle to detect certain types of false claims, especially those generated through exaggeration. The results also highlight significant challenges in fact verification and reasoning over climate data, particularly in temporal reasoning tasks.
By providing a benchmark for evaluating LLMs on real-world climate data, ClimateViz and ReasonClim support the development of more reliable AI systems for climate science applications.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: benchmarking, NLP datasets, evaluation, metrics, statistical testing for evaluation
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data resources, Data analysis
Languages Studied: English
Submission Number: 1408