Keywords: Large Language Models, Benchmark, Meteorological Reasoning
Abstract: Recent advances in data-driven weather modeling have enabled accurate numerical forecasts, whose outputs are often summarized as natural-language descriptions for interpretation and decision making. While large language models (LLMs) show promise in scientific reasoning, their ability to reason over text-only meteorological summaries, subject to physical constraints, incomplete evidence, and inherent uncertainty, remains poorly understood. Existing benchmarks rely primarily on multimodal inputs or fact verification, leaving this gap unaddressed. We introduce WeatherBench-R, a large-scale text-only benchmark for meteorological reasoning over U.S. weather events, constructed from ERA5 reanalysis summaries aligned with real-world NOAA storm records. WeatherBench-R decomposes reasoning into three complementary tasks: physical plausibility recognition from aggregate trends, consistency verification under partial and underspecified evidence, and counterfactual evidence reasoning that probes uncertainty awareness and explanation quality. The benchmark comprises 13,116 event-centered summaries spanning diverse event types and trend patterns. A systematic evaluation of LLMs reveals fragmented strengths across tasks, substantial performance degradation under counterfactual perturbations, and distinct failure modes in plausibility calibration and uncertainty handling.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: Resources and Evaluation, Language Modeling
Contribution Types: Data resources, Data analysis
Languages Studied: English
Submission Number: 9814