Keywords: Large Language Models, Benchmark, Meteorological Reasoning
Abstract: Recent advances in data-driven weather modeling have enabled accurate numerical forecasts, whose outputs are often summarized as natural-language descriptions for interpretation and decision making. While large language models (LLMs) show promise in scientific reasoning, their ability to reason over text-only meteorological summaries, subject to physical constraints, incomplete evidence, and inherent uncertainty, remains poorly understood. Existing benchmarks rely primarily on multimodal inputs or fact verification, leaving this gap unaddressed. We introduce WeatherBench-R, a large-scale text-only benchmark for meteorological reasoning over U.S. weather events, constructed from ERA5 reanalysis summaries aligned with real-world NOAA storm records. WeatherBench-R decomposes reasoning into three complementary tasks: physical plausibility recognition from aggregate trends, consistency verification under partial and underspecified evidence, and counterfactual evidence reasoning that probes uncertainty awareness and explanation quality. The benchmark comprises 13,116 event-centered summaries spanning diverse event types and trend patterns. A systematic evaluation of LLMs reveals fragmented strengths across tasks, substantial performance degradation under counterfactual perturbations, and distinct failure modes in plausibility calibration and uncertainty handling.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: Resources and Evaluation, Language Modeling
Contribution Types: Data resources, Data analysis
Languages Studied: English
Submission Number: 9814