WeatherArchive-Bench: Benchmarking Retrieval-Augmented Reasoning for Historical Weather Archives

18 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: domain-specific benchmark, large language models
TL;DR: A large-scale benchmark for evaluating retrieval-augmented reasoning on historical weather archives to assess societal vulnerability and resilience.
Abstract: Historical news segments on weather events are enduring primary-source records that offer rich, untapped narratives of how societies have experienced and responded to extreme weather. These qualitative accounts provide insights into societal vulnerability and resilience that are largely absent from meteorological records, making them valuable for climate scientists studying societal responses. However, their vast scale, noisy digitized quality, and archaic language make them difficult to transform into structured knowledge for climate research. To address this challenge, we introduce WeatherArchive-Bench, the first benchmark for evaluating end-to-end retrieval-augmented generation (RAG) systems on historical weather archives. WeatherArchive-Bench comprises two tasks: WeatherArchive-Retrieval, which measures a system’s ability to locate historically relevant passages from a corpus of over one million archival news segments, and WeatherArchive-Assessment, which evaluates whether Large Language Models (LLMs) can classify societal vulnerability and resilience indicators from extreme weather narratives and answer queries using the retrieved archival segments. Extensive experiments across sparse, dense, and re-ranking retrievers, as well as a diverse set of LLMs, reveal that dense retrievers often fail on historical terminology, while LLMs frequently misinterpret vulnerability and resilience concepts. These findings highlight key limitations in reasoning about complex societal indicators and provide insights for designing more robust climate-focused RAG systems grounded in archival contexts. The constructed dataset and evaluation framework are publicly available at: https://anonymous.4open.science/r/WeatherArchive-Bench/
Primary Area: datasets and benchmarks
Submission Number: 14089