Track: Track 2: Dataset Proposal Competition
TL;DR: AFDBench is the first benchmark for evaluating AI-generated meteorological text. It pairs 7,732 expert-written NWS forecast discussions with AI weather forecast data and defines three complementary evaluation metrics.
Abstract: Area Forecast Discussions (AFDs) are the highest-stakes scientific text produced by the U.S. National Weather Service: errors carry life-safety consequences. Despite rapid progress in AI weather prediction, no benchmark exists for evaluating whether language models can generate professional meteorological text. We introduce AFDBench, a benchmark comprising 7,732 expert-written AFDs from 13 NWS offices spanning 4 months and 7 U.S. climate regions, each paired with structured AI weather forecast data from Google’s WeatherNext 2. AFDBench defines three complementary evaluation metrics: Met-Align (numerical accuracy against the human reference), Style-Align (adherence to NWS professional vocabulary), and Input-Grounding (fidelity to the source weather data). We establish baselines with three open-source LLMs (7–8B parameters), finding that all achieve ∼13% Met-Align, ∼0.33 Style-Align, and ∼0.88 Input-Grounding in zero-shot evaluation, which indicates a significant gap between LLM capabilities and professional meteorological writing. We release AFDBench as a public benchmark to accelerate research on scientific text generation where numerical precision and domain expertise are essential.
Keywords: Benchmark, Meteorological reasoning, Scientific text generation, NWS forecast discussions, AI weather prediction, Evaluation, Metrics
Submission Number: 221