Keywords: Multimodal Climate Benchmark, Scientific Foundation Models, Scientific Question Answering, Large Language Models, Automated QA Generation
Abstract: Climate change research increasingly requires AI systems that can operate across multiple modalities, including natural language, dynamic visual content, and scientific figures. Yet existing climate QA benchmarks remain limited: they include relatively small sets of questions, rely almost exclusively on text, and evaluate only a narrow range of models. As a result, they fail to reflect the multimodal and large-scale nature of climate knowledge. In this work, we introduce MMClima, a multimodal framework for climate question answering. MMClima contains over 104k expert-validated question–answer pairs spanning text, video transcriptions, and figures, and covers five core climate science domains. The dataset is constructed through automated claim extraction combined with human-in-the-loop validation, ensuring both scale and reliability. Beyond serving as a dataset, MMClima provides a reusable framework for extending QA resources across modalities. Using MMClima, we evaluate state-of-the-art multimodal language models on tasks spanning factual recall, visual interpretation, and cross-modal synthesis. We further fine-tune a model on the textual split, yielding mmclima-70b-txt, a domain-adapted baseline that surpasses the evaluated open- and closed-source models. Finally, we release the dataset, evaluation pipeline, fine-tuned model weights, and data creation framework as open resources, establishing a first step toward standardized multimodal evaluation in climate science.
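The abstract describes the construction pipeline only at a high level. As an illustration of what "automated claim extraction combined with human-in-the-loop validation" could look like in practice, here is a minimal sketch, assuming a generic chat-completion backend; the QAPair class, the call_llm stub, the prompts, and all helper names are hypothetical and are not the authors' released code.

```python
# Hypothetical sketch of an MMClima-style QA generation stage: extract atomic
# claims from a source document with an LLM, turn each claim into a QA pair,
# and queue the candidates for human validation. Everything here is
# illustrative; swap call_llm for a real API client.
from dataclasses import dataclass

@dataclass
class QAPair:
    question: str
    answer: str
    source_claim: str
    validated: bool = False  # flipped by a human reviewer, never automatically

def call_llm(prompt: str) -> str:
    """Placeholder for any chat-completion API; plug in a real client here."""
    raise NotImplementedError

def extract_claims(document: str) -> list[str]:
    # Ask the model for one factual claim per line, then split the response.
    response = call_llm(
        "List the atomic factual claims in the following climate text, "
        "one per line:\n\n" + document
    )
    return [ln.strip("- ").strip() for ln in response.splitlines() if ln.strip()]

def claim_to_qa(claim: str) -> QAPair:
    # Rewrite a declarative claim as a question whose answer is the claim itself.
    question = call_llm(
        "Write a single exam-style question whose correct answer is exactly "
        "this claim:\n\n" + claim
    )
    return QAPair(question=question.strip(), answer=claim, source_claim=claim)

def build_candidates(documents: list[str]) -> list[QAPair]:
    # Candidate pairs enter the dataset only after human-in-the-loop review.
    return [claim_to_qa(c) for doc in documents for c in extract_claims(doc)]
```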
Primary Area: datasets and benchmarks
Submission Number: 11109