ThermEval: A Structured Benchmark for Zero-Shot Evaluation of Vision-Language Models on Thermal Imagery

ICLR 2026 Conference Submission 20381 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: VLMs, Thermal Imagery, Zero-Shot Evaluation
TL;DR: We introduce ThermEval-B, a 50k-task benchmark for evaluating 14 vision-language models on thermal imagery. While models handle modality recognition, they struggle with temperature reasoning and estimation.
Abstract: Vision-Language Models (VLMs) achieve strong results on RGB imagery, yet their ability to reason over thermal data remains largely unexplored. Thermal imaging is critical in domains where RGB fails, such as surveillance, rescue, and medical diagnostics, but existing benchmarks do not capture its unique properties. We introduce \textbf{ThermEval-B}, a benchmark of 50,000 visual question–answer pairs for evaluating zero-shot performance of open-source VLMs on thermal imagery across tasks including modality identification, human counting, temperature reasoning, and temperature estimation. ThermEval-B integrates public datasets such as LLVIP and FLIR-ADAS with our new dataset \textbf{ThermEval-D}, the first to provide per-pixel temperature annotations across diverse environments. Our evaluation reveals that while VLMs reliably distinguish raw thermal from RGB images, their performance collapses on temperature reasoning and estimation, and modality recognition becomes unreliable under false-color (colormap) renderings. Models frequently default to language priors or fixed outputs, exhibit systematic biases, or refuse to answer when uncertain. These recurring failure modes highlight thermal reasoning as an open challenge and motivate benchmarks like ThermEval-B to drive progress beyond RGB-centric evaluation.
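As a rough illustration of the kind of evaluation the abstract describes, below is a minimal, hypothetical Python sketch of scoring a single zero-shot temperature-estimation item against a per-pixel temperature map. The item schema (the `question`, `region`, and `temperature_map` fields), file paths, and scoring rule are assumptions made for illustration only; the submission does not specify ThermEval-B's actual data format or evaluation harness.

```python
import json
import numpy as np

def load_item(path):
    """Load one hypothetical ThermEval-B item: a question, a queried image
    region, and a per-pixel temperature map in degrees Celsius (assumed
    stored as a NumPy array referenced from the item's JSON record)."""
    with open(path) as f:
        item = json.load(f)
    temps = np.load(item["temperature_map"])
    return item, temps

def score_estimation(pred_celsius, temps, region):
    """Absolute error between a model's temperature estimate and the mean
    ground-truth temperature inside the queried region, where `region` is
    an assumed (row_start, row_end, col_start, col_end) tuple."""
    r0, r1, c0, c1 = region
    gt = float(temps[r0:r1, c0:c1].mean())
    return abs(pred_celsius - gt)

# Example: score one zero-shot prediction (all values are illustrative).
item, temps = load_item("thermeval_b/item_00001.json")
error = score_estimation(pred_celsius=36.4, temps=temps, region=item["region"])
print(f"Q: {item['question']}\nAbsolute error: {error:.1f} °C")
```

Under this sketch, the reported collapse on temperature estimation would correspond to large absolute errors (or refusals) on items of this type, while modality-identification items could be scored as simple exact-match accuracy.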
Primary Area: datasets and benchmarks
Submission Number: 20381