MUIQD: Benchmarking and Facilitating Multimodal LLMs for Underwater Image Quality Perception

17 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Underwater image quality perception, image dataset, multimodal large language models, fine-tuning
Abstract: Multimodal Large Language Models (MLLMs) have recently demonstrated great potential in cross-modal perception, reasoning, and generation. However, their effectiveness in underwater image quality perception, a fundamental requirement for efficient underwater vision tasks, remains largely unexplored. To address this gap, we propose MUIQD, the first large-scale dataset for benchmarking and facilitating the underwater image quality perception abilities of MLLMs. MUIQD is composed of two complementary subsets, MUIQD-Description and MUIQD-VQA, which target the quality perception and interaction abilities of MLLMs, respectively. Specifically, MUIQD-Description comprises 18,634 underwater images covering diverse real-world underwater scenes and typical quality degradations such as color cast, haze, blurring, and low contrast. Each image is annotated through rigorous subjective evaluation with detailed descriptions of quality-related attributes and an overall quality level derived from those attributes. To further improve the interactive quality perception capability of MLLMs, we build the visual-question-answering subset MUIQD-VQA, which contains more than 93K question-answer pairs generated by DeepSeek from the MUIQD-Description annotations. Experimental results demonstrate that the proposed MUIQD dataset significantly improves the underwater image quality perception abilities of MLLMs, strongly supporting the claim that MLLMs can be adapted for underwater image quality perception.
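For concreteness, an annotation record in a dataset structured as described above might look like the following. This is a minimal, hypothetical sketch of one plausible data layout, not the authors' released schema; all field names (image_path, attribute_description, quality_level, qa_pairs) and example values are illustrative assumptions.

```python
# Hypothetical sketch of a MUIQD-style annotation record.
# Field names and values are illustrative assumptions, not the released schema.
from dataclasses import dataclass, field
from typing import List

@dataclass
class QAPair:
    question: str   # a quality-related question (MUIQD-VQA)
    answer: str     # the answer derived from the image's quality description

@dataclass
class MUIQDRecord:
    image_path: str                 # path to the underwater image
    attribute_description: str      # description of quality-related attributes
    quality_level: str              # overall quality level derived from the attributes
    qa_pairs: List[QAPair] = field(default_factory=list)  # MUIQD-VQA pairs

# Example usage with placeholder values:
record = MUIQDRecord(
    image_path="images/reef_0001.jpg",
    attribute_description=(
        "Strong greenish color cast with mild haze; "
        "edges remain sharp and contrast is moderate."
    ),
    quality_level="fair",
    qa_pairs=[
        QAPair(
            question="Does the image exhibit a color cast?",
            answer="Yes, a noticeable greenish color cast.",
        )
    ],
)
```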
Supplementary Material: pdf
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 8832