Blind Omnidirectional Image Quality Assessment: Embracing the Magic Power of Multimodal Large Language Models
Abstract: Blind omnidirectional image quality assessment (BOIQA) remains a challenging problem in image quality assessment because of the geometric characteristics of omnidirectional images (OIs) and the complicated human behavior in immersive experiences. To address this problem, we turn to Multimodal Large Language Models (MLLMs), which have shown great success in both computer vision and natural language processing but have not yet been investigated for BOIQA. Specifically, we first use MLLMs to generate coarse and detailed quality-aware descriptions of OIs, which carry richer information than simple quantitative scalars. Using the generated text descriptions and their paired images, we fine-tune a top-performing model (i.e., Long-CLIP) under the general contrastive learning framework to mine robust and representative embeddings in the vision-language space. We then design a family of MultiModal BOIQA (\(\textrm{MMBO}\)) models based on the embeddings in this vision-language space and comprehensively investigate the effectiveness of text features, visual features, and their interaction in capturing the quality degradation of OIs. Experimental results on two large-scale OIQA databases demonstrate the superior performance of the \(\textrm{MMBO}\) models: the best-performing \(\textrm{MMBO}\) model outperforms the second-ranked method by 9.4\(\%\) and 6.6\(\%\) in terms of PLCC on the two databases, respectively, and shows promising generalizability in cross-database validation and the gMAD competition.
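To make two terms from the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes the "general contrastive learning framework" resembles the symmetric image-text loss commonly used with CLIP-style models, and it computes PLCC as a plain Pearson correlation between predicted and subjective scores. The function names, the temperature value of 0.07, and the use of NumPy are illustrative assumptions; the abstract does not specify them. IQA evaluations often apply a nonlinear mapping to predictions before computing PLCC, which this sketch omits.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)


def log_softmax(logits):
    """Numerically stable row-wise log-softmax."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss (assumed CLIP-style, not taken from the paper).

    Row i of image_emb and row i of text_emb are treated as a matched pair;
    every other row in the batch serves as a negative.
    """
    img = l2_normalize(np.asarray(image_emb, dtype=np.float64))
    txt = l2_normalize(np.asarray(text_emb, dtype=np.float64))
    logits = img @ txt.T / temperature           # (batch, batch) similarities
    diag = np.arange(logits.shape[0])            # indices of the matched pairs
    loss_i2t = -log_softmax(logits)[diag, diag].mean()    # image -> text
    loss_t2i = -log_softmax(logits.T)[diag, diag].mean()  # text -> image
    return 0.5 * (loss_i2t + loss_t2i)


def plcc(predicted, subjective):
    """Pearson linear correlation coefficient between two score vectors."""
    p = np.asarray(predicted, dtype=np.float64)
    s = np.asarray(subjective, dtype=np.float64)
    p = p - p.mean()
    s = s - s.mean()
    return float((p @ s) / np.sqrt((p @ p) * (s @ s)))
```

In this sketch, a batch whose image and text embeddings are correctly paired yields a lower loss than the same batch with the text rows shuffled, which is the signal a contrastive fine-tuning step would exploit.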
External IDs: dblp:journals/ijcv/YanZCCLTF26