IQA-Octopus: Unified Multi-Granularity Image Quality Assessment with Reasoning, Grounding and Referring
Keywords: image quality assessment, fine-grained, zero-shot grounding
Abstract: We present IQA-Octopus, the first image quality assessment (IQA) framework that unifies reasoning, grounding, and referring. Built upon large multi-modality models (LMMs), IQA-Octopus performs multi-functional quality analysis by enabling multi-granularity perception. Existing LMM-based IQA models support only a subset of perception dimensions, e.g., quality description and question answering (i.e., reasoning) or pixel-wise grounding in isolation, due to the lack of (i) unified IQA datasets with multi-functional annotations and (ii) an optimization paradigm suited to multi-granularity perception. To overcome this, we build the first multi-functional dataset incorporating global/local reasoning, pixel-wise grounding, and region-wise referring tasks, together with a carefully designed automatic multi-granularity dataset extension method, which supports the optimization of IQA-Octopus. To facilitate multi-functional perception, we introduce a conflict-free two-stage optimization strategy that progressively transfers multi-granularity textual understanding to pixel-wise perception: (i) the first stage injects textual multi-granularity perception into IQA-Octopus through joint optimization of multiple text-based perception tasks, including reasoning and referring, and (ii) the second stage introduces a novel text-to-point strategy that implicitly warps text logits to pixel-wise grounding coordinates in a zero-shot manner. With these two contributions, IQA-Octopus unifies multi-functional, multi-granularity, and explainable image quality assessment. Our model achieves comparable or state-of-the-art performance across multiple benchmarks with limited training data, demonstrating strong multi-granularity understanding and remarkable versatility. The code and dataset will be released after acceptance.
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 7731