Keywords: Image complexity assessment, vision-language alignment, feature entropy, image quality assessment
TL;DR: We propose an efficient, text-guided method for image complexity assessment. It achieves state-of-the-art results on IC9600 and competitive performance on NR-IQA.
Abstract: Accurately assessing image complexity (IC) is essential for many vision tasks, yet existing approaches rely almost exclusively on visual features and therefore fail to capture the high-level semantics that humans often use when judging complexity. We introduce a multimodal perspective for IC modeling by integrating visual representations with caption-derived textual semantics. This integration enriches the representational space and provides complementary structural cues that are difficult to infer from vision alone. From an information-theoretic and representational viewpoint, we offer an idealized analysis suggesting how semantic guidance can regularize the hypothesis space and support more stable generalization. We present D2S (Describe-to-Score), a framework that generates natural-language descriptions using a pretrained vision–language model and aligns the visual encoder with textual structure through feature alignment and entropy distribution alignment. These mechanisms encourage the visual backbone to internalize semantic regularities during training. Importantly, D2S uses text only during training and keeps a vision-only inference pipeline, so inference incurs no additional computational overhead. Experiments show that D2S achieves state-of-the-art performance on the IC9600 benchmark and remains competitive on no-reference image quality assessment (NR-IQA) tasks. Additional studies demonstrate robustness across captioners and prompt designs, as well as improved semantic transferability on downstream probing tasks, highlighting the effectiveness of multimodal guidance for complexity-related modeling.
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 1036