Harnessing Multimodal LLMs for Attribute Specific Image Retrieval

ACL ARR 2026 January Submission6191 Authors

05 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: Multimodal Large Language Models, Image Retrieval
Abstract: Despite extensive research on visual style understanding, accurately capturing and comparing artistic style remains challenging. Traditional retrieval methods rely on global embeddings that conflate multiple stylistic dimensions, such as brushstrokes, color palette, lighting, and composition, into a single vector, limiting both interpretability and fine-grained retrieval. We present a training-free framework that leverages Multimodal Large Language Models (MLLMs) to decompose images into multiple attribute-specific embeddings, providing a disentangled style representation. Independent representations for each stylistic attribute are obtained by extracting embeddings from intermediate transformer layers, conditioned on carefully designed, dynamically generated system and user prompts. We demonstrate that intermediate MLLM layers carry richer visual information than the final layer, which tends to suppress visual features in order to generate textual outputs. To enhance representational quality, we introduce dynamic prefixing, an automated approach that generates task-adaptive system and user prompts that outperform manual prompt design. For retrieval, we propose a layer-wise hard-voting fusion mechanism that aggregates evidence across multiple transformer layers without learnable parameters. Extensive experiments on WikiArt and DomainNet demonstrate that our training-free approach achieves competitive or superior performance compared to both generic vision encoders and style-specific models such as CSD (Somepalli et al., 2024), while providing attribute-level interpretability. Human preference studies further validate that our attribute-based retrieval aligns more closely with human perception than global representations, being preferred \ours of the time over images retrieved by contemporary methods.
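The layer-wise hard-voting fusion described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes per-layer query and gallery embeddings have already been extracted from the MLLM for one attribute, and the function name, array shapes, and top-1-vote-per-layer rule are illustrative assumptions.

```python
import numpy as np

def layerwise_hard_vote(query_embs, gallery_embs, top_k=5):
    """Aggregate retrieval evidence across transformer layers by hard voting.

    query_embs:   (L, D)    -- one query embedding per layer
    gallery_embs: (L, N, D) -- per-layer embeddings for N gallery images

    Each layer casts one vote for its most cosine-similar gallery image;
    images are ranked by vote count. No learnable parameters are involved.
    """
    L, N, _ = gallery_embs.shape
    votes = np.zeros(N, dtype=int)
    for layer in range(L):
        q = query_embs[layer] / np.linalg.norm(query_embs[layer])
        g = gallery_embs[layer] / np.linalg.norm(
            gallery_embs[layer], axis=1, keepdims=True
        )
        sims = g @ q                 # cosine similarity to each gallery image
        votes[np.argmax(sims)] += 1  # this layer's top-1 vote
    return np.argsort(-votes)[:top_k]  # indices ranked by votes, best first
```

Because each layer contributes a single discrete vote, one outlier layer cannot dominate the ranking, which is one plausible reason a parameter-free fusion of this kind can be robust.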
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: Artistic style representation, attribute-specific embeddings, Multimodal Large Language Models, training-free image retrieval, dynamic prefixing
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 6191