Multi-Objective Task-Aware Predictor for Image-Text Alignment

03 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: multimodality, benchmarks, datasets
TL;DR: We introduce an effective LVLM-based predictor aligned with human judgment on image-text pairs and image-text datasets.
Abstract: Evaluating image-text alignment while reflecting human preferences across multiple aspects is a significant challenge for the development of reliable vision-language applications. It becomes especially crucial in real-world scenarios where multiple valid descriptions exist depending on context or user needs. However, research progress is hindered by the lack of comprehensive benchmarks and by existing evaluation predictors that lack at least one of these key properties: (1) $\textit{Alignment with human judgments}$, (2) $\textit{Long-sequence processing}$, (3) $\textit{Inference efficiency}$, and (4) $\textit{Applicability to multi-objective scoring}$. To address these challenges, we propose a plug-and-play architecture for building a robust predictor, $\texttt{MULTI-TAP}$ ($\textbf{Multi}$-Objective $\textbf{T}$ask-$\textbf{A}$ware $\textbf{P}$redictor), capable of both multi- and single-objective scoring. $\texttt{MULTI-TAP}$ can produce a single overall score using a reward head built on top of a large vision-language model (LVLM). We show that $\texttt{MULTI-TAP}$ is robust across different LVLM architectures, achieving significantly higher performance than existing metrics ($\textit{e.g.}$, +42.3 Kendall's $\tau_{c}$ compared to IXCREW-S on FlickrExp) and even performing on par with the GPT-4o-based predictor, G-VEval, despite its smaller size (7$-$8B). By training a lightweight ridge regression layer on the frozen hidden states of a pre-trained LVLM, $\texttt{MULTI-TAP}$ can produce fine-grained scores for multiple human-interpretable objectives. $\texttt{MULTI-TAP}$ surpasses VisionREWARD, a high-performing multi-objective reward model, in both performance and efficiency on multi-objective benchmarks and on our newly released text-image-to-text dataset, $\texttt{EYE4ALL}$. Our new dataset, consisting of chosen/rejected human preferences ($\texttt{EYE4ALLPref}$) and human-annotated fine-grained scores across seven dimensions ($\texttt{EYE4ALLMulti}$), can serve as a foundation for developing more accessible AI systems by capturing the underlying preferences of users, including blind and low-vision (BLV) individuals. Our contributions can guide future research on developing human-aligned predictors.
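To make the multi-objective scoring idea concrete, the following is a minimal sketch (not the authors' released code) of fitting a lightweight ridge regression head on frozen LVLM hidden states to predict scores for several human-interpretable objectives. The feature dimension, dataset size, and random placeholder data are assumptions for illustration; the seven output dimensions mirror the seven annotated dimensions of $\texttt{EYE4ALLMulti}$.

```python
# Sketch: ridge regression head over frozen LVLM hidden states (illustrative only).
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Placeholder for pooled last-layer hidden states of a frozen LVLM,
# one feature vector per image-text pair (dimension is an assumption).
n_pairs, hidden_dim, n_objectives = 512, 4096, 7
H = rng.normal(size=(n_pairs, hidden_dim))               # frozen features, no gradients
Y = rng.uniform(0.0, 5.0, size=(n_pairs, n_objectives))  # human-annotated scores (synthetic here)

# Only this lightweight layer is trained; one head predicts all objectives jointly.
head = Ridge(alpha=1.0)
head.fit(H, Y)

# Fine-grained multi-objective scores for new image-text pairs.
scores = head.predict(H[:4])
print(scores.shape)  # (4, 7): one score per objective per pair
```

Because the LVLM backbone stays frozen and only a closed-form ridge solution is fit, this kind of head adds negligible training and inference cost compared to running the LVLM itself, which is consistent with the efficiency property emphasized above.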
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 1814