Exposing and Addressing Cross-Task Inconsistency in Unified Vision-Language Models

Published: 08 Mar 2024, Last Modified: 08 Mar 2024. Accepted by TMLR.
Abstract: As general-purpose vision models become increasingly effective across a wide range of tasks, it is imperative that they be consistent across the tasks they support. Inconsistent AI models are considered brittle and untrustworthy by human users and are harder to incorporate into larger systems that depend on their outputs. Measuring consistency across highly heterogeneous tasks, whose outputs may lie in different modalities, is challenging because it is difficult to determine whether the predictions agree with one another. As a solution, we introduce a benchmark dataset, CocoCON, in which we create contrast sets by modifying test instances for multiple tasks in small but semantically meaningful ways that change the gold label, and we define metrics that measure whether a model is consistent based on how it ranks the original and perturbed instances across tasks. We find that state-of-the-art vision-language models exhibit a surprisingly high degree of inconsistent behavior across tasks, especially for more heterogeneous tasks. To alleviate this issue, we propose a rank correlation-based auxiliary training objective, computed over large automatically created cross-task contrast sets, that improves the multi-task consistency of large unified models while retaining their original accuracy on downstream tasks.
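The ranking-based consistency check and the rank correlation-based auxiliary objective described in the abstract can be illustrated with a short sketch. The code below is a hypothetical illustration, not the paper's implementation: it assumes the model exposes a per-task log-likelihood for each (input, gold output) pair, and it replaces the exact rank-correlation objective with a simpler differentiable margin surrogate. All function names here are invented for this example.

```python
import torch

def is_cross_task_consistent(ll_orig: dict, ll_pert: dict) -> bool:
    """Hypothetical evaluation-time check: given per-task log-likelihoods of
    the gold output on the original instance (ll_orig[task]) and on the
    perturbed contrast instance (ll_pert[task]), the model is cross-task
    consistent on this contrast set only if every task ranks the original
    above the perturbation."""
    return all(ll_orig[task] > ll_pert[task] for task in ll_orig)

def soft_rank_consistency_loss(ll_orig: torch.Tensor,
                               ll_pert: torch.Tensor,
                               margin: float = 0.1) -> torch.Tensor:
    """Margin-based surrogate for a rank-correlation training objective:
    each tensor element holds one task's log-likelihood, and any task that
    fails to rank the perturbed instance below the original by at least
    `margin` incurs a penalty. A differentiable stand-in, not the paper's
    exact formulation."""
    return torch.relu(margin - (ll_orig - ll_pert)).mean()
```

Under these assumptions, `is_cross_task_consistent` would be averaged over all contrast sets to produce a consistency score at evaluation time, while `soft_rank_consistency_loss` would be added to the standard task losses during training, scaled by a weighting hyperparameter.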
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: The following changes have been made in response to reviewers' comments since the last revision.

* **Note on the use of MLLMs**: Based on reviewer comments, we have added a note clarifying that our work precedes the availability of open-source MLLMs, and we provide examples in the Appendix of how MLLMs can be leveraged to create contrast sets.
* **Additional discussion in Limitations**: We have expanded the Limitations section with additional discussion on extending CocoCON to different tasks and on aggregating likelihood scores from semantically similar outputs.
Assigned Action Editor: ~Marcus_Rohrbach1
Submission Number: 1580