Towards LVLM-Aided Alignment of Task-Specific Vision Models

Published: 06 Mar 2025, Last Modified: 05 May 2025, ICLR 2025 Bi-Align Workshop Poster, License: CC BY 4.0
Keywords: Human-Centered AI, Computer Vision, Human-AI Alignment, Large Language Model
TL;DR: LVLM-Aided Visual Alignment (LVLM-VA) aligns small task-specific vision models with human domain knowledge, improving model performance and reducing the need for extensive fine-grained feedback by leveraging a Large Vision Language Model.
Abstract: In high-stakes domains, small task-specific models are crucial due to their low computational requirements and the availability of numerous methods to explain their results. However, these explanations often reveal that the models do not align well with human domain knowledge, relying instead on spurious correlations. This can result in brittle behaviour once the model is deployed in the real world. To address this issue, we introduce a novel and efficient method for aligning small task-specific vision models with human domain knowledge by leveraging the generalization capabilities of a Large Vision Language Model (LVLM). Our LVLM-Aided Visual Alignment (LVLM-VA) method provides a bidirectional interface that translates model behavior into natural language and human class-level instructions into image-level critiques, enabling effective interaction between domain experts and the model. We show that our method improves model performance whilst drastically reducing the need for extensive fine-grained feedback.
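The abstract describes a bidirectional loop (model behavior → natural language for the expert; class-level expert instructions → image-level critiques for the model). The sketch below illustrates one possible shape of that loop; it is not the paper's implementation, and every function name, prompt, and placeholder return value (explain_model, lvlm, fine_tune, class_instructions) is an assumption introduced for illustration.

```python
"""Minimal, hypothetical sketch of an LVLM-VA-style alignment round.
All names and prompts are illustrative assumptions, not the paper's actual API."""

def explain_model(model, image):
    # Placeholder for a post-hoc explanation of the small vision model
    # (e.g., a saliency map produced by an attribution method).
    return "saliency map emphasising the image background"

def lvlm(prompt: str) -> str:
    # Placeholder for a call to a Large Vision Language Model.
    return f"[LVLM response to: {prompt[:60]}...]"

def fine_tune(model, images, critiques):
    # Placeholder: use image-level critiques as an additional training signal.
    return model

def align_one_round(model, images, class_instructions):
    """Model behaviour -> natural language -> expert rule -> image-level critique."""
    critiques = []
    for image in images:
        # 1) Model -> human: translate the model's behaviour into natural language.
        behaviour = lvlm(
            f"Summarise what this explanation attends to: {explain_model(model, image)}"
        )
        # 2) Human -> model: a class-level instruction becomes an image-level critique.
        rule = class_instructions.get(
            "predicted_class", "focus on the object, not the background"
        )
        critiques.append(
            lvlm(f"Instruction: {rule}. Observed behaviour: {behaviour}. What should change?")
        )
    # 3) Feed the critiques back into the small task-specific model.
    return fine_tune(model, images, critiques)
```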
Submission Type: Short Paper (4 Pages)
Archival Option: This is a non-archival submission
Presentation Venue Preference: ICLR 2025
Submission Number: 82