Keywords: Human-Centered AI, Computer Vision, Human-AI Alignment, Large Language Model
TL;DR: LVLM-Aided Visual Alignment (LVLM-VA) aligns small task-specific vision models with human domain knowledge, improving model performance and reducing the need for extensive fine-grained feedback by leveraging a Large Vision Language Model.
Abstract: In high-stakes domains, small task-specific models are crucial due to their low computational requirements and the availability of numerous methods to explain their results. However, these explanations often reveal that the models do not align well with human domain knowledge, relying instead on spurious correlations. This can result in brittle behaviour once the models are deployed in the real world. To address this issue, we introduce a novel and efficient method for aligning small task-specific vision models with human domain knowledge by leveraging the generalization capabilities of a Large Vision Language Model (LVLM). Our LVLM-Aided Visual Alignment (LVLM-VA) method provides a bidirectional interface that translates model behavior into natural language and human class-level instructions into image-level critiques, enabling effective interaction between domain experts and the model. We show that our method improves model performance whilst drastically reducing the need for extensive fine-grained feedback.
Submission Type: Short Paper (4 Pages)
Archival Option: This is a non-archival submission
Presentation Venue Preference: ICLR 2025
Submission Number: 82