FGAIF: Aligning Large Vision-Language Models with Fine-grained AI Feedback

Published: 13 May 2025 · Last Modified: 13 May 2025 · Accepted by TMLR · License: CC BY 4.0
Abstract: Large Vision-Language Models (LVLMs) have demonstrated proficiency in tackling a variety of visual-language tasks. However, current LVLMs suffer from misalignment between the text and image modalities, which causes three kinds of hallucination problems, i.e., object existence, object attribute, and object relationship. To tackle this issue, existing methods mainly utilize Reinforcement Learning (RL) to align modalities in LVLMs. However, they still suffer from three main limitations: (1) general feedback cannot indicate the hallucination type contained in the response; (2) sparse rewards only give a sequence-level reward for the whole response; and (3) annotation is time-consuming and labor-intensive. To handle these limitations, we propose an innovative method to align modalities in LVLMs through Fine-Grained Artificial Intelligence Feedback (FGAIF), which mainly consists of three steps: AI-based Feedback Collection, Fine-grained Reward Model Training, and Reinforcement Learning with Fine-grained Reward. Finally, a novel fine-grained feedback module is integrated into the Proximal Policy Optimization (PPO) algorithm. Extensive experiments are conducted on hallucination and general benchmarks, demonstrating the superior performance of our proposed method. Notably, compared with previous models trained with RL-based aligning methods, our proposed method is effective even with fewer parameters.
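As a rough illustration of the final step described in the abstract, the sketch below shows one way segment-level, hallucination-type-specific scores could be turned into a dense reward signal that a PPO-style update can consume. This is a minimal sketch under assumed conventions, not the paper's released implementation; the function names (`combine_type_scores`, `segment_rewards_to_token_rewards`), the 0/1 scoring convention, and the equal-weight combination are all illustrative assumptions.

```python
# Minimal sketch (not the authors' implementation) of converting per-type,
# per-sub-sentence scores into a dense token-level reward for PPO-style RL.

from typing import Dict, List, Sequence


def combine_type_scores(obj: float, attr: float, rel: float,
                        weights: Sequence[float] = (1.0, 1.0, 1.0)) -> float:
    """Combine per-type scores (object existence, attribute, relationship)
    into one scalar reward for a sub-sentence. Here 1.0 means 'no
    hallucination detected' and 0.0 means 'hallucinated' (assumed convention)."""
    w_obj, w_attr, w_rel = weights
    return w_obj * obj + w_attr * attr + w_rel * rel


def segment_rewards_to_token_rewards(segment_token_lengths: List[int],
                                     segment_rewards: List[float]) -> List[float]:
    """Assign each sub-sentence's reward to its final token so the RL
    objective receives a dense (per-segment) signal instead of a single
    sequence-level reward; all other tokens get zero reward."""
    token_rewards: List[float] = []
    for length, reward in zip(segment_token_lengths, segment_rewards):
        token_rewards.extend([0.0] * (length - 1) + [reward])
    return token_rewards


if __name__ == "__main__":
    # Toy response split into three sub-sentences of 5, 7, and 4 tokens.
    lengths = [5, 7, 4]
    # Hypothetical per-type scores from three fine-grained reward models.
    per_type = [(1.0, 1.0, 1.0),   # no hallucination
                (0.0, 1.0, 1.0),   # object-existence hallucination
                (1.0, 0.0, 1.0)]   # attribute hallucination
    seg_rewards = [combine_type_scores(*scores) for scores in per_type]
    print(segment_rewards_to_token_rewards(lengths, seg_rewards))
```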
Submission Length: Regular submission (no more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=NodjG4psr2
Changes Since Last Submission:
> **1: Update all code in the provided GitHub repository.**
- We have released our code via an anonymous link: https://anonymous.4open.science/r/FGAIF-C208/README.md.
> **2: Add more exact details explaining each subroutine of the algorithm.**
- We added algorithm details and pseudocode in Appendix D.
> **3: Discuss the relative importance of different hallucinations.**
- In our ablation study, to investigate the relative importance of the different hallucination types, we devised three variant methods: (1) w/o-Obj, (2) w/o-Att, and (3) w/o-Rel. The experimental results in Table 4 show that w/o-Obj performs worse than w/o-Att and w/o-Rel, indicating that object existence hallucination is the most important type. A likely reason is that there are more atomic facts about object existence than about the other hallucination types. w/o-Att performs similarly to w/o-Rel, suggesting that attribute and relationship hallucinations are of comparable importance. We added this discussion to Section 5.3.
> **4: Report the time to fine-tune the model using the proposed method and compare it with naïve fine-tuning.**
- We conducted experiments on a server with 4 NVIDIA A100 GPUs. Fine-tuning with the proposed method takes about 36 hours, compared with about 31 hours for naïve fine-tuning.
> **5: Empirical evidence of performance on different numbers of segments.**
- We added this analysis to Section 5.3. To evaluate the robustness of our reward model across responses with varying sub-segment counts, we constructed an additional test dataset. The number of sub-segments in the training set ranged over [4, 16], while the original test set (500 samples) covered [6, 15]. To assess performance on longer responses, we collected 200 additional samples with sub-segment counts in [15, 20]. The reward model achieves 80.4% accuracy on the original test set and 79.2% on the newly constructed set. The comparable performance across response lengths suggests that the number of sub-segments has little impact on accuracy, demonstrating the robustness of our model (a sketch of this per-bin evaluation is given after this list).
> **6: The influence of incorrect hallucination labels.**
- ChatGPT and the LLaVA models can sometimes give incorrect answers, and such cases may teach the reward model the wrong signal and in turn mislead the policy model. However, the experimental results in our paper show that our method still achieves superior performance even in the presence of these errors, so the influence is not fatal. We also added error cases in Appendix C.
> **7: Empirical evidence of performance on different objects.**
- In this paper, we focus on three types of hallucination: object existence, object attribute, and object relationship. We detect a fixed set of hallucination types, but each type can cover an unlimited number of hallucination labels, so on the test set our method can also detect these three kinds of hallucination for new labels.
- During training, a fixed set of hallucination labels is introduced, so a key question is whether the method can generalize to objects not present in the training set. To assess sensitivity to different object types, we followed prior works and constructed a dedicated out-of-distribution test set based on the Foggy dataset. Specifically, we sampled 200 images from the Foggy test set to evaluate the reward model in this setting. The results show an accuracy of 76%, which is lower than on the original test set but remains within an acceptable range. This suggests that while the model experiences some performance degradation on unseen object types, it still maintains reasonable robustness. We added these results to Section 5.3.
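The per-bin evaluation referenced in change note 5 can be summarized with a small sketch like the one below. This is an illustrative outline only, not the released evaluation code; the helper name `accuracy_by_segment_count`, the data layout, and the toy samples are assumptions.

```python
# Illustrative sketch of reporting reward-model accuracy per sub-segment-count
# bin, as in change note 5 (assumed helper names and data layout).

from collections import defaultdict
from typing import Dict, List, Tuple

Bin = Tuple[int, int]  # inclusive (lo, hi) range of sub-segment counts


def accuracy_by_segment_count(samples: List[Tuple[int, bool]],
                              bins: List[Bin]) -> Dict[Bin, float]:
    """`samples` holds (num_sub_segments, prediction_is_correct) pairs;
    returns accuracy for each bin that contains at least one sample."""
    hits: Dict[Bin, int] = defaultdict(int)
    totals: Dict[Bin, int] = defaultdict(int)
    for n_segments, correct in samples:
        for lo, hi in bins:
            if lo <= n_segments <= hi:
                totals[(lo, hi)] += 1
                hits[(lo, hi)] += int(correct)
    return {b: hits[b] / totals[b] for b in bins if totals[b] > 0}


if __name__ == "__main__":
    # Toy data: (sub-segment count, whether the reward model was correct).
    toy = [(6, True), (10, True), (14, False), (16, True), (18, False), (20, True)]
    # Bins mirror the ranges mentioned in the change note.
    print(accuracy_by_segment_count(toy, bins=[(6, 14), (15, 20)]))
```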
Assigned Action Editor: ~Hanwang_Zhang3
Submission Number: 4265