Keywords: vision language model, multimodal, reward model
Abstract: Although reward models have been successful in improving multimodal large language models, the reward models themselves remain coarse and carry minimal information. Notably, existing reward models only mimic human annotations by assigning a single feedback signal to a text, no matter how long the text is. In the realm of multimodal language models, where models must process both images and text, a naive reward model may learn implicit biases toward text and become less grounded in images. In this paper, we propose a **T**oken-**L**evel **D**etective **R**eward Model (**TLDR**) that provides fine-grained annotations for each text token. We first introduce a perturbation-based model to generate synthetic hard negatives for training TLDR models. We then show the usefulness of TLDR models in assisting off-the-shelf models to self-correct their generations, in serving as a hallucination evaluation tool, and in improving the backbone VLM through token-level likelihood optimization. Finally, we show that TLDR models can significantly speed up human annotation to acquire a broader range of high-quality vision language data.
Primary Area: foundation or frontier models, including LLMs
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Submission Number: 1023