Group-Relative Visual Discrimination Enhancement for Unlocking Intrinsic Capability of MLLMs

Fang Peng, Xiaoshan Yang, Yaowei Wang, Changsheng Xu

Published: 01 Jan 2026, Last Modified: 27 Jan 2026, IEEE Transactions on Circuits and Systems for Video Technology, CC BY-SA 4.0
Abstract: Although Multimodal Large Language Models (MLLMs) have shown remarkable generalization across diverse vision-language tasks, recent studies reveal their limitations in visual discrimination. These challenges arise not from insufficient model capacity, but from existing training paradigms that favor linguistic priors over detailed visual analysis. While existing approaches address this limitation through external interventions such as feature integration or knowledge augmentation, we propose a Group-Relative Visual Discrimination Enhancement framework that unlocks the intrinsic capability of MLLMs and requires no external resources. Our method introduces a Group-Relative Reinforcement Learning paradigm equipped with a lightweight Visual Patch Selection Plugin to dynamically select discriminative visual tokens. The framework establishes a self-feedback loop between the visual encoder and the language decoder, leveraging dual reward-penalty signals derived from the model's internal language feedback to optimize its visual focus, thereby enhancing the model's visual discrimination capabilities. Extensive experimental results across six visual recognition benchmarks and two VQA benchmarks demonstrate the effectiveness of our method. Code is available at https://github.com/FannierPeng/GROVE.
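The "dual reward-penalty signals" in a group-relative RL paradigm can be sketched as follows. This is a minimal illustration, not the paper's implementation: it only shows the common group-relative normalization (as in GRPO-style baselines), where rewards for a group of sampled responses are centered on the group mean, so samples above the mean receive a positive (reward) signal and those below it a negative (penalty) signal. The function name and the reward values are hypothetical.

```python
# Minimal sketch of group-relative advantage computation (assumption:
# GROVE's reward design is not specified here; this only illustrates
# the generic group-relative normalization used in such RL paradigms).
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Center a group of scalar rewards on the group mean and scale
    by the group standard deviation: positive values act as rewards,
    negative values as penalties."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

For example, rewards of [1.0, 2.0, 3.0] within one group yield a negative advantage for the first sample and a positive one for the last, while the advantages sum to zero across the group.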