DEVA: Fine-tuning Multimodal Large Language Models for Visual Perception Tasks

ICLR 2026 Conference Submission 20640 Authors

19 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0
Keywords: fine-tuning, multimodal large language model, domain adaptation, visual perception
Abstract: Fine-tuning large language models (LLMs) using reinforcement learning (RL) objectives has gained traction, especially in scenarios where labeled data is limited. Building on its success in the language domain, recent efforts have extended RL-based fine-tuning to multimodal tasks. Visual-RFT, for instance, applied Group Relative Policy Optimization (GRPO) to fine-tune multimodal LLMs (MLLMs) across various visual perception benchmarks, achieving notable improvements over standard supervised fine-tuning (SFT). However, its scope was limited by a narrow evaluation of RL adaptation strategies. In this work, we expand the landscape by introducing new RL-based baselines on the same benchmarks and conducting a deeper analysis of GRPO’s training dynamics. We identify key limitations—such as reduced generation diversity, constrained policy exploration, and suboptimal reward formulation and aggregation. To address these, we propose DEVA: a framework that enhances Diversity via a flow-based training objective, encourages broader policy Exploration through global entropic regularization, and leverages alignment Volume as a non-verifiable reward combined with harmonic Aggregation. Applied to GRPO and other RL methods, DEVA delivers consistent gains in both quantitative (+5 to +13 points) and qualitative metrics. We further provide visualizations, ablations, and analyses to unpack the contributions of each component in our framework.
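To make the reward-side ideas concrete, the sketch below illustrates (under our own assumptions, not the paper's actual implementation) how a harmonic aggregation of per-criterion rewards and a global entropy bonus could be wired into an RL fine-tuning step; the function names, tensor shapes, and the `beta` coefficient are hypothetical.

```python
import torch

def harmonic_aggregate(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Combine per-criterion rewards (shape [batch, num_criteria]) with a harmonic mean.

    Unlike an arithmetic mean, the harmonic mean is dominated by the weakest
    criterion, so a sample cannot score highly by excelling on one reward
    while failing another. (Illustrative assumption, not the paper's exact rule.)
    """
    k = rewards.shape[-1]
    return k / (1.0 / (rewards + eps)).sum(dim=-1)

def entropy_bonus(logits: torch.Tensor) -> torch.Tensor:
    """Mean token-level entropy of the policy distribution (logits: [batch, seq, vocab])."""
    log_probs = torch.log_softmax(logits, dim=-1)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1)  # [batch, seq]
    return entropy.mean()

# Hypothetical use inside a GRPO-style update:
#   total_reward = harmonic_aggregate(per_criterion_rewards)
#   loss = policy_loss - beta * entropy_bonus(policy_logits)
```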
Supplementary Material: pdf
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 20640