Keywords: multimodal large language models, requirement-aware reasoning, reinforcement learning
TL;DR: Current MLLMs often fail at requirement-aware reasoning. We introduce an RL method that trains MLLMs to prioritize must-haves before nice-to-haves, improving both accuracy and generalization.
Abstract: Recent progress in multimodal large language models (MLLMs) has fueled significant enthusiasm for their potential to act as autonomous agents for real-world tasks. However, scenarios requiring agents to fulfill users’ complex, structured requirements remain largely underexplored. In this work, we examine reasoning tasks under three distinct requirement scenarios, each defined by the feasible solution set delineated by must-have and nice-to-have requirements: (i) the must-have requirements determine a unique feasible solution; (ii) multiple candidate solutions satisfy the must-have requirements and are prioritized via the nice-to-have requirements; and (iii) no candidate solution satisfies the must-have requirements, in which case the agent should abstain from generating a response. We evaluate state-of-the-art MLLMs on 3,649 carefully constructed problems that reflect realistic service scenarios, including e-commerce platforms, booking systems, and map-based or ride-hailing applications. Our evaluation reveals that existing MLLMs exhibit catastrophic failures in all scenarios. Specifically, these models frequently misinterpret task requirements, violate must-have requirements, and produce invalid solutions. To address this critical gap, we propose First Things First Reinforcement Learning (FTF-RL), which explicitly optimizes reasoning over multi-priority user requirements. Experimental results show that our method substantially improves the task success rate compared to strong baselines. Moreover, FTF-RL is also effective on popular logical and mathematical reasoning tasks, including LogicVista, MathVision, and MathVista. Our findings suggest that enhancing requirement comprehension provides a simple yet effective pathway toward improving the broad generalization of MLLMs. Code and evaluation data are available at anonymity.
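
The three requirement scenarios in the abstract can be illustrated with a minimal sketch. It is not the authors' method or benchmark code: the `select` function, the dictionary-based candidates, and the rule of ranking feasible candidates by how many nice-to-have requirements they satisfy are all assumptions made for illustration only.

```python
from typing import Callable, Optional, Sequence

# Hypothetical representation: a candidate is a plain dict of attributes,
# and a requirement is a predicate over a candidate.
Candidate = dict
Requirement = Callable[[Candidate], bool]


def select(
    candidates: Sequence[Candidate],
    must_haves: Sequence[Requirement],
    nice_to_haves: Sequence[Requirement],
) -> Optional[Candidate]:
    """Pick a candidate, treating must-haves as hard constraints."""
    # Must-haves come first: they define the feasible solution set.
    feasible = [c for c in candidates if all(req(c) for req in must_haves)]

    if not feasible:
        # Scenario (iii): nothing satisfies the must-haves, so abstain.
        return None
    if len(feasible) == 1:
        # Scenario (i): the must-haves determine a unique solution.
        return feasible[0]
    # Scenario (ii): several feasible candidates; rank by the number of
    # nice-to-haves satisfied (an assumed scoring rule; ties keep the
    # earliest candidate).
    return max(feasible, key=lambda c: sum(req(c) for req in nice_to_haves))


# Illustrative data, not taken from the paper's benchmark.
hotels = [
    {"name": "A", "price": 90, "breakfast": False},
    {"name": "B", "price": 120, "breakfast": True},
    {"name": "C", "price": 80, "breakfast": True},
]
wants_breakfast = [lambda h: h["breakfast"]]

unique = select(hotels, [lambda h: h["price"] <= 85], wants_breakfast)
ranked = select(hotels, [lambda h: h["price"] <= 100], wants_breakfast)
abstain = select(hotels, [lambda h: h["price"] <= 50], wants_breakfast)
```

The point of the sketch is the ordering: nice-to-have requirements are consulted only after the must-have requirements have been applied, and an empty feasible set leads to abstention rather than a best-effort answer.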
Primary Area: reinforcement learning
Submission Number: 23507