Keywords: Embodied Agent, Multi-modal Large Model, Chain-of-Thought, Planning, Reinforcement Learning, Preference Learning
Abstract: Vision-Language Models (VLMs) are increasingly being employed as the decision-making "brains" of embodied agents. However, effectively harnessing their powerful generalization capabilities for dynamic, context-specific tasks remains a significant challenge. Chain-of-Thought (CoT) prompting is often used for complex task execution, but existing methods either rely on static strategies that fail to adapt to changing environments or fine-tune on offline datasets, which are insufficient for optimizing agent decision-making through interaction.
In this paper, we propose a novel approach that optimizes the CoT reasoning process rather than only the final action tokens. By aligning the CoT process through preference-based reinforcement learning, specifically Direct Preference Optimization (DPO), we improve the agent's reasoning adaptability and decision accuracy in dynamic environments while mitigating model degradation during fine-tuning. Our method models the environment as a Markov decision process, requiring the agent to reflect on the current state in real time to generate adaptive plans and actions.
Experiments in the ALFWorld environment demonstrate an average success rate of 26.67%, a 6% improvement over RL4VLM, and show that our method effectively mitigates model degradation after fine-tuning. These results highlight the potential of integrating preference-based reinforcement learning with CoT optimization to enhance the decision-making capabilities of vision-language models in embodied agents.
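For reference, the preference alignment described above builds on the standard DPO objective (Rafailov et al., 2023). The following is a minimal sketch under the assumption that preferences are collected over pairs of CoT sequences; the symbols $s$ (observed state), $c^+$/$c^-$ (preferred/dispreferred CoT), $\pi_{\mathrm{ref}}$ (frozen reference policy), and $\beta$ (preference-strength coefficient) are illustrative notation, not taken verbatim from the paper:
$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}}) \;=\; -\,\mathbb{E}_{(s,\,c^+,\,c^-)\sim\mathcal{D}}\left[\log\sigma\!\left(\beta\log\frac{\pi_\theta(c^+\mid s)}{\pi_{\mathrm{ref}}(c^+\mid s)} \;-\; \beta\log\frac{\pi_\theta(c^-\mid s)}{\pi_{\mathrm{ref}}(c^-\mid s)}\right)\right]
$$
Under this formulation, the policy is pushed to assign higher relative likelihood to the preferred CoT than the reference model does, without an explicit reward model, which is consistent with optimizing the reasoning process rather than only the final action tokens.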
Primary Area: applications to robotics, autonomy, planning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 3653